How Google Analyzes and Weights Web Page Content
In a Duda webinar, Martin Splitt explained a concept called Centerpiece Annotation, which describes how Google analyzes the content of a web page.
I won’t reproduce the question because it’s a bit off-topic and long.
What Martin explains is how Google separates the boilerplate from the rest of a web page and then works out, from the content structure and the text, what the page is about.
He mentions what is called the centerpiece annotation.
Martin Splitt explained:
“It’s just us analyzing the content and, I don’t know what we’ve said publicly about it, but I think I talked about it in one of the podcast episodes.
So I can probably say we have a thing called the centerpiece annotation, for example, and there are a few other annotations we have where we look at semantic content, as well as potentially the layout tree.
But basically, we can already read this from the content structure in HTML and understand “Oh! From all the natural language processing we’ve done on all of this textual content that we’ve gotten, it looks like it’s mostly about topic A, dog food.””
Screenshot of Martin Splitt discussing centerpiece annotation
Next, Martin explains how page analysis separates the web page into components, some of which are irrelevant to the centerpiece.
Parts of the page, he explains, are weighted differently. Weight refers to the importance of a page element. So if a section receives a light weight score, it is not as important as a section that is weighted with a higher score.
“And then there’s this other thing here, which looks like links to related products, but isn’t really part of the centerpiece. It’s not really the main content here. It seems like additional stuff.
And then there’s like a bunch of boilerplate or, “Hey, we figured out that the menu is pretty much the same on all these pages and lists. It kinda looks like this menu that we have on all other pages in this domain,” for example, or we’ve seen this before. We don’t even actually go by domain or like, “Oh, that looks like a menu.”
We determine what looks like boilerplate, and then that’s also weighted differently.”
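Martin does not describe how boilerplate is actually detected, and he notes that Google does not simply go by domain. Purely as an illustration of the general idea, here is a minimal Python sketch of one naive heuristic: treat any text block that repeats on most pages of a site as boilerplate. The function name `find_boilerplate`, the `min_share` threshold, and the sample pages are all hypothetical and are not taken from the webinar.

```python
from collections import Counter


def find_boilerplate(pages, min_share=0.8):
    """Return text blocks that appear on at least min_share of the pages.

    This is an assumed, simplified heuristic for illustration only.
    """
    counts = Counter()
    for blocks in pages.values():
        counts.update(set(blocks))  # count each block once per page
    threshold = min_share * len(pages)
    return {block for block, n in counts.items() if n >= threshold}


# Hypothetical site: each page is a list of text blocks.
pages = {
    "/dog-food": ["Home | Shop | Contact", "Our grain-free dog food ...", "© Example Pets"],
    "/cat-toys": ["Home | Shop | Contact", "Feather toys for cats ...", "© Example Pets"],
    "/about": ["Home | Shop | Contact", "We are a small pet shop ...", "© Example Pets"],
}

boilerplate = find_boilerplate(pages)

# What remains after removing repeated blocks is a rough stand-in for main content.
main_content = {
    url: [block for block in blocks if block not in boilerplate]
    for url, blocks in pages.items()
}
```

In this toy data the menu and the footer repeat on every page, so only the page-specific block is left as candidate main content.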
Off-Topic Content Is Given Less Consideration
Martin then mentions that once Google has established what a web page is about, a section that is off topic isn’t considered as much, presumably for ranking purposes.
“So if you have content on a page that is not related to the main topic of the rest of the content, we may not give it as much attention as you think.
We still use this information for link discovery and determining your site structure and all that.
But if a page has 10,000 words about dog food, and then 3,000, 2,000, or 1,000 words about bikes, then that’s probably not good content for bikes.”
This is really interesting because it seems to show that once Google determines what a page is about, off-topic content may not have a chance of ranking or, as Martin puts it, may not get “as much attention as you think.”
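To put Martin’s numbers in perspective, the short Python sketch below computes each topic’s share of the page’s words, using 10,000 words about dog food and 1,000 words about bikes. Word share is only an assumed stand-in for “weight” here. The webinar does not say which signals Google uses, and `topic_shares` is a hypothetical helper.

```python
def topic_shares(word_counts):
    """Return each topic's fraction of the total word count."""
    total = sum(word_counts.values())
    return {topic: count / total for topic, count in word_counts.items()}


# Word counts from Martin's example (using the 1,000-word bikes case).
sections = {"dog food": 10_000, "bikes": 1_000}

shares = topic_shares(sections)

# The topic with the largest share stands in for the centerpiece topic.
centerpiece = max(shares, key=shares.get)
```

Under this assumption the bikes section makes up roughly 9% of the page, which gives a sense of why it would be a weak candidate for bike-related queries.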
Jason Barnard asked:
“So it looks to me like you’re guessing HTML5 semantics. Does semantic HTML5 help you or are you not interested in it? It’s useless?”
What Jason was referring to was the HTML5 markup that defines the different sections of a web page, like the header, navigation, footer, etc.
Earlier in the discussion, Martin was referring to the analysis of the content structure and the text itself. Here the topic drifts a bit toward HTML5 semantic structure.
Martin answered:
“It helps us, but it’s not the only thing we’re looking for. Yes.”
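For readers unfamiliar with the semantic elements Jason is asking about, here is a small Python sketch that groups a page’s text by the nearest enclosing HTML5 sectioning element, using the standard library’s `html.parser`. It only illustrates how such markup makes page regions explicit. It is not a description of how Google processes pages, and the `SectionTextParser` class and the sample markup are made up for this example.

```python
from html.parser import HTMLParser

# Semantic HTML5 elements that label regions of a page.
SECTION_TAGS = {"header", "nav", "main", "article", "aside", "footer"}


class SectionTextParser(HTMLParser):
    """Collect text fragments keyed by the innermost open semantic element."""

    def __init__(self):
        super().__init__()
        self.stack = []  # open semantic elements, innermost last
        self.text = {}   # section tag -> list of text fragments

    def handle_starttag(self, tag, attrs):
        if tag in SECTION_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        data = data.strip()
        if data and self.stack:
            self.text.setdefault(self.stack[-1], []).append(data)


html = """
<body>
  <header><nav><a href="/">Home</a> <a href="/shop">Shop</a></nav></header>
  <main><h1>Grain-free dog food</h1><p>Why we chose this recipe.</p></main>
  <footer>Example Pets</footer>
</body>
"""

parser = SectionTextParser()
parser.feed(html)
```

After parsing, `parser.text["main"]` holds the heading and paragraph, while the menu links end up under `"nav"` and the site name under `"footer"`.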
An annotation is a note that explains something. A centerpiece is something that is meant to be the center of attention.
A centerpiece annotation appears to be a summary of the page’s main topic.
Martin explains how Google divides the page into different sections and weights the parts outside of the centerpiece annotation differently.
He also mentions that parts of a page that differ from the main topic aren’t given as much consideration, which seems to mean that off-topic content on a page may not be content that can rank.
Essential Rendering Duda Webinar
Watch Martin Splitt explain how Google crawls a webpage at minute 28:42: