How Google search (ranking) may work today
Google has disclosed information about its ranking systems. In this article I want to combine that information with my own thoughts and research, for example in Google patents, and put the pieces of the puzzle together to form an overall picture.
I do not go into individual ranking factors and their weighting in detail, but focus on how the systems function.
Disclaimer: Some statements in this post are my own assumptions, developed from various sources.
- 1 Why should SEOs concern themselves with how search engines / Google work?
- 2 Process steps for information retrieval, ranking and knowledge discovery at Google
- 3 Indexing and Crawling
- 4 Which Google indexes are there?
- 5 Search Query Processing
- 6 Google ranking systems
- 7 How do the different ranking systems work together?
- 8 More posts on how Google works
Why should SEOs concern themselves with how search engines / Google work?
I don’t think it makes sense to deal only with ranking factors and possible optimization tasks without understanding how a modern search engine like Google works. There are many myths and speculations in the SEO industry that are followed blindly by people who lack ranking experience of their own. To assess such myths before acting on them, it helps to understand the basics of how Google functions. This post should help you with that.
Process steps for information retrieval, ranking and knowledge discovery at Google
According to the explanations in the excellent lecture “How Google works: A Google Ranking Engineer’s Story” by Paul Haahr, Google distinguishes between the following process steps:
- Before a query:
  - Analyzing crawled pages
    - Extract links
    - Render contents
    - Annotate semantics
  - Build an index
- Query processing:
  - Query understanding
  - Retrieval and scoring
  - Post-retrieval adjustments
Indexing and Crawling
Crawling and indexing are the basic prerequisites for ranking, but otherwise they have nothing to do with how content is ranked.
Google's bots, also called crawlers, crawl the web continuously. They follow links to find new documents and content. URLs that do not appear in the HTML code, and perhaps (!) URLs entered directly in the Chrome browser, can also be used by Google for crawling.
When Googlebot finds new links, they are collected in a scheduler so that they can be processed later.
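The interplay of link discovery and scheduler described above can be sketched in a few lines of Python. This is only a minimal illustration, not Google's implementation: the `crawl` function, the `fetch` callable and the `TOY_WEB` dictionary are hypothetical names, and the "web" is an in-memory dictionary so that the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href values of all <a> tags in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, fetch, max_pages=100):
    """Breadth-first crawl. `fetch` is a hypothetical callable: url -> HTML or None."""
    frontier = deque([seed_url])  # the "scheduler": URLs waiting to be processed
    seen = {seed_url}             # avoid scheduling the same URL twice
    crawled = []
    while frontier and len(crawled) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:          # unknown or unreachable URL
            continue
        crawled.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return crawled


# Toy in-memory "web" (assumption for illustration only).
TOY_WEB = {
    "https://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "https://example.com/a": '<a href="/b">B</a> <a href="/c">C</a>',
    "https://example.com/b": "",
    "https://example.com/c": '<a href="/">home</a>',
}

print(crawl("https://example.com/", TOY_WEB.get))
```

A real scheduler would prioritize URLs rather than process them strictly in order of discovery. The plain queue is used here only to keep the sketch short.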
Domains are crawled with varying frequency and completeness. In other words, different crawl budgets are allocated to different domains. PageRank used to be an indication of the crawl intensity attributed to a domain. In addition to external links, other factors can include publishing frequency, update frequency and the type of website. News sites that appear in Google News are usually crawled more frequently. According to Google, crawl budget is not a problem for sites with up to around 10,000 URLs. In other words, most websites have no problem being fully crawled.
Indexing takes place in two stages.
- In the first step, the raw HTML code is processed with a parser so that it can be transferred to an index in a resource-saving manner. In other words, the first indexed version of a piece of content is the raw, unrendered HTML. This saves Google time when crawling and thus also when indexing.
- In a second, later step, the indexed HTML version is rendered, i.e. displayed the way the user sees it in a browser.
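The two indexing stages described above can be illustrated with a small sketch. Everything here is an assumption for illustration: the `TwoStageIndex` class, its method names and the `fake_render` function are hypothetical, and the "rendering" is simulated by a string replacement instead of a real headless browser.

```python
import re
from collections import deque
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def tokens(html):
    """Return the set of lowercase terms visible in the given HTML."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return set(re.findall(r"[a-z0-9]+", " ".join(parser.parts).lower()))


class TwoStageIndex:
    def __init__(self, render):
        self.render = render        # hypothetical callable: (url, html) -> rendered HTML
        self.docs = {}              # url -> {"stage": ..., "terms": set of terms}
        self.render_queue = deque()

    def index_raw(self, url, html):
        """Stage 1: index the unrendered HTML and queue the page for rendering."""
        self.docs[url] = {"stage": "raw", "terms": tokens(html)}
        self.render_queue.append((url, html))

    def process_render_queue(self):
        """Stage 2: render queued pages later and replace their index entries."""
        while self.render_queue:
            url, html = self.render_queue.popleft()
            rendered = self.render(url, html)
            self.docs[url] = {"stage": "rendered", "terms": tokens(rendered)}

    def search(self, term):
        return sorted(u for u, d in self.docs.items() if term.lower() in d["terms"])


def fake_render(url, html):
    """Stand-in for a browser: pretends JavaScript filled the empty app container."""
    return html.replace('<div id="app"></div>', '<div id="app">Customer reviews</div>')


index = TwoStageIndex(fake_render)
index.index_raw(
    "https://example.com/shoes",
    '<h1>Shoes</h1><div id="app"></div><script>load("reviews")</script>',
)
print(index.search("reviews"))  # content injected by JavaScript is not indexed yet
index.process_render_queue()
print(index.search("reviews"))  # found once the rendered version is indexed
```

The point of the sketch is only the ordering: content that exists solely after rendering becomes searchable later than content present in the raw HTML.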
If Google has general problems with its crawling and indexing systems, you can follow them in the official Google Search Status Dashboard.
Which Google indexes are there?
At Google, a basic distinction can be made between two types of index.
- The classic search index contains all content that Google can index. Depending on the type of content, Google also distinguishes between so-called vertical indices such as the classic document index (text), image index, video index, flights, books, news, shopping and finance. The classic search index consists of thousands of shards containing millions of websites. Because the shards can be queried in parallel, the top n documents per shard can be compiled very quickly despite the size of the index.
- The Knowledge Graph is Google’s semantic entity index. All information about entities and their relationships to each other is recorded in the Knowledge Graph. Google obtains information about the entities from various sources.
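The idea of querying shards in parallel and merging the top n results per shard can be sketched as follows. This is a toy illustration under stated assumptions: the `SHARDS` data, the function names and the scoring (a simple count of query terms) are hypothetical and say nothing about how Google actually scores documents.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def top_n_in_shard(shard, query_terms, n):
    """Score every document in one shard and keep only that shard's n best hits."""
    scored = (
        (sum(text.lower().split().count(term) for term in query_terms), doc_id)
        for doc_id, text in shard.items()
    )
    return heapq.nlargest(n, (hit for hit in scored if hit[0] > 0))


def search(shards, query, n=3):
    """Query all shards in parallel, then merge the per-shard top lists."""
    terms = query.lower().split()
    with ThreadPoolExecutor() as pool:
        partials = pool.map(lambda shard: top_n_in_shard(shard, terms, n), shards)
        candidates = [hit for partial in partials for hit in partial]
    return heapq.nlargest(n, candidates)


# Three tiny shards standing in for thousands of large ones (illustrative data).
SHARDS = [
    {"d1": "google ranking systems explained", "d2": "cooking pasta at home"},
    {"d3": "ranking ranking factors for google search", "d4": "travel guide"},
    {"d5": "how google crawls the web"},
]

print(search(SHARDS, "google ranking", n=2))
```

Because each shard returns at most n candidates, the final merge only has to look at n times the number of shards, no matter how many documents the shards contain.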
Using natural language processing, Google is increasingly able to extract information from unstructured sources such as search queries and online content in order to identify entities or assign data to entities. With MUM, Google can use not only text sources for this, but also images, videos and audio.
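An entity index of this kind is often modeled as a set of subject-predicate-object triples. The following sketch shows that general idea only. The `TripleStore` class and the predicate names are hypothetical and are not taken from any Google documentation.

```python
from collections import defaultdict


class TripleStore:
    """Minimal entity store: facts are (subject, predicate, object) triples."""

    def __init__(self):
        self._by_subject = defaultdict(list)
        self._by_object = defaultdict(list)

    def add(self, subject, predicate, obj):
        self._by_subject[subject].append((predicate, obj))
        self._by_object[obj].append((subject, predicate))

    def facts_about(self, entity):
        """All (predicate, object) pairs recorded for an entity."""
        return list(self._by_subject.get(entity, []))

    def related_to(self, entity):
        """Entities connected to the given one in either direction."""
        outgoing = {obj for _, obj in self._by_subject.get(entity, [])}
        incoming = {subj for subj, _ in self._by_object.get(entity, [])}
        return outgoing | incoming


kg = TripleStore()
kg.add("Google", "founded_by", "Larry Page")
kg.add("Google", "founded_by", "Sergey Brin")
kg.add("Google", "headquartered_in", "Mountain View")
kg.add("Larry Page", "instance_of", "Person")

print(kg.facts_about("Google"))
print(kg.related_to("Larry Page"))
```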
For data mining, Google can use both a query processor and a kind of entity processor or semantic entity API that sits between the classic search index and the Knowledge Graph (see also the Google patent “Search Result Ranking and Representation”).
Search Query Processing
The magic of interpreting search terms happens in search query processing. The following steps are important here:
- Identification of the thematic ontology in which the search query moves. If the thematic context is clear, Google can select a content corpus of text documents, videos, images … as potentially suitable search results. This is particularly difficult with ambiguous search terms. More on that in my post “Knowledge Panels & SERPs for ambiguous search queries”.
- Identification of entities and their meaning in the search term (named entity recognition)
- Semantic annotation of the search query
- Refinement of the search term
- Understanding the semantic meaning of a search query
- Identification of the search intention
I deliberately distinguish between the semantic meaning of a search query and the search intention here, since the search intent can vary depending on the user and can even change over time, while the lexical semantic meaning remains the same.
For certain search queries, such as those with obvious misspellings or synonyms, query refinement takes place automatically in the background. When Google is not sure whether a query contains a typo, you as a user can trigger the refinement manually. With query refinement, a search query is rewritten in the background so that its meaning can be interpreted better.
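A very simple form of such a refinement step, the correction of obvious misspellings, can be sketched with Python's standard `difflib` module. This is only an illustration of the principle: the `VOCABULARY` list, the `refine_query` function and the similarity cutoff are assumptions, and real systems use far richer signals than string similarity.

```python
import difflib

# Hypothetical vocabulary of known terms (in practice derived from the index and query logs).
VOCABULARY = ["google", "ranking", "search", "engine", "index", "crawler"]


def refine_query(query, vocabulary=VOCABULARY, cutoff=0.8):
    """Rewrite unknown terms to their closest known term, if one is similar enough.

    Returns the refined query and a flag that tells whether anything was changed.
    """
    refined = []
    changed = False
    for term in query.lower().split():
        if term in vocabulary:
            refined.append(term)
            continue
        match = difflib.get_close_matches(term, vocabulary, n=1, cutoff=cutoff)
        if match:
            refined.append(match[0])
            changed = True
        else:
            refined.append(term)  # no confident correction: keep the term as typed
    return " ".join(refined), changed


print(refine_query("gogle rankng"))
print(refine_query("search engine"))
```

The `changed` flag mirrors the distinction made above: a confident rewrite can happen silently, while an uncertain one could instead be offered to the user as a suggestion.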
In addition to query refinement, query processing also involves query parsing, which enables the search engine to better understand the search query. Search queries are rewritten in such a way that results can be delivered that match not only the search query itself but also related search queries.
Search query processing can be performed according to the classical keyword-based term x document matching or according to an entity-based approach, depending on whether entities occur in the search query and are already recorded or not.
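The classical keyword-based variant can be illustrated with an inverted index and a simple TF-IDF-style score. This is a generic textbook sketch, not Google's method. The function names, the toy documents and the exact weighting formula are assumptions, and the entity-based approach is not covered here.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Build an inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(text.lower().split()).items():
            index[term][doc_id] = tf
    return index


def score(query, index, n_docs):
    """Sum tf * idf over all query terms and return documents by descending score."""
    scores = defaultdict(float)
    for term in query.lower().split():
        postings = index.get(term, {})
        if not postings:
            continue
        # Rare terms get more weight than terms that occur in many documents.
        idf = math.log(1 + n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


DOCS = {
    "d1": "google ranking systems",
    "d2": "google search index",
    "d3": "pasta recipes",
}

index = build_index(DOCS)
for doc_id, value in score("google ranking", index, len(DOCS)):
    print(doc_id, round(value, 3))
```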
You can find a detailed description of search query processing in the article “How does Google understand search queries through Search Query Processing?”
Google ranking systems
Google distinguishes between the following ranking systems:
- AI ranking systems
- Crisis information systems
- Deduplication systems
- Exact match domain system
- Freshness system
- Helpful content system
- Link analysis systems and PageRank
- Local news systems
- Neural matching
- Original content systems
- Removal-based demotion systems
- Page experience system
- Passage Ranking system
- Product review system
- Reliable information system
- Site diversity system
- Spam detection system
- Retired systems
  - Hummingbird (has been further developed)
  - Mobile-friendly ranking system (now part of the page experience system)
  - Page speed system (now part of the page experience system)
  - Panda system (part of the core system since 2015)
  - Penguin system (part of the core system since 2016)
  - Secure sites system (now part of the page experience system)
These ranking systems are used in various process steps of the Google search.
How do the different ranking systems work together?
Finally, I will try to bring the large amount of information Google has published about how its search engine works into an overall picture.
A Query Processor is responsible for the interpretation of search queries, the identification of search intention, query refinement, query parsing and the matching of search terms to documents.
The Entity Processor or Semantic API forms the interface between the Knowledge Graph and the classic search index. It can be used for named entity recognition and for data mining for the Knowledge Graph or Knowledge Vault, e.g. via natural language processing. More on that in the post “Natural Language Processing to build a semantic database”.
The ranking itself is the responsibility of a Scoring Engine, an Entity and Sitewide Qualifier and a Ranking Engine. When it comes to ranking factors, Google distinguishes between query-dependent factors (e.g. keywords, proximity, synonyms…) and query-independent factors (e.g. PageRank, language, page experience…). I would additionally differentiate between document-related ranking factors and domain-related or entity-related ranking factors.
In the Scoring Engine, a relevance assessment takes place at the document level in relation to the search query. The Entity and Sitewide Qualifier evaluates the publisher and/or author as well as the quality of the content as a whole, in relation to the topics and UX of the website or of sections of it.
The Ranking Engine brings together the scores from the Scoring Engine and the Entity and Sitewide Qualifier and ranks the search results.
A Cleaning Engine sorts out duplicate content and removes content that has received a penalty from the search results.
Finally, a Personalization Layer takes into account factors such as the search history or, in the case of regional search intentions, the location and other local ranking factors.
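To make the interplay of these components more tangible, here is a deliberately simplified sketch of the model described above. The component names follow this article. All function signatures, the `Doc` fields, the scores and the boost value are hypothetical assumptions and not known details of Google's systems.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    url: str
    site: str
    text: str
    region: str = ""
    penalized: bool = False


def scoring_engine(query, doc):
    """Query-dependent relevance: share of the query terms found in the document."""
    terms = query.lower().split()
    words = set(doc.text.lower().split())
    return sum(term in words for term in terms) / len(terms) if terms else 0.0


def sitewide_qualifier(doc, site_quality):
    """Query-independent quality of the publisher or site (assumed lookup table)."""
    return site_quality.get(doc.site, 0.5)


def ranking_engine(query, docs, site_quality):
    """Combine both scores and sort the documents by the combined score."""
    scored = [
        (scoring_engine(query, d) * sitewide_qualifier(d, site_quality), d)
        for d in docs
    ]
    return sorted((s for s in scored if s[0] > 0), key=lambda s: -s[0])


def cleaning_engine(ranked):
    """Drop penalized documents and exact duplicates of better-ranked documents."""
    seen_texts = set()
    cleaned = []
    for score, doc in ranked:
        if doc.penalized or doc.text in seen_texts:
            continue
        seen_texts.add(doc.text)
        cleaned.append((score, doc))
    return cleaned


def personalization_layer(ranked, user_region, boost=1.2):
    """Boost documents that match the user's region, then sort again."""
    adjusted = [
        (score * boost if user_region and doc.region == user_region else score, doc)
        for score, doc in ranked
    ]
    return sorted(adjusted, key=lambda s: -s[0])


def search(query, docs, site_quality, user_region=""):
    ranked = ranking_engine(query, docs, site_quality)
    final = personalization_layer(cleaning_engine(ranked), user_region)
    return [doc.url for _, doc in final]


DOCS = [
    Doc("a.example/pizza", "a.example", "best pizza in town", region="berlin"),
    Doc("b.example/pizza", "b.example", "best pizza recipes"),
    Doc("c.example/copy", "c.example", "best pizza recipes"),  # duplicate of b
    Doc("d.example/spam", "d.example", "best pizza best pizza", penalized=True),
]
SITE_QUALITY = {"a.example": 0.7, "b.example": 0.8, "c.example": 0.6, "d.example": 0.9}

print(search("best pizza", DOCS, SITE_QUALITY))
print(search("best pizza", DOCS, SITE_QUALITY, user_region="berlin"))
```

In the toy data, the penalized page and the duplicate are removed by the cleaning step, and the regional boost changes the order of the two remaining results for a user in Berlin.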
Does that sound logical? If so, I’m happy if you share the knowledge.
More posts on how Google works
Not enough? I have been working intensively with books, Google sources and Google patents on modern search engine technologies since 2014. Here is a selection of articles I have written about it:
- Series of articles on semantic SEO (German only)
- All you should know as an SEO about entity types, classes and attributes
- How Google can identify and interpret entities from unstructured content
- Google’s journey to a semantic search engine
- How Google can identify and rank relevant documents via entities, NLP & vector space analysis
- Insights from the whitepaper “How Google fights misinformation” on E-A-T and Ranking
- What is semantic search: A deep dive into entity based search
- How Google uses NLP to better understand search queries, content
- Entities and E-A-T: The role of entities in authority and trust
- 14 ways Google may evaluate E-A-T
- Most interesting Google patents for SEO from 2022
- How does Google search (ranking) may be working today - 4. January 2023