Author: Olaf Kopp
Reading time: 28 Minutes

Google API Leak: Ranking factors and systems

If you delve a little deeper into the Google API leak from 2024, you will come across some frameworks and attributes that can be directly related to Google ranking. In this article, I would like to give a brief overview of the most interesting frameworks, systems and metrics. It is a starting point for navigating the leak yourself; alternatively, you can analyze the leak data with the Google API Leak Analyzer.

In this article you get the relevant entry points.

Ranking-relevant systems and frameworks

The ranking-relevant frameworks mentioned in the leaked Google documents include:

  1. Mustang refers to various internal data structures and processes used by Google for handling, storing, and processing web references, rank embeddings, sentiment analysis, and snippet generation. These elements are integral to Google’s internal systems for organizing and ranking web content.
  2. UDR is essentially an updated and more versatile version of Mustang, offering the same functionalities but with enhancements in ease of use, integration, and future-proofing. Migration to UDR from Mustang is encouraged by Google.
  3. Pianno is a sophisticated system for handling high-level page intents and detailed entity representations, providing precise and adaptable data management solutions.
  4. Kgraph is an advanced system for encoding and managing complex IQL expressions and annotations, designed to integrate seamlessly with other data annotation tools used by Google.
  5. WebRef: This framework is primarily concerned with entity annotations and their relevance. It handles explicit and implicit entity mentions, calculates confidence scores, and provides topicality scores to determine how related an entity is to the document’s main topic​​​​.
  6. Grounding Ranker (HGR): This includes features like Grounding Provider Features, which evaluate and rank providers based on a quality score. The HGR uses signals from various sources, including Scubed, a regression model incorporating multiple signals​​.
  7. Horizontal Grounding Ranker (HGR): This system evaluates policy-relevant properties of data objects and includes various signals for ranking intents and providers. It uses scores like the Provider Quality Score (PSL) for ranking purposes​​.
  8. RankEmbed: This framework focuses on embedding-based ranking features for video content, evaluating similarities between queries and content, and applying various thresholds to determine relevance and ranking. It stands to reason that RankEmbed is closely related to the RankEmbedBERT mentioned by Pandu Nayak in the antitrust trial.
  9. Video Content Search: This includes various models and features for ranking video content based on criteria such as on-screen text, clustering, and generative topic predictions. It also uses ensemble models and configurations like Dolphin scores for ranking video anchors.
  10. QualityAuthorityTopicEmbeddingsVersionedItem: This model stores versioned topic embedding scores for a website. It includes several attributes that help in assessing the topical focus and quality of a site.
  11. NSR is a term used in Google’s internal documentation to refer to a site-level signal that indicates the quality and trustworthiness of a website. It is part of a comprehensive framework for assessing and scoring the quality of web pages and sites.
  12. Qualityboost: Qualityboost is a system used by Google to improve the quality of search results by applying various metrics and signals.

These frameworks and systems collectively contribute to determining the ranking of content in Google’s search results. They incorporate a wide range of signals, scores, and models to ensure accurate and relevant ranking.

What is Mustang?

Mustang refers to several of Google’s internal systems and data-handling structures. Here are the key points:

  1. RepositoryWebrefWebrefMustangAttachment: This is a deprecated attachment type used to store WebRef entities and IQL expressions in Mustang/TG. The attachment is meant to be compact and is recommended to be accessed through a decoder rather than directly​​.
  2. MustangRankEmbedInfo: This is another type of Mustang attachment used in rank embedding. It includes several encoding types for embedding document data, such as fixed-point encoding and scaled fixed-point encoding​​.
  3. MustangSentimentSnippetAnnotations: These are annotations related to sentiment snippets in Mustang. They provide a structure for storing sentiment information, including snippet text and scores​​.
  4. MustangReposWwwSnippetsCandidateFeature: This represents a candidate feature for snippet generation, containing a name and score​​.
  5. MustangReposWwwSnippetsOrganicListSnippetResponse: This is data used to generate list previews for organic list snippets, containing various attributes like header text, item lists, and scores​​.

In summary, Mustang refers to various internal data structures and processes used by Google for handling, storing, and processing web references, rank embeddings, sentiment analysis, and snippet generation. These elements are integral to Google’s internal systems for organizing and ranking web content.
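
To make the fixed-point and scaled fixed-point encodings mentioned for MustangRankEmbedInfo more tangible, here is a minimal Python sketch. The leak only names the encoding types. The 8-bit width, the function names and the scaling rule are my assumptions for illustration, not Google's implementation.

```python
from typing import List, Tuple


def encode_fixed_point(values: List[float], bits: int = 8) -> List[int]:
    """Quantize floats in [-1, 1] to signed integers (assumed scheme)."""
    max_int = 2 ** (bits - 1) - 1  # 127 for 8 bits
    # Values outside [-1, 1] are clamped before quantization.
    return [round(max(-1.0, min(1.0, v)) * max_int) for v in values]


def decode_fixed_point(codes: List[int], bits: int = 8) -> List[float]:
    """Map integer codes back to approximate floats in [-1, 1]."""
    max_int = 2 ** (bits - 1) - 1
    return [c / max_int for c in codes]


def encode_scaled_fixed_point(values: List[float], bits: int = 8) -> Tuple[float, List[int]]:
    """Store one scale factor plus integer codes, so larger values survive."""
    scale = max((abs(v) for v in values), default=0.0) or 1.0
    return scale, encode_fixed_point([v / scale for v in values], bits)


def decode_scaled_fixed_point(scale: float, codes: List[int], bits: int = 8) -> List[float]:
    """Undo the scaling after decoding the integer codes."""
    return [scale * v for v in decode_fixed_point(codes, bits)]
```

The point of both variants is the same: an embedding of floats shrinks to one small integer per dimension, at the cost of a small rounding error.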

What is UDR?

UDR (Universal Data Representation) is considered a replacement for Mustang. Here are some key comparisons based on the leaked documents:

  1. Feature Parity and Migration:
    • UDR has all the features that Mustang offers and can be used similarly. This includes consuming topical entities and IQL expressions.
    • Google has provided guidelines for migrating from Mustang to UDR with minimal changes, ensuring that the transition is smooth for existing systems​​​​.
  2. Usage Recommendations:
    • Mustang attachments, specifically for WebRef entities and IQL expressions, are deprecated, and new uses are not accepted. Legacy use cases are allowed but not recommended. UDR is recommended for new implementations and migration of existing Mustang-based implementations​​.
  3. Data Encoding and Storage:
    • Both systems aim to store data compactly, using packed repeated fields and variable-length integers to minimize space usage. However, UDR is designed with improved readability and extensibility compared to Mustang​​.
  4. Accessibility and Extensions:
    • While Mustang data structures were not recommended to be accessed directly, and a specific decoder library was recommended, UDR provides more accessible interfaces and is integrated with Google’s newer data management tools and systems​​.

In conclusion, UDR is essentially an updated and more versatile version of Mustang, offering the same functionalities but with enhancements in ease of use, integration, and future-proofing. Migration to UDR from Mustang is encouraged by Google.
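
The phrase "packed repeated fields and variable-length integers" reads like the standard Protocol Buffers wire format. Assuming that is what is meant, the following Python sketch shows how base-128 varints keep small numbers small. The function names are mine, and the leak does not show the actual encoder.

```python
from typing import List


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    if n < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        byte = n & 0x7F          # lowest 7 bits
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def pack_repeated(values: List[int]) -> bytes:
    """Concatenate varints, similar in spirit to a packed repeated field."""
    return b"".join(encode_varint(v) for v in values)


def unpack_repeated(data: bytes) -> List[int]:
    """Decode a byte string of concatenated varints."""
    values, current, shift = [], 0, 0
    for b in data:
        current |= (b & 0x7F) << shift
        if b & 0x80:
            shift += 7
        else:
            values.append(current)
            current, shift = 0, 0
    return values
```

Three small IDs such as 1, 2 and 3 take three bytes in total here, instead of twelve bytes as fixed 32-bit integers.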

What is Pianno?

Pianno is a system used by Google to handle page-level intents and compound entity representations. Here are its main functions and features:

  1. Page-Level Intents:
    • Pianno is designed to interpret the high-level purpose or topic of a web page. This helps in understanding what the page is about overall​​.
  2. Compound Entity Representations:
    • It supports complex data structures that represent entities in a detailed and interconnected manner. This is useful for tasks requiring a nuanced understanding of relationships and attributes among entities​​.
  3. IQL Expressions and Annotations:
    • Pianno works with IQL (Indexing Query Language) expressions, adding confidence scores and a sophisticated bit-mapping system to indicate top-level intents. This results in more precise annotations of web data​​​​.
  4. Integration with WebRef and Other Applications:
    • It integrates IQL expressions annotated by WebRef and other applications, enhancing the encoding of complex relationships and attributes​​.
  5. Future Use Cases:
    • Pianno is designed with future enhancements in mind, such as advanced entity linking and representation tasks. This forward-looking design makes it more adaptable to evolving needs compared to other systems like Mustang​​.

Overall, Pianno is a sophisticated system for handling high-level page intents and detailed entity representations, providing precise and adaptable data management solutions.

What is Kgraph?

Kgraph is a system used by Google for encoding IQL (Indexing Query Language) expressions and integrating them with annotations from various applications such as Webref and Pianno. Here are its key features and functionalities:

  1. IQL Expressions and Annotations:
    • Kgraph is used to encode IQL expressions that are annotated by Webref, Pianno, and other applications. These expressions are used for detailed data representation and handling within Google’s systems​​.
  2. Prototype Implementation:
    • The system is currently in a prototype phase, not yet fully deployed in production. It is designed to handle specific use cases such as Pianno page-level intents and compound entity representations​​.
  3. Attachment Encoding:
    • Kgraph attachments encode IQL FunctionCalls as byte arrays, which are then compressed. These attachments include various metadata such as confidence scores and bitmaps indicating the top-level intents for Pianno​​.
  4. Complex Relationship Handling:
    • Similar to Mustang, Kgraph supports the detailed representation of relationships and attributes among entities, enhancing the overall capability to manage and process web data efficiently​​.

In summary, Kgraph is an advanced system for encoding and managing complex IQL expressions and annotations, designed to integrate seamlessly with other data annotation tools used by Google.
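
As a rough illustration of the attachment encoding described above, the following Python sketch compresses a list of expressions into a byte array and marks top-level intents in a bitmap. Everything concrete here is an assumption on my side: the `Attachment` class, the use of zlib, the newline-separated payload and the example expressions are hypothetical and only mirror the description in the leak.

```python
import zlib
from dataclasses import dataclass
from typing import List


@dataclass
class Attachment:
    compressed_calls: bytes   # compressed byte array of expressions
    confidences: List[int]    # one confidence per expression, scaled to 0..100
    top_level_bitmap: int     # bit i is set if expression i is a top-level intent


def build_attachment(calls: List[str], confidences: List[float],
                     top_level: List[bool]) -> Attachment:
    """Pack expressions, confidences and top-level flags into one object."""
    payload = "\n".join(calls).encode("utf-8")
    bitmap = 0
    for i, flag in enumerate(top_level):
        if flag:
            bitmap |= 1 << i
    scaled = [round(c * 100) for c in confidences]
    return Attachment(zlib.compress(payload), scaled, bitmap)


def top_level_calls(att: Attachment) -> List[str]:
    """Decompress the payload and keep only expressions flagged in the bitmap."""
    calls = zlib.decompress(att.compressed_calls).decode("utf-8").split("\n")
    return [c for i, c in enumerate(calls) if (att.top_level_bitmap >> i) & 1]
```

A bitmap is a compact way to store one yes/no flag per expression: for three expressions where the first and third are top-level intents, the bitmap is simply the number 5 (binary 101).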

What is the doc joiner?

The term “doc joiner” appears multiple times across different documents and contexts. Here are the key points related to the doc joiner:

  1. Instant Doc Joiner: The doc joiner is mentioned in the context of FreshDocs and Hotdocs, specifically referring to a process that instantly joins new or updated documents into the index. This allows for real-time updates and indexing of documents as they are crawled​​.
  2. Field Propagation: The doc joiner is responsible for propagating certain fields, such as crawl pageranks from original canonical documents to the chosen canonicals. This internal process helps maintain the integrity and accuracy of document rankings during indexing​​.
  3. Data Versioning: There is also mention of indexing doc joiner data versions, indicating that this process handles different versions of data for documents, ensuring that the most current and relevant version is used for indexing and retrieval​​.

These points illustrate that the doc joiner is a critical component in Google’s indexing system, ensuring that documents are up-to-date, properly ranked, and accurately represented in the search index.
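
The field propagation can be pictured with a small Python sketch. The `Doc` class, the field names and the rule "keep the highest value" are assumptions of mine. The leak only says that crawl pageranks are propagated from original canonicals to the chosen canonical, not how conflicts are resolved.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Doc:
    url: str
    canonical_url: str                  # URL of the chosen canonical
    crawl_pagerank: Optional[float] = None


def propagate_crawl_pagerank(docs: List[Doc]) -> Dict[str, Doc]:
    """Copy crawl pagerank from each document to its chosen canonical."""
    by_url = {d.url: d for d in docs}
    for d in docs:
        target = by_url.get(d.canonical_url)
        if target is None or target is d or d.crawl_pagerank is None:
            continue
        # Assumed conflict rule: the chosen canonical keeps the highest value.
        if target.crawl_pagerank is None or d.crawl_pagerank > target.crawl_pagerank:
            target.crawl_pagerank = d.crawl_pagerank
    return by_url
```

In this reading, a URL variant that loses the canonical selection does not take its signals with it. They are handed over to the document that ends up in the index.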

What is FreshDocs?

FreshDocs refers to a component in Google’s internal indexing system that deals with instant document joining. Specifically, FreshDocs is involved in the real-time processing and updating of documents as they are crawled and indexed. It is mentioned in the context of various attributes and processes related to document indexing and ranking.

Here are the key points related to FreshDocs:

  1. Instant Doc Joiner: FreshDocs is set by the instant doc joiner, which is a process that instantly joins new or updated documents into the index​​.
  2. Crawler Context: FreshDocs doc joins do not populate certain fields because it is assumed that these fields are not needed from FreshDocs doc joins​​.
  3. Hotdocs: FreshDocs is associated with the indexing of “hotdocs,” which are likely important or frequently updated documents​​.
  4. Toolbar Data: FreshDocs also involves collecting toolbar data for indexing purposes, such as the number of distinct toolbar visitors a page had in the past day​​.

These points illustrate that FreshDocs is a crucial part of Google’s effort to keep its index updated with the most recent and relevant documents by instantly processing and integrating new data.

What is sourcetype?

The term “sourcetype” appears in several contexts within the leaked documents. Here are some relevant excerpts:

  1. Image License Info: The sourcetype for copyright_notice and credit_text fields is recorded as part of the image license information​​.
  2. Entity Link Source: It is also mentioned in the context of an entity link source where the type attribute can be referred to as sourcetype​​.
  3. Repository Webref Preprocessing URL Source Info: Describes information about where a URL comes from, including a sourcetype attribute​​.

These mentions of sourcetype indicate that it is used to classify or provide metadata about the source of specific pieces of information or content within Google’s internal data models. This classification helps in tracking and managing data origins, which can be crucial for both data integrity and processing purposes.

What are providers?

In the context of the leaked documents, “providers” refer to entities that offer specific types of services or content, often used in various ranking and selection processes within Google’s ecosystem. Here are some specific references to “providers” from the documents:

  1. Media Provider Info:
    • Media providers are entities that offer deeplinks to media content like radio stations, films, music, and TV programs. Each media provider has attributes such as a unique key, media ID, provider MID, and provider name​​.
  2. Provider Features for Ranking:
    • Providers have features extracted for ranking purposes, such as cluster IDs, provider ID, provider signal results, and a quality score that can be used to rank providers​​.
  3. Live TV Providers:
    • Live TV providers are categorized by attributes like provider info, provider key, and provider type, which help in identifying and differentiating various OTT or tuner-based providers​​.
  4. Cloud Provider Information:
    • For cloud-based services, providers include attributes such as a directory URL, logo URL, and a user-visible name, which are important for identifying and interacting with third-party services within Google’s Assistant platform​​.
  5. Default and Foreground Providers:
    • Providers are also marked by attributes indicating if they are the default or foreground provider for a particular task, if they are installed on the device, or if they were the last used provider for a specific intent​​.

These references highlight the multifaceted roles providers play in Google’s systems, from content delivery to user interaction facilitation.

How are provider quality scores calculated?

Provider quality scores are calculated using various attributes and signals that are gathered and processed within Google’s ranking and scoring systems. Here are some key components involved in the calculation of provider quality scores:

  1. Provider Features:
    • Features are extracted from providers for ranking purposes. These include cluster IDs, provider IDs, provider signal results, and the provider’s quality score. The quality score is used for ranking and incorporates both policy rules and quality considerations​​​​.
  2. Quality Signal Results:
    • Provider signals are extracted from the provider’s general properties (GP). These signals contribute to the overall provider quality score, which can range from 0 to 1 and is used for ranking providers​​.
  3. Experimental Data and Overrides:
    • The quality score can be influenced by experimental data and overrides. Experimental NSR team data and other experimental signals are used during live experiments (LEs) to assess the quality impact of new components or changes​​​​.
  4. Demotion and Promotion Factors:
    • Various demotion and promotion factors are considered. For example, BabyPanda demotion and nav demotion factors are converted from QualityBoost values and affect the final quality score. High-quality review pages and product reviews can also receive boosts, influencing the quality score positively​​​​.
  5. Versioned Data:
    • Versioned data scores, such as NSR scores and PairwiseQ scores, are used for continuous evaluation and experimentation with upcoming versions. These scores help in assessing quality impact on various slices​​.

By combining these various factors and signals, Google’s systems can calculate a comprehensive quality score for providers, which helps in effectively ranking and selecting the best providers for users.
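
How such a score could be used for ranking is easy to sketch in Python. The attribute names follow the leak (providerId, pslScore, providerClusterIds), while the class itself, the validation and the plain sort are my assumptions. How Google actually combines the score with other signals is not documented.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProviderFeatures:
    provider_id: str
    psl_score: float                      # provider quality score in [0, 1]
    provider_cluster_ids: List[str] = field(default_factory=list)


def rank_providers(providers: List[ProviderFeatures]) -> List[ProviderFeatures]:
    """Order providers by quality score, best first."""
    for p in providers:
        if not 0.0 <= p.psl_score <= 1.0:
            raise ValueError(f"score out of range for {p.provider_id}")
    return sorted(providers, key=lambda p: p.psl_score, reverse=True)
```

For example, three providers with scores 0.4, 0.9 and 0.7 would come back in the order 0.9, 0.7, 0.4.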

WebRef

The WebRef system is a comprehensive framework for handling web entity annotations, integrating a wide range of metrics to ensure precise and efficient processing of web documents. It employs detailed statistics, metadata, and confidence scores to optimize its entity recognition and linking capabilities. This system is crucial for enhancing search accuracy and relevance by providing a structured understanding of web content.

Key Components and Metrics

  1. Entities and Annotations:
    • Entities: Annotated entities with associated confidence scores and metadata. Entities are sorted by decreasing topicality score.
    • Annotation Stats: Detailed statistics used to tune the WebRef scoring logic based on existing annotations.
    • Category Annotations: Categories of the document or query, which include confidence scores and encoded MIDs (Machine Identifiers).
  2. Document Information:
    • Document Metadata: Includes a range of information such as geo data, video data, shingle info, semantic date info, and more.
    • WebRef Document Info: Metadata related to the document, including details like toolbar pagerank, language codes, and health scores.
  3. Metrics:
    • Confidence Scores: For entities, triples, and categories, confidence scores indicate the certainty of the annotations.
    • Processor Counters: Include various counters related to processor performance, such as the number of CPU instructions and wall time.
    • Anchor and Link Metrics: Metrics related to anchors, such as total anchor counts, redundant anchor counts, and global anchor delta.
  4. Additional Metadata:
    • Link Metadata: Information about entity relationships and link weights.
    • Range Annotations: Annotations for specific ranges within a document, like part-of-speech tags and other range-based information.
    • Triple Annotations: Inferred triples from the document which encode relationships or properties associated with entities.
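
The interplay of confidence and topicality scores can be sketched in a few lines of Python. The `EntityAnnotation` class, the confidence threshold of 0.5 and the example MIDs are hypothetical. Only the idea of sorting entities by decreasing topicality comes from the leak.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EntityAnnotation:
    mid: str            # machine identifier of the entity
    topicality: float   # how central the entity is to the document
    confidence: float   # certainty of the annotation


def top_entities(entities: List[EntityAnnotation],
                 min_confidence: float = 0.5,
                 limit: int = 5) -> List[EntityAnnotation]:
    """Drop uncertain annotations, then sort by decreasing topicality."""
    kept = [e for e in entities if e.confidence >= min_confidence]
    kept.sort(key=lambda e: e.topicality, reverse=True)
    return kept[:limit]
```

The practical takeaway for SEO: an entity only helps to describe a page if it is both recognized with confidence and central to the content.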

WebRef document metadata encompasses a wide range of attributes designed to enhance document classification, relevance, and retrieval. These attributes include geodata, WebRef entities, video data, semantic date information, toolbar PageRank, and several other metadata components crucial for document processing and ranking.

Below are the detailed components and metrics associated with WebRef document metadata:

  1. Geodata:
    • Contains geo-specific information, approximately 24 bytes for 23M U.S. pages​​.
  2. WebRef Entities:
    • Entities associated with the document, managed by the WebRef service​​.
  3. Video Data:
    • Metadata specific to videos within the document​​.
  4. Shingle Information:
    • Information about shingles (short sequences of text) within the document​​.
  5. Semantic Date Information:
    • Encoded data using a SemanticDate-specific format. It includes confidence scores for day, month, and year components, as well as various metadata for freshness evaluations​​.
  6. Ocean Data:
    • Data specific to the Ocean index, about 28 bytes per page​​.
  7. Scaled Link Age Spam Score:
    • A 7-bit integer representing the link age score, ranging from 0 to 127​​.
  8. Number of URLs:
    • Total number of URLs encoded in the URL section, including alternate URLs​​.
  9. YMYL Health Score:
    • Scores from the YMYL (Your Money or Your Life) health classifier, indicating the document’s relevance in critical life aspects like health and finance​​.
  10. Media or People Entities:
    • MIDs of the five most topical entities annotated, useful in detecting cases where search results converge mostly on a single person or media entity​​.
  11. Rosetta Languages:
    • Top two document language BCP-47 codes as generated by the RosettaLanguageAnnotator, ordered by probability​​.
  12. Toolbar PageRank:
    • Copy of the document’s Toolbar PageRank value, ranging from 0 to 10. This metric helps in evaluating the document’s importance and relevance​​.
  13. Image Data:
    • Metadata related to images within the document, including indexing information​​.
  14. Crawler ID Proto:
    • Context applied to the document during crawling, including variations of the crawler ID​​.
  15. Knex Annotation:
    • Indexing annotations for FreshDocs​​.
  16. Scaled Experiment Indy Rank:
    • Experimental ranking data​​.
  17. Premium Data:
    • Additional metadata for premium documents within the Google index​​.
  18. Tag Page Score:
    • A score representing the tag-site-ness of a page, ranging from 0 to 100​​.
  19. Video Corpus DocID:
    • Identifier for the video corpus document​​.
  20. Video Language:
    • Language of the video content, classified by Automatic Language Identification​​.
  21. Travel Good Sites Information:
    • Metadata about reputable travel sites​​.
  22. Extra Data:
    • Additional fields that are not needed during serving, including various classifier results and quality signals​​.
  23. Time Sensitivity:
    • Encoded signal indicating the document’s time sensitivity​​.
  24. Quarantine Information:
    • Bitmask of quarantine-related information, such as whitelist status and URL poisoning data​​.
  25. Phil Data:
    • Metadata related to document characteristics as defined by the PHIL system​​.
  26. Last Significant Update Information:
    • Metadata about the document’s last significant update, indicating the source of the update signal​​.
  27. ToolBar Data:
    • Specific data related to the ToolBar per document​​.
  28. Kaltix Data:
    • Metadata from the Kaltix system, used internally​​.
  29. Book Citation Data:
    • Data on book citations for each webpage, typically about 10 bytes in size​​.
  30. Commercial Score:
    • Measure of the document’s commercial nature. Scores above 0 indicate the document is commercial (i.e., involved in selling something)​​.
  31. Crowding Data:
    • Metadata related to document crowding within search results​​.
  32. Keyword Stuffing Score:
    • A score representing the extent of keyword stuffing, ranging from 0 to 127​​.
  33. Top Petacat Taxonomy ID:
    • ID of the top petacat (top-level category) of the site, used in result/query matching​​.
  34. Blog Data:
    • Specific metadata for blog documents​​.
  35. Original Title Hard Token Count:
    • Number of hard tokens in the document’s original title​​.
  36. Tundra Cluster ID:
    • Clustering information for the Tundra project, stored at the site level​​.
  37. SAFT Language Information:
    • Top document language as generated by SAFT LangID, stored as an integer​​.
  38. NSR Site Chunk:
    • Site chunk information for the NSR system​​.
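
Several of the attributes above are stored as small integers (for example 7-bit values from 0 to 127) or as bitmasks. The following Python sketch shows both ideas. The linear scaling from a 0 to 1 score and the bit positions of the flags are assumptions for illustration. The leak does not document them.

```python
def scale_to_7_bits(score: float) -> int:
    """Map a score in [0, 1] to an integer from 0 to 127 (assumed linear scaling)."""
    clamped = max(0.0, min(1.0, score))
    return round(clamped * 127)


# Hypothetical bit positions for a quarantine-style bitmask.
WHITELISTED = 1 << 0
URL_POISONED = 1 << 1


def has_flag(bitmask: int, flag: int) -> bool:
    """Check whether a given flag bit is set in the bitmask."""
    return (bitmask & flag) != 0
```

Such compact encodings explain why many values in the leak have odd-looking ranges: 127 is simply the largest number that fits into 7 bits.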

Grounding Ranker (HGR)

The Grounding Ranker (HGR) system, used by Google, integrates various features from different providers to rank and improve the relevance of search results and responses in the Google Assistant. Here’s a detailed overview of the system and its metrics based on the leaked documents:

Grounding Ranker (HGR) Features and Metrics

  1. AssistantGroundingRankerContactGroundingProviderFeatures
    • conceptId: Concept ID for relationships in English, populated for relationship-based queries.
    • contactSource: Source of the contact.
    • isRelationshipFromAnnotation: Boolean indicating if the query is a relationship query based on annotation.
    • isRelationshipFromSource: Boolean indicating if the contact has a relationship in metadata.
    • isSingleCandidate: Boolean indicating if there is only a single candidate.
    • isStarred: Boolean indicating if the contact is starred.
    • matchedNameType: Type of the matched name.
    • numAlternateNameFromFuzzyContactMatch: Number of alternate contact names from fuzzy matches.
    • numAlternateNamesFromS3: Number of alternate names from S3_HYPOTHESES.
    • numAlternativeNamesFromInterpretation: Number of alternate names from interpretation.
    • numCandidates: Number of contacts populated by the contact provider.
    • recognitionAlternateSource: Source of recognition alternatives​​.
  2. AssistantGroundingRankerProviderGroundingProviderFeatures
    • providerClusterIds: Cluster IDs for the provider.
    • providerId: ID for the provider in the binding set.
    • providerSignalResult: Processed provider signals.
    • pslScore: Provider quality score used for ranking​​.
  3. AssistantGroundingRankerDeviceGroundingProviderFeatures
    • aggregateAffinity: Aggregate affinity from device contact logs.
    • callAffinity: Affinity based on call logs.
    • messageAffinity: Affinity based on message logs​​.
  4. AssistantGroundingRankerMediaGroundingProviderFeatures
    • albumReleaseType: Release type for an album.
    • ambiguityClassifier: Temporary ambiguity classifier signals.
    • entityMid: MID of the media item.
    • hasCastVideoDeeplink: Boolean indicating if the candidate has a CAST_VIDEO deeplink.
    • hasTypeSemanticEdge: Boolean indicating if the argument’s type was explicitly mentioned.
    • isCastVideo: Boolean indicating if the candidate is a YouTube CAST_VIDEO candidate.
    • isExclusiveOriginalProvider: Boolean indicating if the media item is exclusive to a provider.
    • isMediaSearchQuerySubsetOfEntityNameAndArtist: Boolean indicating if the media search query is a subset of the entity name and artists.
    • isMostRecentSongAlbumAmbiguous: Ambiguity indicator.
    • isSeedRadio: Boolean indicating if the media deeplink has a SEED_RADIO tag.
    • isSeedRadioRequest: Boolean indicating if the user requests seed radio.
    • isSelfReportedSvodProvider: Boolean indicating if the provider is self-reported.
    • isYoutubeMusicSeeking: Indicator for YouTube content seeking music.
    • mediaAccountType: Account type of the user for the provider.
    • mediaContentType: Content type from interpretation​​.
  5. KnowledgeAnswersIntentQueryGroundingSignals
    • addedByGrounding: Boolean indicating if added by grounding.
    • groundabilityScore: Score indicating how grounded the intent is.
    • numConstraints: Sum of the number of constraints used.
    • numConstraintsSatisfied: Sum of the number of constraints satisfied.
    • numGroundableArgs: Number of groundable arguments in the parsed intent.
    • numGroundedArgs: Number of arguments that got grounded.
    • numVariables: Number of arguments that the Grounding Box tried to ground.
    • numVariablesGrounded: Number of arguments that were grounded.
    • pgrpOutputFormat: PGRP output format.
    • provenance: Source provenance.
    • sentiment: Sentiment of the query.
    • usesGroundingBox: Boolean indicating if Grounding Box and PGRP are used​​.
  6. Other Features and Metrics
    • maxHgrScoreAcrossBindingSets: Maximum score assigned by HGR across all intent binding sets.
    • groundingProviderFeatures: General and specific grounding provider ranking features.
    • isNspIntent: Boolean indicating if the interpretation was generated by NSP.
    • kscorerRank: Rank order of the interpretation as determined by kscorer.
    • nspRank: Rank of the intent as reported by NSP.
    • isSageIntent: Boolean indicating if the intent was generated by Sage.
    • intentNamePauis: Intent level Pauis User Interaction Score.
    • isDummyIntent: Indicator for dummy intent.
    • isPlayGenericMusic: Indicator for PlayGenericMusic-type intent.
    • intentName: Name of the intent used by PFR ensemble model.
    • bindingSetInvalidReason: Reason for binding set invalidity.
    • isFullyGrounded: Boolean indicating if the intent is fully grounded.
    • usesGroundingBox: Indicator for grounding box usage.
    • deepMediaDominant: Indicator for deep-media dominance.
    • isHighConfidencePodcastIntent: Indicator for high confidence podcast intent.
    • subIntentType: Type of sub-intent.
    • intentNameAuisScore: Assistant User Interaction Score aggregated using intent name​​.

These features and metrics are integral to Google’s ranking algorithms, helping to ensure that the most relevant and useful results are presented to users based on their queries and interactions with the Assistant.
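
The counters in KnowledgeAnswersIntentQueryGroundingSignals lend themselves to simple ratios. The Python sketch below derives two of them. Note that this is not the groundabilityScore from the leak, whose formula is unknown. The class, the ratio names and the "fully grounded" rule are my assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class GroundingSignals:
    num_groundable_args: int
    num_grounded_args: int
    num_constraints: int
    num_constraints_satisfied: int


def ratio(part: int, whole: int) -> float:
    """Safe division that returns 0.0 when there is nothing to count."""
    return part / whole if whole else 0.0


def summarize(s: GroundingSignals) -> Dict[str, Union[float, bool]]:
    """Derive illustrative ratios from the raw counters."""
    return {
        "grounded_arg_ratio": ratio(s.num_grounded_args, s.num_groundable_args),
        "constraint_ratio": ratio(s.num_constraints_satisfied, s.num_constraints),
        # Assumed rule: every groundable argument was actually grounded.
        "fully_grounded": s.num_groundable_args > 0
        and s.num_grounded_args == s.num_groundable_args,
    }
```

An intent with four groundable arguments of which three were grounded would get a ratio of 0.75 and would not count as fully grounded in this sketch.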

Horizontal Grounding Ranker (HGR)

The HGR system is designed to rank and score various grounding signals and intents generated by the Google Assistant. It uses a variety of features and metrics to determine the relevance and priority of different intents. These features are extracted from various grounding providers and are used to enhance the accuracy and relevance of the Assistant’s responses.

Key Features and Metrics

  1. Max HGR Score Across Binding Sets: This metric represents the maximum score assigned by the HGR across all of the intent’s binding sets​​.
  2. Grounding Provider Features: These include general and specific ranking features related to grounding providers. These features are essential for determining the quality and relevance of the grounding information provided by different sources​​.
  3. Intent-Related Metrics:
    • isNspIntent: Indicates whether the interpretation was generated by the Neural Semantic Parsing (NSP) system​​.
    • kscorerRank: The rank order of the interpretation as determined by the kscorer​​.
    • nspRank: The rank of the intent as reported by NSP​​.
    • isSageIntent: Indicates whether the intent was generated by Sage​​.
    • intentNamePauis: A user interaction score aggregated using the intent name​​.
    • numGroundableArgs: The number of groundable arguments the intent has, populated by the Grounding Box​​.
  4. Feasibility and Grounding:
    • isDummyIntent: Indicates if the intent is a dummy intent used for testing or fallback purposes​​.
    • isPlayGenericMusic: Indicates if the intent is a generic music-playing intent​​.
    • isFullyGrounded: Determines whether the intent is fully grounded with all required information​​.
    • usesGroundingBox: Indicates whether the interpretation should run through the Grounding Box​​.
  5. Content and Source Quality:
    • deepMediaDominant: Indicates whether the intent is dominant according to NSP deep-media analysis​​.
    • isHighConfidencePodcastIntent: Used for manual rule preference for high-confidence podcast intents over generic ones​​.
    • platinumSource: Signifies high-confidence quality if the intent comes from the Sage IntentGenerator’s “platinum” source​​.
  6. Constraint and Variable Metrics:
    • numConstraints: The total number of constraints used by the Grounding Box to ground each variable​​.
    • numGroundedArgs: The number of grounded arguments the intent has​​.
    • numVariables: The number of arguments, possibly nested, that the Grounding Box tried to ground​​.
  7. Ranking Signals:
    • isScoreBasedIntent: Indicates whether this intent relies on PFR (Prefulfillment Ranker) for scoring and pruning to the top intent​​.
    • fulfillableDominantMedia: Determines if the intent is a fulfillable, dominant media intent​​.
    • intentNameAuisScore: Assistant User Interaction Score aggregated using the intent name​​.

These features and metrics collectively enable the HGR system to effectively rank and prioritize intents, ensuring that the most relevant and high-quality responses are provided by the Google Assistant​​​​.
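
The first metric, the maximum HGR score across binding sets, can be sketched as follows. The data layout (a dictionary from intent name to a list of binding-set scores), the intent names and the fallback of 0.0 for intents without binding sets are assumptions for illustration.

```python
from typing import Dict, List


def max_hgr_score(binding_set_scores: List[float]) -> float:
    """Return the best score over all binding sets of one intent."""
    return max(binding_set_scores, default=0.0)


def rank_intents(intents: Dict[str, List[float]]) -> List[str]:
    """Order intent names by their maximum binding-set score, best first."""
    return sorted(intents, key=lambda name: max_hgr_score(intents[name]),
                  reverse=True)
```

With this logic an intent is judged by its best interpretation, not by the average of all its binding sets.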

How does HGR improve relevance?

The Horizontal Grounding Ranker (HGR) improves relevance in Google’s Assistant system through several key mechanisms and features:

1. Max HGR Score Across Binding Sets

The HGR system assigns scores to various intent binding sets, and the maximum score among these is considered. This ensures that the most relevant intent is given higher priority based on comprehensive scoring across multiple binding sets​​.

2. Grounding Provider Features

These features include both general and specific ranking attributes related to grounding providers. By considering these features, the HGR system can accurately assess the quality and relevance of the grounding information provided by different sources. This improves the overall intent ranking by incorporating diverse signals from multiple grounding providers​​.

3. Intent Feasibility and Confidence

The HGR evaluates whether an intent is fully grounded and feasible to execute. This includes checking for the playability of media intents, high confidence podcast intents, and feasibility of fulfilling the binding set. Feasibility features are crucial for determining if an intent can be effectively executed, thus improving the relevance of the results​​.

4. Device and User Interaction Features

The HGR system incorporates features generated by the Device Targeting library and user interaction scores. These features include:

  • Device Targeting Features: Evaluates if the device is selected by Lumos as the target device.
  • User Interaction Scores: Aggregated using intent names to reflect past user interactions, thus boosting intents that align with user behavior and preferences​​.

5. Scoring Based on Popularity and Quality

The HGR uses popularity scores, listener counts for podcasts, and other similar metrics to rank intents. These scores reflect the general acceptance and quality of content, ensuring that more popular and high-quality intents are prioritized​​.

6. Dynamic Adjustment of Intent Ranks

The system can dynamically adjust intent ranks based on experimental flags and specific intent names, allowing for real-time improvements in relevance based on ongoing experiments and updates in the system. This dynamic ranking ensures that the most relevant intents are consistently prioritized​​.

7. Confidence and Validity Checks

Confidence scores from various models, such as the YouTube confidence score, ensure that intents with higher confidence are ranked higher. Validity checks, such as verifying if the intent is a high-confidence podcast or a valid smart home intent, help filter out less relevant results​​.

RankEmbed

The RankEmbed system is designed to enhance the ranking of video content by evaluating the similarities between various entities and queries. It leverages embeddings and similarity scores to rank video content effectively.

Metrics and Attributes

  1. RankEmbed Nearest Neighbors Features
    • anchorReSimilarity: Measures the similarity between the RankEmbed neighbor and the video anchor.
    • navQueryReSimilarity: Measures the similarity between the RankEmbed neighbor and the top navigation boost query of the video.
    • reSimilarity: Measures the similarity between the RankEmbed neighbor and the original query candidate​​.
  2. Video Content Search Features
    • minEntityTopicalityScore: Threshold for considering an entity from a CDoc for sourcing questions on that topic.
    • minQuestionDistance: Threshold for determining whether questions belong in the same cluster.
    • relatedQuestionsSstablePath: Path to the Related Questions SSTable that maps entities to questions.
    • spanDurationSecs: The duration threshold for merging captions​​.
  3. Multimodal Topic Training Features
    • maxFrameSimilarityInterval: Similarity info for the frame with maximum similarity to the topic in its visual interval.
    • normalizedTopic: The topic/query normalized for Navboost and QBST lookups, as well as fetching RankEmbed nearest neighbors.
    • qbstTermsOverlapFeatures: QBST terms overlap features for a candidate query.
    • rankembedNearestNeighborsFeatures: RankEmbed similarity features for a candidate nearest neighbor RankEmbed query.
    • saftEntityInfo: Information about the Saft entity annotation for this topic.
    • topicDenseVector: Raw float feature vector of the topic’s co-text embedding representation in the Starburst space​​.
  4. Mustang RankEmbed Info
    • additionalFixedPointEncodings: Contains repeated elements encoding quantized document embeddings.
    • compressedDocumentEmbedding: Quantized document embedding.
    • fixedPointEncoding: Encodes embedding types and values.
    • scaledFixedPoint4Encoding: Encodes scalar and values for embeddings.
    • scaledFixedPoint8Encoding: Similar to the above but with different encoding.
    • scaledShiftedFixedPoint4Encoding: Encodes scalar and shifted values.
    • versionAndImprovInfo: Contains version info and indexes of potential improvement queries​​.
  5. Webref Entity Metrics
    • confidenceScore: Measures the confidence of entity annotations.
    • topicalityScore: Measures the topical relevance of an entity within the content.
    • segmentMentions: Counts mentions of segments within the document.
    • isResolution: Indicates if the entity annotation is a resolution​​.

These metrics and attributes collectively help in evaluating and ranking video content by understanding the relevance and similarity of various entities and queries within the content. The RankEmbed system utilizes these features to enhance search results and provide more accurate and relevant video recommendations.
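The Mustang RankEmbed attributes mention quantized document embeddings stored as a scalar plus fixed-point values. The leak does not document the exact encoding, so the following Python sketch only shows one common way such a scheme can work: a symmetric 8-bit quantization with a single scale factor per embedding. The function names and the choice of 127 as the integer range are assumptions.

```python
def quantize_8bit(embedding: list[float]) -> tuple[float, list[int]]:
    """Map floats to integers in [-127, 127] plus one shared scale factor."""
    max_abs = max((abs(x) for x in embedding), default=0.0)
    # Assumed scheme: the largest absolute value maps to 127.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    values = [round(x / scale) for x in embedding]
    return scale, values


def dequantize(scale: float, values: list[int]) -> list[float]:
    """Approximate reconstruction of the original embedding."""
    return [v * scale for v in values]


original = [0.5, -1.0, 0.25]
scale, values = quantize_8bit(original)
restored = dequantize(scale, values)
print(scale, values, restored)
```

The trade-off is a small reconstruction error in exchange for a fraction of the storage, which matters when an embedding has to be kept for every document in an index.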

RankEmbed Similarity Scores

RankEmbed similarity scores are an integral part of the RankEmbed system, used to measure the similarity between different entities, queries, and video content. These scores help in evaluating how closely related certain pieces of information are, which in turn aids in ranking video content more effectively. Here’s a detailed explanation of the various types of RankEmbed similarity scores:

Types of RankEmbed Similarity Scores

  1. anchorReSimilarity
    • Description: This score measures the similarity between the RankEmbed neighbor and the video anchor.
    • Usage: It helps in understanding how closely a piece of content (RankEmbed neighbor) is related to the primary subject or anchor of the video. This is crucial for determining the relevance of content in relation to the main topic of the video.
  2. navQueryReSimilarity
    • Description: This score measures the similarity between the RankEmbed neighbor and the top navigation boost query of the video.
    • Usage: This score is used to assess the relevance of the RankEmbed neighbor to the most prominent query that boosts navigation for the video. It ensures that the content being ranked aligns well with popular or highly relevant queries that users are likely to use.
  3. reSimilarity
    • Description: This score measures the similarity between the RankEmbed neighbor and the original query candidate.
    • Usage: This score directly compares the RankEmbed neighbor with the original query to evaluate its relevance. It ensures that the content closely matches the user’s initial search intent.

Functionality

  • Embeddings and Nearest Neighbors: RankEmbed uses embeddings, which are vector representations of entities, queries, and content. By comparing these embeddings, RankEmbed calculates similarity scores. The nearest neighbors are those embeddings that are closest to the given query or entity in the vector space.
  • Relevance Evaluation: The similarity scores are used to evaluate the relevance of content in response to a search query. Higher similarity scores indicate a closer match, which helps in ranking the content higher in search results.
  • Ranking Optimization: By using these similarity scores, RankEmbed can optimize the ranking of video content to ensure that the most relevant and closely related content appears at the top of the search results.

Example Usage in Video Content Search

  1. Query Analysis: When a user searches for a video, RankEmbed analyzes the query and generates an embedding for it.
  2. Content Comparison: The system compares this query embedding with embeddings of various video content, calculating the anchorReSimilarity, navQueryReSimilarity, and reSimilarity scores.
  3. Ranking Decision: Based on these similarity scores, RankEmbed determines the relevance of each video and ranks them accordingly. Videos with higher similarity scores to the query are ranked higher, providing more accurate and relevant search results to the user.
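The three steps above can be sketched in a few lines of Python. The leak does not say which similarity function RankEmbed uses; cosine similarity is assumed here because it is a standard choice for comparing embeddings. The two-dimensional vectors and the candidate names are invented for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_neighbors(
    query: list[float], candidates: dict[str, list[float]], k: int
) -> list[tuple[str, float]]:
    """Return the k candidates most similar to the query, best first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical embeddings for a query and three videos.
query_embedding = [1.0, 0.0]
video_embeddings = {
    "how to tie a tie": [0.9, 0.1],
    "windsor knot explained": [0.7, 0.7],
    "cooking pasta": [0.0, 1.0],
}
top = nearest_neighbors(query_embedding, video_embeddings, k=2)
print(top)
```

The same function could be called with a video anchor or a navigation query instead of the user query, which would correspond to the anchorReSimilarity and navQueryReSimilarity comparisons.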

Video Content Search

The Video Content Search system within Google’s infrastructure includes several detailed components and metrics used to analyze and rank video content. Below is an overview based on the leaked documents:

Key Components and Metrics

  1. Video Content Metadata
    • Video Genre: Categorizes the video content by genre.
    • Video Type: Specifies the type of video (e.g., tutorial, vlog).
    • Video URL: The web address where the video is hosted.
    • Webref Entities: Entities related to the video content extracted from web references.
  2. Video Introduction Metrics
    • Has Intro: Indicates if the video has an introduction that can be skipped.
    • Intro Start and End Times: Timestamps marking the beginning and end of the skippable introduction.
  3. Video Centroid Domain Score
    • Domain: The domain from which the score was generated.
    • Number of Documents: The number of pages from the domain used to generate the score.
    • Score: Lower scores indicate the video is appearing on more diverse pages.
  4. Core Video Signals
    • Centroid: Data about the behavior of the video across pages it is embedded in.
    • Video Frames: Information about the individual frames of the video.
  5. Video OCR Features
    • Average Text Area Ratio: The ratio of text area to image area throughout the video frames.
    • Cluster ID to Frame Size: Mapping of cluster IDs to the number of frames in each cluster.
    • Duration in Milliseconds: Total length of the video.
    • Detected Language: Language detected from the video content.
    • Number of Clusters and Frames: Number of clusters and frames in the video.
    • OCR Detected Language: Language detected through OCR analysis.
  6. Video Multimodal Topic Features
    • Frame Starburst Data: Starburst vectors sorted by timestamp for multimodal topic features.
  7. Video Scoring Information
    • Common Features: Scoring features that apply to all anchor types within the video.
    • OCR Video Feature: Specific OCR-related video level features.
    • SafeSearch Classifier Output: SafeSearch’s MultiLabelClassifier output for video titles.
    • Version: Version of the VideoAnchorSets in spanner.
    • Generated Query Features: Video-level features that apply to all generated queries within the VideoAnchorSets.
    • Multimodal Topic Features: Features for multimodal topics at the video level.
  8. Additional Features
    • Title Entity Annotations: Annotations of entities found in the video title.
    • On-Screen Text Feature: Details about position, font, color, etc., of OCR text appearing on the frame.
    • Thumbnail Information: Indicators of missing Starburst embeddings or thumbnails, and thumbnail diversity score.
  9. Performance and Quality Metrics
    • ASR Language: Language information extracted from automatic speech recognition.
    • Description Anchors: Whether the video has description anchors.
    • Safe Indicator: Indicates if any anchors in the video have their “is_safe” field set to false.
    • Navqueries and NSR: Navigation queries and the NSR site-quality signal.
    • View Count: Number of views.
    • Duration: Video duration in milliseconds.
    • Craps Data: Data from the video content document.
    • Loudness Data: Audio information including loudness.
    • Inline Playback Metadata: Information for inline playback of the video.

Video Centroid Domain Score

The Video Centroid Domain Score is a metric used to evaluate the distribution and relevance of video content across different web pages. Here’s a detailed explanation of its components and their significance:

Components of the Video Centroid Domain Score

  1. Domain:
    • This represents the domain from which the centroid score is generated. It helps in identifying the source or the website where the video content is embedded.
  2. Number of Documents:
    • This is the count of different web pages or documents from the specified domain that contain the video. A higher number of documents indicates wider distribution of the video across that domain.
  3. Score:
    • The score is a quantitative measure that reflects how concentrated or diverse the video distribution is across different pages. A lower score typically suggests that the video is appearing on a more diverse set of pages, indicating broader relevance and engagement. Conversely, a higher score may imply that the video is concentrated in fewer documents, possibly indicating niche content or limited distribution.

Significance of the Video Centroid Domain Score

  • Distribution Analysis:
    • By analyzing the number of documents and the score, Google can assess how widely a video is distributed across a domain. This helps in understanding the video’s reach and popularity within that domain.
  • Content Relevance:
    • The score helps in determining the relevance of the video content. A lower score, indicating diverse page appearances, can be a positive signal for the video’s relevance and engagement across a wider audience.
  • Quality and Authority:
    • If a video is embedded across multiple authoritative documents within a domain, it may signal higher quality and authority of the video content, potentially influencing its ranking in search results.

Example Scenario

  • Suppose a video about “How to Tie a Tie” is embedded in several pages of a popular fashion blog. If the blog has 10 different articles featuring the video, the number of documents would be 10. If these pages are all diverse in content but still related to fashion and style, the score would be lower, indicating the video’s broad relevance within that domain.

QualityAuthorityTopicEmbeddingsVersionedItem

The QualityAuthorityTopicEmbeddingsVersionedItem is a model that stores versioned topic embeddings scores for a website. It includes several attributes that help in assessing the topical focus and quality of a site. Here are the details:

  1. pageEmbedding: This attribute stores the compressed embedding of the topics covered by an individual page.
  2. siteEmbedding: This attribute stores the compressed embedding of the topics covered by the entire site.
  3. siteFocusScore: A numerical score that indicates how focused a site is on a single topic.
  4. siteRadius: A measure of how far the page embeddings deviate from the site embedding. A smaller site radius indicates that the pages are more topically focused around the same central theme.
  5. versionId: An identifier for the version of the embeddings.

These attributes are designed to be populated into shards and copied to a superroot, and are used to provide a detailed representation of the topical focus and quality of a site based on its content and structure​​.
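How siteRadius relates to page and site embeddings can be illustrated with a small Python sketch. The leak does not specify the formula, so two assumptions are made here: the site embedding is the mean of the page embeddings, and the radius is the average Euclidean distance of the pages from that mean. The example vectors are invented.

```python
import math


def mean_vector(vectors: list[list[float]]) -> list[float]:
    """Component-wise mean; stands in for a site embedding (assumption)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def site_radius(page_embeddings: list[list[float]]) -> float:
    """Average distance of page embeddings from the site embedding (assumption)."""
    site_embedding = mean_vector(page_embeddings)
    distances = [math.dist(page, site_embedding) for page in page_embeddings]
    return sum(distances) / len(distances)


# Hypothetical 2D page embeddings for a focused and a broad site.
focused_site = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1]]
broad_site = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

print(site_radius(focused_site))
print(site_radius(broad_site))
```

Under these assumptions the focused site gets the smaller radius, which matches the description of siteRadius above: the closer the pages sit to the site's central theme, the smaller the value.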

NSR

NSR is a term used in Google’s internal documentation for a site-level signal that indicates the quality and trustworthiness of a website. The leaked documents do not spell out the acronym. NSR is part of a comprehensive framework for assessing and scoring the quality of web pages and sites.

Here are some key points regarding NSR from the leaked documents:

  1. Quality Signal: NSR is a measure of site quality and is used to predict the overall trustworthiness and authority of a site based on various signals and data points​​.
  2. Attributes: NSR includes several attributes such as tofu (site quality predictor based on content), healthScore, siteAutopilotScore, and others. These attributes collectively help in determining the NSR value for a site​​.
  3. Scoring and Adjustment: NSR values can be adjusted based on specific needs or in response to detected issues. For example, there is mention of nsrOverrideBid, which can override the NSR value in certain cases​​.
  4. Versioned Data: NSR scores are often versioned to allow for continuous evaluation and experimentation with different versions to assess quality impact on various slices​​.
  5. Clutter and Content Scores: Additional signals such as clutterScore, which penalizes sites with a large number of distracting resources, and articleScore for the quality of article content, also contribute to the overall NSR​​​​.

These points illustrate that NSR is a multifaceted metric used by Google to gauge the quality of websites and ensure that high-quality, trustworthy sites rank higher in search results. It incorporates a variety of signals and scores to provide a comprehensive assessment of site quality.

QualityNsrNsrData

The QualityNsrNsrData is a module within Google’s content evaluation framework. It comprises various attributes designed to measure the quality and performance of websites. Below is a detailed breakdown of each metric included in the QualityNsrNsrData:

  1. Tofu:
    • Type: number()
    • Description: A site-level quality predictor based on content quality.
  2. HealthScore:
    • Type: number()
    • Description: A categorical signal reflecting the site’s health status.
  3. SiteAutopilotScore:
    • Type: number()
    • Description: Aggregated value of URL autopilot scores for the site chunk.
  4. ClutterScore:
    • Type: number()
    • Description: A site-level signal penalizing sites with a large number of distracting or annoying resources.
  5. SitePr:
    • Type: number()
    • Description: Specific details not provided.
  6. NsrOverrideBid:
    • Type: number()
    • Description: Used to override NSR as a bid in Q*, effective when the value is greater than 0.001.
  7. ClusterUplift:
    • Type: GoogleApi.ContentWarehouse.V1.Model.QualityNsrNsrDataClusterUplift.t
    • Description: Specific details not provided.
  8. YMYLNewsV2Score:
    • Type: number()
    • Description: Specific details not provided.
  9. SmallPersonalSite:
    • Type: number()
    • Description: Score for promoting small personal sites.
  10. ArticleScoreV2:
    • Type: number()
    • Description: Specific details not provided.
  11. Pnav:
    • Type: number()
    • Description: Fractional signals related to navigation.
  12. VersionedData:
    • Type: list(GoogleApi.ContentWarehouse.V1.Model.QualityNsrNSRVersionedData.t)
    • Description: Versioned map of NSR values for experimentation with the next release.
  13. LocalityScore:
    • Type: number()
    • Description: The locality component of the LocalAuthority signal.
  14. PnavClicks:
    • Type: number()
    • Description: Denominator for the pnav computation.
  15. ShoppingScore:
    • Type: number()
    • Description: Specific details not provided.
  16. ChardVariance:
    • Type: number()
    • Description: Specific details not provided.
  17. ClusterId:
    • Type: integer()
    • Description: ID for defining clusters of sites used in ecosystem experiments.
  18. Metadata:
    • Type: GoogleApi.ContentWarehouse.V1.Model.QualityNsrNsrDataMetadata.t
    • Description: Specific details not provided.
  19. Language:
    • Type: integer()
    • Description: Specific details not provided.
  20. URL:
    • Type: String.t
    • Description: Specific details not provided.
  21. ChardEncoded:
    • Type: integer()
    • Description: Site-level chard score as a content quality predictor.
  22. NsrdataFromFallbackPatternKey:
    • Type: boolean()
    • Description: Indicates if the NSR data is from a fallback pattern key.
  23. I18nBucket:
    • Type: integer()
    • Description: Corresponds to i18n_g42_bucket.
  24. ChromeInTotal:
    • Type: number()
    • Description: Site-level Chrome views.

Example Metrics in Use

These metrics are crucial for Google’s evaluation and ranking algorithms. For instance, the Tofu score helps in determining the overall quality of a site’s content, while ClutterScore ensures that sites with intrusive elements are penalized. VersionedData allows for continuous assessment and improvement of NSR values, enabling the testing of new updates before they are fully rolled out.
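The override rule described for NsrOverrideBid can be written down as a short Python sketch. Only the threshold of 0.001 comes from the field description; the class, the function name, and the idea that the override simply replaces the NSR value are assumptions for illustration.

```python
from dataclasses import dataclass

# Threshold taken from the field description: the override only takes
# effect when the value is greater than 0.001.
OVERRIDE_THRESHOLD = 0.001


@dataclass
class NsrData:
    # Hypothetical, heavily reduced stand-in for QualityNsrNsrData.
    nsr: float
    nsr_override_bid: float = 0.0


def effective_nsr(data: NsrData) -> float:
    """Return the override bid if it is active, otherwise the regular NSR value."""
    if data.nsr_override_bid > OVERRIDE_THRESHOLD:
        return data.nsr_override_bid
    return data.nsr


print(effective_nsr(NsrData(nsr=0.4, nsr_override_bid=0.7)))
print(effective_nsr(NsrData(nsr=0.4, nsr_override_bid=0.0005)))
```

A threshold slightly above zero like this is a common way to treat an unset or default field as "no override".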

NSR metrics

NSR metrics are used by Google to evaluate and score the quality of web pages and sites. These metrics help in continuous evaluation and assessment of quality impacts on various slices. Here is a detailed look at the NSR metrics:

  1. NSR 
    • Type: Number
    • Description: This represents the main NSR score used for quality evaluation.
    • Reference: nsr
  2. Prior Adjusted NSR
    • Type: List of NSRVersionedItem
    • Description: An estimate of whether the site is above or below average NSR in its slice. This metric includes prior adjustments based on historical data.
    • Reference: priorAdjustedNsr
  3. NSR Epoch
    • Type: String
    • Description: The epoch or time period from which the NSR value is derived.
    • Reference: nsrEpoch
  4. Versioned Data
    • Type: List of NSRVersionedItem
    • Description: Versioned map of NSR values used for experimenting with upcoming NSR versions.
    • Reference: versionedData
  5. Confidence Score
    • Type: Integer
    • Description: Confidence score associated with the NSR data, indicating the reliability of the NSR score.
    • Reference: nsrConfidence
  6. Site Link In
    • Type: Number
    • Description: Average value of the site link in scores for pages within a site chunk.
    • Reference: siteLinkIn
  7. Site Link Out
    • Type: Number
    • Description: Aggregated value of URL link out scores for the site chunk.
    • Reference: siteLinkOut
  8. Site Quality Standard Deviation
    • Type: Number
    • Description: Estimate of the site’s PQ rating standard deviation, representing the spread of page-level PQ ratings within a site.
    • Reference: siteQualityStddev
  9. NSR Variance
    • Type: Number
    • Description: NSR variance logodds, representing the variability of NSR scores.
    • Reference: nsrVariance
  10. Site Quality Standard Deviations
    • Type: List of QualityNsrVersionedFloatSignal
    • Description: Represents the spread of quality scores across different site chunks.
    • Reference: siteQualityStddevs
  11. Secondary Site Chunk
    • Type: String
    • Description: When present, provides more granular chunking than primary site chunks.
    • Reference: secondarySiteChunk
  12. Cluster Uplift
    • Type: QualityNsrNsrDataClusterUplift
    • Description: Uplift scores for different clusters used in ecosystem experiments.
    • Reference: clusterUplift
  13. Article Score
    • Type: Number
    • Description: Score derived from article classification of the site.
    • Reference: articleScore
  14. SpamBrain LAVC Score
    • Type: Number
    • Description: The SpamBrain LAVC score. The leaked documents do not expand the abbreviation LAVC.
    • Reference: spambrainLavcScore
  15. Health Score
    • Type: Number
    • Description: A categorical signal representing the health of the site.
    • Reference: healthScore
  16. Clutter Score
    • Type: Number
    • Description: Delta site-level signal penalizing sites with a large number of distracting resources.
    • Reference: clutterScore
  17. Product Review Promote Page
    • Type: Integer
    • Description: Indicates the likelihood of a page being promoted for high-quality product review content.
    • Reference: productReviewPPromotePage
  18. Experimental NSR Team Data
    • Type: QualityNsrExperimentalNsrTeamData
    • Description: Data used during experimental scenarios for assessing quality impacts.
    • Reference: experimentalNsrTeamData
  19. Exact Match Domain Demotion
    • Type: Integer
    • Description: Demotion applied to low-quality exact match domains (EMDs), i.e. domains whose name exactly matches a search query.
    • Reference: exactMatchDomainDemotion
  20. SERP Demotion
    • Type: Integer
    • Description: Demotion applied within the Search Engine Results Pages (SERP).
    • Reference: serpDemotion

These metrics are part of a broader system to evaluate, score, and rank web pages and sites, ensuring that high-quality content is promoted while lower-quality content is demoted.
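As a small worked example for one of these metrics, siteQualityStddev is described as the spread of page-level PQ ratings within a site. The Python sketch below computes a population standard deviation over invented ratings. Whether Google uses the population or the sample formula, and what scale PQ ratings use, is not stated in the leak.

```python
from statistics import pstdev


def site_quality_stddev(page_pq_ratings: list[float]) -> float:
    """Population standard deviation of page-level ratings (0.0 for an empty site)."""
    if not page_pq_ratings:
        return 0.0
    return pstdev(page_pq_ratings)


# Hypothetical page-level PQ ratings for two sites.
consistent_site = [0.6, 0.6, 0.6]
mixed_site = [0.2, 0.9, 0.5]

print(site_quality_stddev(consistent_site))
print(site_quality_stddev(mixed_site))
```

A site whose pages are all of similar quality yields a value near zero, while a site that mixes strong and weak pages yields a larger spread.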

Qualityboost

Qualityboost is a system used by Google to improve the quality of search results by applying various metrics and signals. The detailed information about Qualityboost, including its metrics, is as follows:

Metrics Included in Qualityboost:

  1. Baby Panda Demotion
    • Type: Integer
    • Description: Demotion applied to pages that are negatively affected by the Baby Panda algorithm, indicating lower quality.
    • Reference: babyPandaDemotion
  2. Product Review Quality Page (PUhq)
    • Type: Integer
    • Description: Indicates the likelihood of a page being a high-quality product review.
    • Reference: productReviewPUhqPage
  3. NSR Versioned Data
    • Type: List of NSRVersionedItem
    • Description: Versioned NSR scores used for continuous evaluation and quality impact assessment.
    • Reference: nsrVersionedData
  4. Experimental NSR Team Data
    • Type: QualityNsrExperimentalNsrTeamData
    • Description: Data used during experimental LEs (Live Experiments), not propagated to shards and meant for use during specific experimental scenarios.
    • Reference: experimentalNsrTeamData
  5. Navigation Demotion
    • Type: Integer
    • Description: Demotion applied to navigational pages, affecting their ranking negatively.
    • Reference: navDemotion
  6. PairwiseQ Versioned Data
    • Type: List of PairwiseQVersionedItem
    • Description: Versioned PairwiseQ scores used for evaluating quality impact on various slices.
    • Reference: pairwiseqVersionedData
  7. PQ Data Proto
    • Type: QualityNsrPQData
    • Description: Stripped page-level signals not present in the encoded field ‘pq_data’.
    • Reference: pqDataProto
  8. NSR Confidence
    • Type: Integer
    • Description: Confidence score associated with NSR data.
    • Reference: nsrConfidence
  9. Exact Match Domain Demotion
    • Type: Integer
    • Description: Demotion applied to low-quality exact match domains (EMDs), i.e. domains whose name exactly matches a search query.
    • Reference: exactMatchDomainDemotion
  10. Experimental NSR Team WSJ Data
    • Type: List of QualityNsrExperimentalNsrTeamWSJData
    • Description: Data used during experimental scenarios, not propagated to shards, and used during LEs.
    • Reference: experimentalNsrTeamWsjData
  11. Product Review P Promote Page
    • Type: Integer
    • Description: Indicates the likelihood of a page being promoted for its product review content.
    • Reference: productReviewPPromotePage
  12. Craps New Pattern Signals
    • Type: String
    • Description: Signals related to new patterns identified in quality evaluation.
    • Reference: crapsNewPatternSignals
  13. Experimental Qstar Delta Signal
    • Type: Number
    • Description: Experimental delta signals used during specific scenarios, not propagated to shards.
    • Reference: experimentalQstarDeltaSignal
  14. Panda Demotion
    • Type: Integer
    • Description: Demotion applied due to negative impact by the Panda algorithm, indicating lower content quality.
    • Reference: pandaDemotion
  15. Anchor Mismatch Demotion
    • Type: Integer
    • Description: Demotion applied due to mismatches in anchor text signals.
    • Reference: anchorMismatchDemotion
  16. Topic Embeddings Versioned Data
    • Type: List of QualityAuthorityTopicEmbeddingsVersionedItem
    • Description: Versioned topic embeddings data used for direct scoring and quality evaluation.
    • Reference: topicEmbeddingsVersionedData

These metrics collectively aim to enhance the quality of search results by demoting low-quality content and promoting high-quality content based on a variety of signals and experimental data.

How does Baby Panda Demotion work?

The Baby Panda Demotion is a metric used by Google’s Qualityboost system to demote pages that are negatively impacted by the Baby Panda algorithm. According to the leak, the value is converted from QualityBoost.rendered.boost and stored as an integer. It applies a demotion score to pages deemed to have lower-quality content under the Baby Panda criteria, so that these pages rank lower in search results and higher-quality content is promoted.

Here is a summary of how Baby Panda Demotion works:

  • Conversion and Type: The Baby Panda Demotion is converted from QualityBoost.rendered.boost and is of integer type.
  • Purpose: The demotion is applied to pages that are affected negatively by the Baby Panda algorithm, indicating that these pages are of lower quality.
  • Implementation: The demotion score is applied at serving time, which means it is used dynamically when generating search results.
  • Integration with Other Metrics: It is part of a broader set of quality signals used by the Qualityboost system to evaluate and rank web pages.

Most interesting attributes and scores for ranking

Based on the leak documents, some of the most interesting attributes and scores for ranking include:

  • QualityOrbitAsteroidBeltDocumentIntentScores:
    • Intents and Scores: Stored as parallel lists for compactness. The scores are scaled between 0 and 100.
    • Image Intent Scores: Specific scores for images in the context of a landing page. Each score is also scaled between 0 and 100.
  • RepositoryWebrefNameScores:
    • Total Score: Describes the overall data volume for a name/source. It can be the sum of all entity scores for a name​​.
    • IDF Score: Reflects the inverse document frequency of the name, a crucial factor in determining relevance​​.
  • Goldmine Readability Score:
    • This score, along with other related metrics like Geometry Factor, Title Tag Factor, and Final Score, is used to evaluate the quality of content on a page​​.
  • RepositoryWebrefPreprocessingNameVariantSignals:
    • Prior Score/Trust: Common scores shared by all sources, providing a baseline trust level for name variants​​.
  • VideoContentSearchBleurtFeatures:
    • Candidate and Reference Texts: These features are used for BLEURT (Bilingual Evaluation Understudy with Representations from Transformers) inference, which helps in evaluating the quality of video content based on text similarity​​.
  • Connectedness: Represents how much the entity is connected or related to other entities in the document. This signal partially influences the topicality score​​.
  • DocScore: Measures how well the document scores for the entity, serving as a relative ranking signal between different documents for an entity​​.
  • RelevanceScore: A relevance score generated by a Machine Learning entity classifier, similar to topicality but machine learning-based and supported by EntitySignals​​.
  • NormalizedTopicality: A representation of the topicality score that is normalized and represents the proportion of the document that talks about the entity​​.
  • ImageIntentScores: These are intent scores for images in the context of a landing page, scaled between 0 and 100​​.
  • PerDocRelevanceRating: Includes several attributes like content relevance and rater’s understanding of the topic, essential for document-level relevance ratings​​.
  • LinkInfo: Contains all links with scores known for an entity, crucial for the quality of the model​​.
  • NameInfo: Contains all names with scores known for an entity, important for model quality​​.
  • BleurtFeatures: Used in video content search to evaluate quality based on text similarity​​.
  • DetectedDefects: Lists defects detected in an image with confidence scores ranging from 0 to 1, where 1 indicates strong confidence that the defect exists​​.
  • QualityScore: Represents the overall quality score of an image, with a range of 0 to 1, where 1 indicates perfect quality​​.
  • FreshnessTwiddler: Considers dates-related info, such as the age of a page based on its date annotations​​.
  • LastSignificantUpdate: Timestamp of the last significant update to a document, affecting the freshness score​​.
  • Spamrank: Measures the likelihood that a document links to known spammers, with values between 0 and 65535​​.
  • QualityFeatures in Snippet Scoring:
    • ForeignMetaScore
    • HiddenRatioScore
    • NumTidbitsScore
    • NumVisibleTokensScore
    • OutlinkScore
    • RedundancyScore
    • SentenceStartScore​​.
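The "parallel lists" storage described for QualityOrbitAsteroidBeltDocumentIntentScores can be sketched in Python as follows. The intent names are invented, and the assumption that the original scores are probabilities between 0 and 1 is mine; the leak only states that the stored scores are scaled between 0 and 100.

```python
def encode_intent_scores(scores: dict[str, float]) -> tuple[list[str], list[int]]:
    """Store intents and scores as two parallel lists, scores scaled to 0..100."""
    intents = sorted(scores)
    # Assumption: input scores lie in [0, 1]; clamp to be safe.
    scaled = [max(0, min(100, round(scores[name] * 100))) for name in intents]
    return intents, scaled


def decode_intent_scores(intents: list[str], scaled: list[int]) -> dict[str, float]:
    """Rebuild the mapping; position i in both lists belongs together."""
    return {name: value / 100 for name, value in zip(intents, scaled)}


intents, scaled = encode_intent_scores({"shopping": 0.87, "informational": 0.1})
print(intents, scaled)
print(decode_intent_scores(intents, scaled))
```

Two flat lists of names and small integers take less space than a list of name and float pairs, at the cost of two-decimal precision.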

About Olaf Kopp

Olaf Kopp is Co-Founder, Chief Business Development Officer (CBDO) and Head of SEO & Content at Aufgesang GmbH. He is an internationally recognized industry expert in semantic SEO, E-E-A-T, modern search engine technology, content marketing and customer journey management. As an author, Olaf Kopp writes for national and international magazines such as Search Engine Land, t3n, Website Boosting, Hubspot, Sistrix, Oncrawl, Searchmetrics, Upload … . In 2022 he was a top contributor for Search Engine Land. His blog is one of the most famous online marketing blogs in Germany. In addition, Olaf Kopp is a speaker on SEO and content marketing at SMX, CMCx, OMT, OMX, Campixx...
