Author: Olaf Kopp

Phrase-based indexing in an information retrieval system

The patent describes a method for indexing, retrieving, organizing, and describing documents in an information retrieval system, primarily focusing on the use of phrases. It innovates by identifying phrases that predict the presence of other phrases in documents, allowing documents to be indexed based on these identified phrases. The system also identifies related phrases and extensions, utilizes phrases in queries to retrieve and rank documents, clusters documents in search results, generates document descriptions, and removes duplicate documents from search results and indexes.

Patent ID: US7536408B2
Countries: United States
Date of Patent: May 19, 2009
Inventor: Anna Lynn Patterson, San Jose, CA, USA
Assignee: Google Inc., Mountain View, CA, USA

It likely belongs to the same patent family as “Phrase-based searching in an information retrieval system”.

Patent ID: US9990421B2
Countries: United States
Date of Patent: June 5, 2018
Inventor: Anna L. Patterson
Applicant/Assignee: Google LLC, Mountain View, CA


The background of the patent US 7536408 B2 addresses the limitations of traditional information retrieval systems, which primarily rely on indexing documents based on individual words. This approach often fails to capture the nuanced meaning conveyed by phrases, leading to less relevant search results. Traditional systems might not effectively identify documents that don’t contain the exact query terms but are related by context or meaning. For instance, a search for “Australian Shepherds” might miss relevant documents about similar herding dogs or fetch unrelated documents about Australia or shepherds in general.

The patent also notes why indexing by phrases has been seen as computationally and memory intensive: the number of possible word combinations is vast, and language is dynamic, with new phrases constantly emerging and others becoming obsolete. Some systems attempt concept retrieval through co-occurrence patterns of individual words, but this approach often misses the richer context provided by phrases.

Given these challenges, there’s a recognized need for an information retrieval system that can more effectively identify, index, and search documents based on phrases, thereby capturing the fuller context and meaning of the content. The background sets the stage for introducing a novel phrase-based indexing and retrieval approach aimed at overcoming the limitations of traditional keyword-based systems.


The patent includes a set of claims focused on the methodology and system for phrase-based indexing and information retrieval, offering a detailed blueprint on how to enhance document organization, searchability, and relevance in a database, such as the web. Here’s a summary of the key claims:

  • Predictive Phrase Identification: The patent describes methods for identifying phrases that predict the presence of other phrases within documents, improving the system’s ability to index and search documents contextually. In analyzing documents, the system might identify “climate change” as a predictive phrase that often co-occurs with phrases like “global warming,” “carbon emissions,” and “sea-level rise.” This predictive relationship enables the system to index and retrieve documents more contextually when users search for topics related to climate change.

  • Utilization of Related Phrases and Extensions: It includes claims on identifying related phrases and phrase extensions, enriching the indexing process and enhancing search query responses with more relevant results. Documents are indexed not just by the phrases they contain, but also by related phrases and phrase extensions. This approach allows for a richer understanding of the document’s content, enabling the system to rank documents more effectively by considering the broader context and variations of phrase usage. If a user searches for “quantum computing,” the system might extend this query to include related phrases like “quantum encryption,” “qubits,” and “superposition,” even if the original query did not explicitly mention these terms. This ensures a broader and more relevant set of documents is retrieved.
  • Search Query Handling: Claims cover the use of identified phrases in processing search queries, including the retrieval and ranking of documents based on their relevance to the query phrases. When processing search queries, the system identifies phrases within the query and uses these phrases to retrieve and rank documents. This includes considering both the direct phrases found in the query and their related phrases or extensions, allowing for a broader and more nuanced retrieval of relevant documents.
  • Document Clustering and Description Generation: The patent claims methods for using phrases to cluster documents within search results and to generate concise document descriptions. Clustering uses the identified phrases and their relationships to group documents that cover similar topics or concepts, which helps users navigate the results. For a search on “renewable energy sources,” the system might cluster documents into subcategories like “solar energy,” “wind power,” and “hydroelectric energy” based on the occurrence and relationships of phrases within those documents. To further organize and present search results, the system generates brief descriptions of documents based on the phrases and related phrases they contain. These descriptions give users quick insight into a document’s content and highlight relevant phrases in context.

  • Duplicate Document Elimination: There are claims related to identifying and eliminating duplicate documents from search results and the index, thereby maintaining a cleaner and more efficient database. As part of its organization features, the system identifies and removes duplicate documents from search results and the indexing database. This process ensures that the search results are not cluttered with redundant information, making it easier for users to find unique and relevant content. When multiple documents from different sources contain the same text about “The Life of Marie Curie,” the system identifies and removes these duplicates from search results, ensuring users are presented with unique content for their queries.
  • Ranking Based on Phrase Predictiveness: The system ranks documents by analyzing the predictiveness of phrases within them. Phrases that are identified to predict the presence of other phrases in documents contribute to a document’s relevance score. This predictiveness is measured through a statistical analysis of phrase co-occurrence, with phrases that have a high predictive value contributing more significantly to the ranking.
  • Ranking Personalization: The patent hints at the possibility of personalizing document rankings based on the user’s history or preferences, suggesting that the system could adapt search results to better match an individual’s specific interests or previous interactions with the search engine.
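
The idea of one phrase predicting another, described in the bullets on predictive phrase identification and ranking, can be sketched as a simple co-occurrence ratio. This is a minimal illustration under stated assumptions, not the patent's implementation: each document is assumed to be already reduced to a set of phrases, and a pair is scored by comparing how often the two phrases actually appear together with how often they would if they were independent. The function names, the toy corpus, and the threshold of 1.5 are assumptions made for this example.

```python
def predictiveness(docs, a, b):
    """Ratio of observed to expected co-occurrence of phrases a and b.

    docs is a list of sets of phrases. A ratio well above 1.0 means that
    documents containing `a` contain `b` more often than chance suggests.
    """
    n = len(docs)
    count_a = sum(1 for d in docs if a in d)
    count_b = sum(1 for d in docs if b in d)
    count_ab = sum(1 for d in docs if a in d and b in d)
    if count_a == 0 or count_b == 0:
        return 0.0
    # (count_ab / n) / ((count_a / n) * (count_b / n)), simplified
    return count_ab * n / (count_a * count_b)


def related_phrases(docs, phrase, threshold=1.5):
    """Phrases whose co-occurrence ratio with `phrase` meets the threshold.

    The threshold value is an assumption chosen for this toy corpus.
    """
    vocabulary = set().union(*docs) - {phrase}
    return sorted(
        p for p in vocabulary if predictiveness(docs, phrase, p) >= threshold
    )


# Toy corpus (invented for illustration): one set of phrases per document.
docs = [
    {"climate change", "global warming", "carbon emissions"},
    {"climate change", "global warming"},
    {"climate change", "sea-level rise"},
    {"stock market", "interest rates"},
    {"stock market", "carbon emissions"},
    {"interest rates", "housing prices"},
]

print(related_phrases(docs, "climate change"))
```

In this corpus “carbon emissions” appears with “climate change” exactly as often as chance would suggest (ratio 1.0), so it is not reported as related, while “global warming” and “sea-level rise” both score 2.0. A search system could use such a list to extend a query with related phrases or to weight documents that contain them.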

Figure 3 illustrates an example of phrase-based indexing within a text document, specifically focusing on how phrases are identified and processed in the context of the patent’s information retrieval system. The figure depicts a portion of a document that discusses the history and origin of the Australian Shepherd dog breed. This example is used to demonstrate the identification of phrases and the potential application of a phrase window and a secondary window in analyzing document text for phrase-based indexing.

The figure shows a “phrase window” that moves through the document text. This window identifies potential phrases for indexing by capturing a sequence of words. In the example it spans a specific number of words (e.g., “stock dogs for the Basque shepherds”), which shows how phrases of various lengths can be identified.

Alongside the primary phrase window, a secondary window is depicted, which extends around the identified phrase. This secondary window is used to analyze the context surrounding the phrase, looking for related phrases or terms that might predict or be predicted by the phrase within the window. This process is crucial for understanding the semantic relationships between phrases within documents.

The figure demonstrates how phrases are identified within the text, including both the primary phrase under analysis and potentially related phrases within the secondary window. This identifies not just isolated phrases but also their context and connection to other phrases, enhancing the document’s indexing and retrieval based on related semantic concepts.

Through the use of the phrase and secondary windows, the system analyzes the document’s context, enabling a deeper understanding of how phrases are used and related within the text. This allows the information retrieval system to more accurately index documents based on the semantic richness of their content, rather than on individual keywords alone.
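
The two windows can be illustrated with a short sketch. It assumes whitespace tokenization, a phrase window of up to four words, and a secondary window of a few words on either side of the phrase. These sizes, the helper names, and the sample sentence are chosen for the example and are not taken from the patent.

```python
def candidate_phrases(tokens, max_len=4):
    """Slide a phrase window over the tokens.

    Yields (start_index, phrase_tokens) for every span of 1..max_len words.
    """
    for start in range(len(tokens)):
        for length in range(1, max_len + 1):
            end = start + length
            if end > len(tokens):
                break
            yield start, tokens[start:end]


def secondary_window(tokens, start, length, radius=5):
    """Words surrounding a phrase, up to `radius` tokens on each side.

    The phrase itself (tokens[start:start + length]) is excluded.
    """
    left = tokens[max(0, start - radius):start]
    right = tokens[start + length:start + length + radius]
    return left, right


# Sample sentence invented for illustration.
text = "the Australian Shepherd was bred as stock dogs for the Basque shepherds"
tokens = text.lower().split()

# All candidate phrases produced by the phrase window.
phrases = list(candidate_phrases(tokens))

# Context around the two-word phrase starting at index 1 ("australian shepherd").
left, right = secondary_window(tokens, start=1, length=2, radius=3)
print(left, right)
```

Phrases found inside the secondary window of a candidate phrase are the ones whose co-occurrence with it would be counted, so the window sizes directly control how much context the analysis sees.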

Implications for SEO

1. Phrase-Based Content Optimization:

SEO strategies should evolve to emphasize phrase-based content optimization. Instead of targeting isolated keywords, content creators and SEO professionals should ensure that content naturally incorporates relevant phrases and their variations that users might search for. This approach aligns with the system’s emphasis on understanding and indexing content based on phrases, enhancing the likelihood of being ranked for relevant searches.

2. Semantic Relevance and Context:

The patent underscores the importance of semantic relevance and context in content creation. SEO efforts should focus on developing content that covers topics comprehensively, incorporating phrases that are semantically related or predictively linked. This not only improves the content’s value to users but also aligns with the system’s method of identifying and ranking documents based on the contextual relationships between phrases.

3. Long-Tail Keywords and Phrases:

Given the system’s ability to index and retrieve documents based on phrase predictiveness and extensions, optimizing for long-tail keywords—longer and more specific phrases that users are likely to use when searching—becomes increasingly important. Long-tail phrases often capture user intent more accurately and can lead to more qualified traffic, aligning well with the patent’s methodology.

4. Content Clustering and Structuring:

The concept of document clustering as described in the patent suggests an opportunity for SEO through structured content and site architecture. By organizing content into clearly defined clusters or categories based on related phrases, websites can mimic the system’s document clustering approach, potentially improving site navigability and relevance for specific phrase-based queries.

5. Unique and Comprehensive Content:

Because the system removes duplicate documents from search results and from the index, content should be unique and comprehensive: it should cover topics in depth, using a variety of related phrases and contextually relevant information. This helps avoid being filtered out as a duplicate and positions the content as a valuable resource for both users and search engines.

6. Meta Descriptions and Snippets:

With the system generating document descriptions based on phrases, crafting compelling meta descriptions and snippets that include key phrases becomes crucial. These elements should accurately summarize the content’s main topics and incorporate relevant phrases to capture the system’s and users’ attention in search results.

7. Adapting to User Query Patterns:

Understanding how users phrase their queries and the types of phrase extensions they might search for can guide content optimization. SEO strategies should be flexible and adapt to changing user behaviors and query patterns, reflecting the system’s dynamic approach to phrase identification and indexing.

