Author: Olaf Kopp
Reading time: 9 Minutes

Information gain score: How is it calculated? Which factors are crucial?


Information gain is one of the most exciting ranking factors for modern search engines, and therefore for SEO. Many explanations of information gain lack depth and offer no approaches to optimizing for it. This article gives a deep overview of the concept, its calculation, and SEO approaches to optimizing for information gain. It also explains the connection to phrase-based indexing.

These insights about information gain are based on the most interesting Google patents on the subject.

What is information gain in the context of information retrieval and search engines?

Information gain refers to a score that indicates the additional information included in a document beyond the information contained in documents previously viewed by a user. This score helps in determining how much new information a document will provide to the user compared to what the user has already seen.

The techniques described in the patents apply data from documents across a machine learning model to generate an information gain score, which helps present documents to the user in an order that prioritizes those containing the most new information.

In information retrieval and search engines, information gain is used to evaluate the relevance and effectiveness of documents or terms in reducing uncertainty about the information needs of users. It helps in ranking documents and enhancing the overall search experience.

Entropy is a measure of uncertainty or randomness in a set of outcomes. In the context of information theory, it quantifies the amount of information needed to describe the state of a system.

A larger information gain implies lower-entropy groups of samples after a split, and hence less surprise.

What is the role of entropy in information gain?

Entropy plays a crucial role in information gain within decision tree learning. Specifically, entropy is a measure of impurity or uncertainty in a dataset. When constructing decision trees, information gain is used to determine which attribute best separates the data into distinct classes. Information gain is calculated as the reduction in entropy that results from partitioning the data based on a given attribute.

  • Entropy: Measures impurity or randomness in data.
    • High entropy: Classes are evenly mixed, so the data is highly impure and uncertain.
    • Low entropy: One class dominates, so the data is more uniform and predictable.
    • Maximum entropy values change with the number of classes (e.g., 2 classes: max entropy is 1, 4 classes: max entropy is 2).
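
The entropy values above can be verified with a short Python sketch (the class labels are invented for illustration):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in counts.values())

print(entropy(["a", "b", "a", "b"]))  # 1.0  (2 evenly mixed classes: max entropy 1)
print(entropy(["a", "b", "c", "d"]))  # 2.0  (4 evenly mixed classes: max entropy 2)
print(entropy(["a", "a", "a", "a"]))  # 0.0  (pure set: no uncertainty)
```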

The process of determining an information gain score

Determining an information gain score can follow these steps:

  1. Identify previously presented documents: The system identifies a set of documents that share a common topic and that have already been presented to the user.
  2. Identify new documents: It then identifies new documents that share the same topic but have not yet been presented to the user.
  3. Determine information gain score: For each new document, an information gain score is calculated. This score reflects the amount of new information in the document that is not present in previously presented documents.
  4. Select and present documents: Documents are selected based on their information gain scores and presented to the user. The selection and presentation can be in ranking order, with higher information gain scores being prioritized.
  5. Use in automated assistants: The automated assistant can use these scores to provide more efficient, relevant, and non-redundant information to users during an interactive session, enhancing the overall user experience.
  6. Machine learning application: The information gain score may be determined using a machine learning model that processes semantic representations of the documents to identify new information.
A search interface shows references to documents ranked based on their information gain scores. This interface allows the user to select and access documents presumed to provide the most additional information that the user has not yet obtained.
Fig. 4 illustrates a set of documents categorized based on whether or not the user has viewed them. Initially, all documents are in an unviewed state. When the user views a document, it moves from the unviewed set to the viewed set. This classification is dynamic and updates as the user interacts with more documents.
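
The patent leaves the scoring model itself open. As a minimal sketch of steps 1 to 4, the following ranks unviewed documents by a simple term-overlap novelty heuristic; this heuristic is only a stand-in for the trained model, and all document texts are invented:

```python
def information_gain_score(new_doc, viewed_docs):
    """Toy novelty score: the fraction of the new document's terms that do
    not appear in any previously viewed document. The patent uses a trained
    model over semantic representations; this overlap heuristic only
    illustrates the ranking step."""
    new_terms = set(new_doc.lower().split())
    seen_terms = set().union(*(set(d.lower().split()) for d in viewed_docs))
    return len(new_terms - seen_terms) / len(new_terms) if new_terms else 0.0

# Invented troubleshooting example: software fixes were already viewed.
viewed = ["reinstall the audio driver", "update the audio driver"]
candidates = [
    "update the audio driver again",         # mostly redundant
    "check the speaker cable and hardware",  # mostly new information
]
ranked = sorted(candidates,
                key=lambda d: information_gain_score(d, viewed),
                reverse=True)
print(ranked[0])  # check the speaker cable and hardware
```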

How is the information gain score calculated?

Mathematically, information gain is generally calculated using the formula:

[ \text{Information Gain} = \text{Entropy of parent node} - \text{Average entropy of child nodes} ]

Steps involved in calculating Information Gain:

  1. Calculate Entropy at the Parent Node: [ H(t) = -\left(p_{C,t} \log_2(p_{C,t}) + p_{NC,t} \log_2(p_{NC,t}) \right) ] where ( p_{C,t} ) and ( p_{NC,t} ) are the probabilities of the class labels at the parent node.
  2. Calculate Entropy at Child Nodes: [ H(t_L) = -\left(p_{C,L} \log_2(p_{C,L}) + p_{NC,L} \log_2(p_{NC,L}) \right) ] [ H(t_R) = -\left(p_{C,R} \log_2(p_{C,R}) + p_{NC,R} \log_2(p_{NC,R}) \right) ] Similar to the parent node calculation, but for the left and right child nodes.
  3. Compute Average Entropy of Child Nodes: [ H(s,t) = P_L H(t_L) + P_R H(t_R) ] where ( P_L ) and ( P_R ) are the probabilities of the samples in the left and right child nodes relative to the parent node.
  4. Subtract Average Child Node Entropy from Parent Node Entropy: [ \text{Information Gain} = H(t) - H(s,t) ]

This formula helps in selecting the attribute that provides the greatest information gain (i.e., best splits the data) at each node of the decision tree, thereby reducing the entropy and creating more informative and discriminative splits.
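
The four steps above can be written out directly in Python; the class counts are invented for illustration:

```python
from math import log2

def node_entropy(p_c, p_nc):
    """H(t) = -(p_C log2 p_C + p_NC log2 p_NC), with 0 * log2(0) taken as 0."""
    return sum(-p * log2(p) for p in (p_c, p_nc) if p > 0)

def information_gain(parent, left, right):
    """parent/left/right are (count_C, count_NC) tuples for one split s of node t."""
    h = lambda counts: node_entropy(counts[0] / sum(counts), counts[1] / sum(counts))
    n = sum(parent)
    p_l, p_r = sum(left) / n, sum(right) / n             # P_L and P_R
    return h(parent) - (p_l * h(left) + p_r * h(right))  # H(t) - H(s,t)

# Parent node holds 10 C / 10 NC samples (entropy 1.0); the split sends
# 8 C + 2 NC to the left child and 2 C + 8 NC to the right child.
print(round(information_gain((10, 10), (8, 2), (2, 8)), 3))  # 0.278
```

A perfect split, e.g. `information_gain((10, 10), (10, 0), (0, 10))`, yields the maximum gain of 1.0, because both child nodes become pure.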

How is the machine learning model trained to identify information gain?

The machine learning model is trained to identify information gain by first gathering a set of documents that have already been viewed by the user. This set, known as the first set of documents, shares a common topic. A second set of documents, which haven’t been viewed by the user but share the same topic, is identified. To determine the information gain score for these unviewed documents, data indicative of the documents (such as their contents, salient extracted information, or semantic representations) from both the first and second sets are provided as input across a trained machine learning model.

How does the machine learning model determine new versus old information?

The machine learning model determines new versus old information through a process that involves generating an information gain score for each document. The information gain score measures the amount of new information a document provides relative to the documents that the user has already viewed. Here’s how it works in detail:

  • Document Identification: The model first identifies a set of documents that the user has already viewed (first set) and another set of documents that have not yet been viewed but belong to the same topic (second set).
  • Feature Extraction: For both sets of documents, the model extracts data features such as entire content, salient information, semantic representations (like embeddings or feature vectors), etc.
    1. Entire Contents: This includes complete content analysis of the document.
    2. Salient Extracted Information: Key pieces of information extracted from the document.
    3. Semantic Representations: Including embeddings, feature vectors, bag-of-words representations, and histograms generated from words/phrases in the document.
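
A minimal sketch of this representation-and-compare step, using bag-of-words vectors and cosine similarity in place of the patent's trained model (all texts are invented):

```python
from collections import Counter
from math import sqrt

def bow(text):
    """Bag-of-words representation: word -> count."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = lambda v: sqrt(sum(n * n for n in v.values()))
    return dot / (norm(a) * norm(b)) if a and b else 0.0

def novelty(new_doc, viewed_docs):
    """1 minus the highest similarity to any viewed document:
    near 1 = mostly new information, near 0 = redundant."""
    return 1.0 - max(cosine(bow(new_doc), bow(d)) for d in viewed_docs)

viewed = ["how to reset the router", "how to reset the modem"]
print(novelty("how to reset the router", viewed))   # ~0.0: redundant
print(novelty("replace the network cable", viewed)) # ~0.78: mostly new
```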

For which areas can information gain be used in search engines?

Information gain plays a crucial role in several areas within search engines to enhance the retrieval and ranking of relevant documents. Here are the key areas where information gain is utilized:

  1. Ranking Search Results: Information gain can help rank the search results by evaluating how much new or additional information a document provides compared to already viewed documents. This makes the search results more relevant and informative for the user.
  2. Filtering Redundant Information: By identifying and promoting documents with high information gain, search engines can filter out redundant documents. This helps in presenting the user with more diverse and comprehensive information.
  3. Personalizing Recommendations: Information gain can be used to personalize search results based on a user’s previous interactions, ensuring that newly presented documents add value and knowledge rather than reiterating what the user has already seen.

Examples for using information gain in information retrieval

The concept of information gain can be used in different kinds of search engines and recommendation engines.

The information gain score helps identify and present documents likely to enhance the user’s knowledge on a topic. For example, if a user is troubleshooting a computer issue, documents previously viewed by the user might cover common software solutions. New documents would be scored based on how much additional, unique information they present. A document describing hardware fixes might receive a higher score if that content wasn’t covered before. The goal is to rank and present documents based on their potential to provide new, valuable information, thereby avoiding redundancy and improving the user experience.

An automated assistant interface displays a dialog session between a user and the assistant. The interface shows turns of conversation where the assistant presents information extracted from documents according to their information gain scores, thus enhancing the user interaction.

How is information gain connected with phrase-based indexing?

Information gain is closely connected with phrase-based indexing in search engines as both concepts aim to improve the relevance and accuracy of search results.

Phrase-Based Indexing

Phrase-based indexing is a technique used by search engines to improve the retrieval of relevant documents by indexing phrases instead of individual words. This method helps in understanding the context and semantics of user queries more accurately. Key aspects include:

  1. Phrase Detection:
    • Identifying and indexing common phrases and multi-word expressions from documents.
    • Phrases are more informative than single words because they capture the context and meaning better.
  2. Phrase Weighting:
    • Assigning weights to phrases based on their importance and frequency.
    • Commonly used and highly relevant phrases are given higher weights in the indexing process.
  3. Contextual Understanding:
    • By focusing on phrases, search engines can better understand the context of a query, leading to more relevant search results.
    • Phrases help in distinguishing between different meanings of the same word used in different contexts.

Connection Between Information Gain and Phrase-Based Indexing

Information gain and phrase-based indexing are closely intertwined in improving the relevance and effectiveness of search engines. Here’s how they connect, based on the documents:

1. Identification of Good Phrases Using Information Gain

Information gain is used as a predictive measure to identify good phrases from a large corpus. A phrase is considered good if it frequently co-occurs with other significant phrases beyond what is expected by chance. This helps in creating a refined list of phrases that are truly relevant and useful.

  • Co-Occurrence and Prediction: For each phrase, the system calculates the expected co-occurrence rate with other phrases and compares it with the actual co-occurrence rate. If the actual rate exceeds a threshold, the phrase is considered to have significant information gain and is retained in the good phrase list.
  • Thresholds: Typically, an information gain threshold between 1.1 and 1.7 is used to filter out unrelated phrases and ensure that only meaningful connections are kept.
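
A minimal sketch of this check, with invented corpus counts (the actual/expected ratio and the threshold range follow the description above):

```python
def phrase_info_gain(docs_with_both, docs_with_phrase, docs_with_related, total_docs):
    """Actual co-occurrence rate of a related phrase in documents containing
    the candidate phrase, divided by the related phrase's expected
    (background) rate across the whole corpus."""
    actual = docs_with_both / docs_with_phrase
    expected = docs_with_related / total_docs
    return actual / expected

THRESHOLD = 1.5  # somewhere in the 1.1-1.7 range mentioned above

# Invented counts: the related phrase co-occurs in 40 of the 200 documents
# containing the candidate phrase, but appears in only 500 of 100,000 overall.
ig = phrase_info_gain(40, 200, 500, 100_000)
print(ig >= THRESHOLD)  # True: the phrase pair is kept
```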

2. Pruning and Clustering Based on Information Gain

Clusters of related phrases are identified based on high information gain values. Phrases within a cluster are related to each other and share significant informational relationships, which helps organize data for better search and retrieval efficiency. After identifying good phrases, the system further refines the list by removing phrases that do not predict other good phrases or that are merely extensions of other phrases.

  • Pruning Incomplete Phrases: Incomplete phrases that only predict their extensions are removed to ensure that only phrases providing substantial information gain remain. For example, “President of” would be pruned unless it predicts other unique phrases beyond its extensions like “President of the United States”.
  • Clustering Related Phrases: Phrases are clustered based on high information gain between them. This helps in forming semantically meaningful groups of phrases that are often used together, enhancing the contextual relevance of search results.

3. Enhancing Search Results Using Phrase Extensions

Phrase-based indexing leverages the information gain of phrases to improve search results by suggesting or automatically searching for phrase extensions.

  • Query Expansion: When a user enters a partial phrase, the search system can use the highest information gain extensions of that phrase to suggest or perform the search. For example, a query for “President of the United” can automatically suggest “President of the United States”.
  • Reducing Ambiguity: By using phrases with high information gain, the system reduces ambiguity and improves the accuracy of the search results, ensuring that users find the most relevant documents.
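
The query-expansion step can be sketched as follows; the phrases and their stored information-gain scores are hypothetical:

```python
def best_extension(partial_phrase, extension_scores):
    """Return the known extension of a partial phrase with the highest
    stored information gain, or None if no extension matches."""
    candidates = {p: score for p, score in extension_scores.items()
                  if p.startswith(partial_phrase) and p != partial_phrase}
    return max(candidates, key=candidates.get) if candidates else None

# Hypothetical information-gain scores from the phrase index
scores = {
    "president of the united states": 12.4,
    "president of the united nations": 3.1,
}
print(best_extension("president of the united", scores))
# president of the united states
```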

4. Document Annotation and Ranking

Information gain is used to annotate documents with related phrases, which improves the ranking and relevance of search results.

  • Annotation: Documents are annotated with counts and vectors of related phrases, helping the search engine to understand the primary and secondary topics of the document. This structured data is used to rank documents more effectively based on their relevance to the query.
  • Ranking by Related Phrases: The documents are ranked not just by the occurrence of query phrases but also by the presence of related phrases with high information gain. This multi-layered approach ensures that documents are ranked higher if they cover the topic more comprehensively.

Implications for SEO

From the Google patents examined, it can be concluded that information gain is a method geared towards the individual user, providing them with continually new information on a topic and avoiding redundancy.

The common opinion in the SEO industry, however, is that information gain is a user-independent ranking factor. In the end, the aim is to satisfy the individual user with new information on a topic relative to their previously acquired knowledge.

For SEO, this means that you should not only gather information from the content that has previously ranked in the top positions, but also provide new, unique information. In addition, content should always be supplemented with new, unique information in order to maintain the information gain.

Simply curating content from the top-ranking documents will not create any information gain.

To ensure that your own content offers as many users as possible an information gain, you must draw on your own experiences and also predict what information could be new to users on a topic in the future.

Some TF-IDF tools offer the option of displaying unique terms in addition to the proof terms; these can serve as a reference for aspects that ensure the uniqueness of the information.

User surveys can also provide clues as to which information is not yet covered by the documents that have been ranked so far.

Since today’s Google ranking systems are no longer only term-based, but also use sentences and entire paragraphs for a better understanding through a larger context window, TF-IDF analyses are not optimal. SEOs should also take care to structure texts clearly and use semantically related terms in the same neighborhood. This creates sections with a high salience for the respective topic.

About Olaf Kopp

Olaf Kopp is Co-Founder, Chief Business Development Officer (CBDO) and Head of SEO & Content at Aufgesang GmbH. He is an internationally recognized industry expert in semantic SEO, E-E-A-T, modern search engine technology, content marketing and customer journey management. As an author, Olaf Kopp writes for national and international magazines such as Search Engine Land, t3n, Website Boosting, Hubspot, Sistrix, Oncrawl, Searchmetrics, Upload … . In 2022 he was Top contributor for Search Engine Land. His blog is one of the most famous online marketing blogs in Germany. In addition, Olaf Kopp is a speaker for SEO and content marketing SMX, CMCx, OMT, OMX, Campixx...

