Author: Olaf Kopp
Reading time: 7 Minutes

Classifying documents using multiple classifiers



This patent outlines a method and system for classifying documents by combining scores from multiple classifiers. Each classifier evaluates a document against a specified property (e.g., the relevance of content to a certain topic), generating scores that reflect the likelihood of the document having that property. A model then combines these scores using monotonic regression, ultimately classifying the document based on the aggregated score.

  • Patent ID: US8713007B1
  • Countries Published: United States
  • Last Publishing Date: April 29, 2014
  • Assignee: Google Inc., Mountain View, CA (US)
  • Inventors: Dmitry Korolev, Hartmut Maennel

Background

The background section of the patent elaborates on the challenges and existing methods related to classifying documents based on certain properties, such as content relevance to specific topics or the presence of undesirable material like pornography or violence. The document discusses:

  1. Necessity for Classification: It highlights the importance of classifying documents (e.g., web pages, sites) to manage and filter content effectively. For instance, financial websites might want to prominently display documents with detailed business performance, while educational or family-oriented services might seek to exclude explicit content.
  2. Existing Classification Techniques: The background describes various approaches to document classification, including:
    • Manual Classification: Using human raters to categorize documents based on the presence of specified properties. While accurate, this method is impractical for the vast number of documents on the web due to its time-consuming nature.
    • Automatic Classifiers: Employing automated systems to identify documents likely containing specified properties by examining content types (text, images, etc.). These systems often lack the confidence level needed for decisive action, especially at a granular level (e.g., individual pages within a site).
  3. Challenges in Classification: The text details the inherent difficulties in classifying documents accurately at scale. It mentions the inadequacy of conventional automatic classifiers in providing sufficiently confident likelihoods of a document having a particular property. This challenge is compounded when actions based on classification (like content filtering) must be taken with high confidence, especially when they affect entire sites or large collections of content.
  4. Site-level vs. Page-level Classification: An important consideration is whether to classify at the site level or the individual page level. Site-level actions affect all pages of a site and thus require very high confidence, while page-level classification can be more granular but is challenging due to the lesser amount of information available for each page.

Claims

The patent US8713007B1 encompasses claims centered around methods, systems, and apparatuses for classifying documents using multiple classifiers. The claims detail a structured approach to evaluate and categorize documents based on their likelihood of possessing specified properties. Here’s a concise summary of the key claims:

  • Classifying Documents: The patent claims methods for classifying a collection of documents by applying multiple classifiers to each document. Each classifier provides a score reflecting the likelihood of the document having a specified property. These scores are combined using a model that employs monotonic regression, resulting in a classification based on the aggregated score.

The patent US8713007B1 mentions several types of classifiers to illustrate the diverse approaches that can be used in the process of classifying documents based on specified properties. The concrete classifiers mentioned include:

    1. Text Classifiers: These classifiers evaluate the text content of a document to determine a likelihood that the document is relevant to a specified property. For instance, a text classifier might analyze the presence and frequency of specific keywords or phrases associated with the property of interest.
    2. Image Classifiers: Image classifiers assess the image content within a document. This could involve analyzing images for certain features or patterns indicative of the specified property. For example, in the context of filtering explicit content, an image classifier might look for visual cues typical of such material.
    3. Additional Classifiers:
      • Title Text Classifiers: This type of classifier focuses specifically on the title of a document or web page, evaluating it for relevance to the specified property.
      • URL Classifiers: URL classifiers examine the structure and content of a document’s URL to infer the likelihood of the document having the specified property. For example, certain keywords or patterns in a URL might suggest a document’s content type or subject matter.

These classifiers work by examining different aspects of a document (e.g., text, images, URL) for indicators of the specified property. Each classifier generates a score reflecting the document’s likelihood of having the property, and these scores are then combined through a multiple classifier model to improve classification accuracy and reliability.
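The per-aspect classifiers described above can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the function names, the keyword lists, and the gambling-content example are all assumptions made for the sake of a runnable example.

```python
# Hypothetical sketch: several per-aspect classifiers each score a document,
# and the scores are collected for later combination by the model.

def text_score(text: str) -> float:
    """Score from keyword frequency in the body text (example property: gambling content)."""
    keywords = {"casino", "poker", "betting"}  # illustrative keyword list
    words = text.lower().split()
    hits = sum(1 for w in words if w in keywords)
    return min(1.0, hits / 10)  # cap the score at 1.0

def title_score(title: str) -> float:
    """Score from the document title alone."""
    return 1.0 if "casino" in title.lower() else 0.0

def url_score(url: str) -> float:
    """Score from patterns in the document's URL."""
    return 1.0 if any(tok in url.lower() for tok in ("casino", "bet")) else 0.0

def classifier_scores(doc: dict) -> list[float]:
    """Run every classifier and return one likelihood score per classifier."""
    return [
        text_score(doc["text"]),
        title_score(doc["title"]),
        url_score(doc["url"]),
    ]

doc = {
    "title": "Best Casino Bonuses",
    "url": "https://example.com/casino/bonus",
    "text": "Play poker and casino games with free betting credit.",
}
print(classifier_scores(doc))  # one score per classifier, each in [0, 1]
```

Each classifier only sees one aspect of the document, which is exactly why no single score is trusted on its own and the scores are combined downstream.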

It’s important to note that while the patent outlines these specific types of classifiers as examples, the broader methodology it describes is applicable to a wide range of classifiers beyond those explicitly mentioned.

  • Document Collection: Claims include the steps of identifying and receiving documents for classification, underlining the initial phase of gathering the target documents to be classified.

Identifying Documents for Classification

    • Document Scope: The scope of documents can vary widely, encompassing web pages, digital texts, multimedia content, and more, depending on the classification objectives. The patent suggests that any digital document or content that can be evaluated for certain properties falls within the scope of collection.
    • Specified Property: The process begins by defining the specified property or properties of interest. These properties can be topics, content types, quality indicators, or other relevant characteristics that warrant classification.

Collection Methodology

    • Automated Crawling and Aggregation: Tools and systems may automatically crawl digital spaces, such as the internet or internal databases, to aggregate documents relevant to the specified property. This might involve using search algorithms, APIs, or other data access methods to gather a wide range of potential documents.
    • Filtering and Pre-Selection: Initial filtering criteria can be applied to ensure the collected documents are within the scope of interest. This pre-selection phase helps streamline the classification process by excluding documents that clearly do not meet basic relevance criteria.
    • Sampling: In some cases, especially when dealing with vast amounts of data, sampling methods may be used to select a representative subset of documents from the larger collection. This approach allows for more manageable processing and analysis without significantly compromising the diversity and coverage of the document set.
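The filtering and sampling steps above might be sketched like this. The function name, the minimum-length relevance criterion, and the sample size are assumptions for illustration only.

```python
import random

# Minimal sketch: pre-filter a crawled collection by a basic relevance
# criterion, then draw a random sample to keep processing manageable.

def collect_sample(documents, min_length=50, sample_size=100, seed=42):
    """Filter out clearly out-of-scope documents, then sample a subset."""
    in_scope = [d for d in documents if len(d.get("text", "")) >= min_length]
    rng = random.Random(seed)  # seeded for reproducibility
    if len(in_scope) <= sample_size:
        return in_scope
    return rng.sample(in_scope, sample_size)

docs = [{"text": "x" * n} for n in range(0, 200, 10)]  # lengths 0, 10, ..., 190
subset = collect_sample(docs, min_length=50, sample_size=5)
print(len(subset))  # 5
```

In practice the pre-filter would be a real relevance criterion rather than a length check; the point is only that cheap filtering happens before any expensive classification.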

Preparing for Classification

    • Pre-Processing: Collected documents often undergo pre-processing to prepare them for classification. This can include format normalization, text extraction, metadata tagging, and other preparation steps that facilitate more effective analysis by the classifiers.
    • Annotation and Training Data Generation: For some applications, especially those involving machine learning classifiers, a portion of the document collection may be manually annotated to serve as training data. This involves assigning classification labels to documents based on human assessment, which can then be used to train or refine the classification algorithms.

Figure 3 outlines a process for identifying a subset of documents to be used as training data in developing a model for document classification. The process aims to select documents that are representative of various classifications to ensure the model is well-trained across the spectrum of possible document properties. The steps include:

      1. Identify Group of Documents and Associated Classifier Scores (300): This step involves gathering a large set of documents and obtaining scores from multiple classifiers for each document. These scores reflect the likelihood of documents having certain properties.
      2. Linearize Probability Distribution for Documents (302): Classifier scores are used to create a linear probability distribution, arranging documents in a manner that reflects their varying probabilities of possessing the specified property.
      3. Bucket Documents Based on Linear Probability Distribution (304): Documents are then grouped into buckets or segments based on their probability distribution. This segmentation helps in handling documents in a structured way, facilitating further processing.
      4. Iterate to Satisfy Constraints on Training Documents (306): The document buckets are iteratively refined to meet certain constraints, such as ensuring a balanced representation of documents across different probability ranges or properties.
      5. Select Training Documents According to Bucket (308): From each bucket, a subset of documents is selected as training data. These documents are chosen based on their ability to provide a comprehensive view of the variety of document properties being considered.
      6. Use Training Documents to Generate Model (310): The selected training documents are then used to develop or refine the classification model. This model will be capable of accurately classifying new documents based on their scores from multiple classifiers.
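The bucketing and selection steps (304 through 308) above can be sketched as follows. Details such as equal-width probability buckets and a fixed per-bucket draw are assumptions; the patent does not prescribe these exact choices.

```python
import random

# Sketch of the training-data selection flow: rank documents by a
# preliminary probability, split them into equal-width probability
# buckets, and draw an equal number from each bucket so that all
# probability ranges are represented in the training data.

def select_training_docs(docs_with_scores, n_buckets=4, per_bucket=2, seed=0):
    """docs_with_scores: list of (doc_id, probability) pairs."""
    buckets = [[] for _ in range(n_buckets)]
    for doc_id, p in docs_with_scores:
        idx = min(int(p * n_buckets), n_buckets - 1)  # bucket by probability range
        buckets[idx].append(doc_id)
    rng = random.Random(seed)
    training = []
    for bucket in buckets:
        k = min(per_bucket, len(bucket))  # constraint: cap the per-bucket draw
        training.extend(rng.sample(bucket, k))
    return training

scored = [(f"doc{i}", i / 20) for i in range(20)]  # probabilities 0.0 .. 0.95
print(select_training_docs(scored))  # ids spread across all four probability ranges
```

The per-bucket cap stands in for the iterative constraint-satisfaction of step 306: it prevents any one probability range from dominating the training set.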
  • Score Combination: A critical aspect of the claims is the methodology for combining individual scores from each classifier. This includes applying a model that leverages monotonic regression to amalgamate the scores, enhancing the reliability and accuracy of the final classification.
  • Final Classification Based on Combined Score: The claims detail how the combined score is used to classify the document, including setting thresholds or criteria based on which a document is deemed to have the specified property.

Combining Classifier Scores

    • Multiple Classifiers: The system employs various classifiers, each assessing the document against a specified property from different perspectives (e.g., text content, images, URLs). Each classifier contributes a score indicating the likelihood that the document possesses the specified property.
    • Score Aggregation Method: The core of the final classification process involves aggregating these individual scores into a single, combined score. This aggregation is achieved through a model that applies monotonic regression. The method ensures that the combined score accurately reflects the overall likelihood, derived from all classifiers, that the document has the specified property.

Principles of Monotonic Regression

    • Monotonic Relationship: The aggregation model assumes a monotonic relationship between the individual scores and the likelihood of the document having the specified property. This means that higher scores from the classifiers should not decrease the combined score, aligning with the intuition that more evidence (higher scores) increases the likelihood of the property being present.
    • Regression Model: The model strategically combines the scores, possibly weighting them differently, to optimize the predictive accuracy of the combined score. It mathematically models how each score contributes to the final decision, ensuring that the combined score is a reliable indicator of the document’s classification.
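Monotonic (isotonic) regression is commonly fitted with the pool-adjacent-violators algorithm (PAVA). The following is a generic sketch of that algorithm, not the patent's specific model: it fits a non-decreasing sequence to observed probabilities, pooling any segment that violates monotonicity.

```python
# Pool-adjacent-violators (PAVA) sketch of isotonic regression: given
# observed values at increasing classifier scores, return the closest
# (least-squares) non-decreasing sequence.

def isotonic_fit(ys):
    """Return the non-decreasing sequence closest to ys in least squares."""
    # Each block holds [sum, count]; merge adjacent blocks whose means
    # violate the non-decreasing constraint.
    blocks = []
    for y in ys:
        blocks.append([y, 1])
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)  # every member of a block gets the block mean
    return fitted

# Observed fractions of "property present" at increasing classifier scores;
# the dip at position 2 violates monotonicity and gets pooled away.
print(isotonic_fit([0.1, 0.4, 0.3, 0.7, 0.9]))
```

After fitting, the dip (0.4 followed by 0.3) is replaced by their pooled mean, so a higher classifier score never maps to a lower combined likelihood, which is exactly the monotonicity property described above.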

Thresholds and Decision Making

    • Setting Thresholds: The final step in classification involves comparing the combined score against predefined thresholds. These thresholds are set based on the desired confidence levels for classifying a document as having or not having the specified property.
    • Binary Classification: The process effectively results in a binary classification outcome—either the document is classified as having the specified property if the combined score crosses the threshold or as not having it if the score falls short.
    • Adaptive Thresholds: The system can adjust thresholds to manage the trade-offs between precision and recall in classification, depending on the specific application or context in which the classification system is employed.
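The threshold mechanics above can be made concrete with a small sketch. The scores, labels, and threshold values are made-up illustration data; the point is how raising the threshold trades recall for precision.

```python
# Sketch: binary classification against a threshold, plus precision and
# recall at different thresholds to show the trade-off.

def classify(score: float, threshold: float) -> bool:
    """A document 'has the property' when its combined score meets the threshold."""
    return score >= threshold

def precision_recall(scores, labels, threshold):
    predicted = [classify(s, threshold) for s in scores]
    tp = sum(p and y for p, y in zip(predicted, labels))          # true positives
    fp = sum(p and not y for p, y in zip(predicted, labels))      # false positives
    fn = sum((not p) and y for p, y in zip(predicted, labels))    # false negatives
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

scores = [0.95, 0.8, 0.6, 0.4, 0.2]             # combined scores per document
labels = [True, True, False, True, False]       # ground truth: has the property?
for t in (0.5, 0.9):
    p, r = precision_recall(scores, labels, t)
    print(f"threshold={t}: precision={p:.2f} recall={r:.2f}")
```

At the lower threshold more documents are flagged (higher recall, more false positives); at the higher threshold only confident cases are flagged (perfect precision here, but a true positive is missed). This is the adaptive-threshold trade-off the patent describes, which matters most for site-level actions requiring very high confidence.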

Application and Use-Cases

    • Content Filtering: In applications like search engines or content management systems, the final classification can dictate whether a document is displayed, highlighted, or filtered out based on user settings or preferences.
    • Information Retrieval: The classification outcomes can enhance information retrieval systems by enabling more nuanced search results, prioritizing documents that strongly match the user’s query and context.
    • Dynamic Learning: Over time, the system can learn from feedback (e.g., user interactions, manual reviews) to refine its classifiers and thresholds, improving the accuracy and reliability of document classification.
  • Generating and Using Document Lists: Some claims focus on creating lists of documents classified as having (or not having) the specified property. These lists can then be used in various applications, such as filtering search results or curating content for specific purposes.
  • Search and Information Retrieval Applications: The claims extend to the use of classified documents in search and information retrieval settings, where documents can be filtered or prioritized based on their classification concerning the specified property.
