Author: Olaf Kopp
Reading time: 3 Minutes

Training machine learning models using unsupervised data augmentation


The Google patent describes a machine learning model training system that uses both labeled and unlabeled training data through semi-supervised learning. The system generates augmented training data by applying data augmentation techniques to unlabeled inputs. It then trains the model using both supervised learning (with labeled data) and unsupervised learning (with unlabeled and augmented data). The key innovation is effectively incorporating readily available unlabeled data to improve model performance while requiring fewer labeled examples. The system can be applied to various tasks like image classification, natural language processing, and speech recognition, using different types of neural network architectures appropriate for each task type.

  • Patent ID: US12118064B2
  • Assignee: Google LLC
  • Filing date: April 24, 2020
  • Priority date: April 25, 2019
  • Notice of Allowance date: June 13, 2024
  • Inventors: Thang Minh Luong, Quoc V. Le, Qizhe Xie
  • Countries covered: United States, Europe, Japan, China

Background

The background discusses training machine learning models through semi-supervised learning, using both labeled and unlabeled training data. The system trains models by receiving inputs and generating predicted outputs based on model parameters. Neural networks, including recurrent neural networks like LSTM, use multiple layers to process inputs and generate outputs. The background highlights the importance of training models effectively while dealing with both labeled data (with known ground truth outputs) and unlabeled data (without known outputs). This sets up the context for the invention’s approach to improving model training through data augmentation techniques.

Claims

The patent describes a machine learning model training system with the following key claims:

  1. Training data includes both labeled and unlabeled inputs:
  • Labeled inputs have associated ground truth outputs
  • Unlabeled inputs do not have associated ground truth outputs
  2. Data augmentation process:
  • Generates augmented training inputs from unlabeled inputs
  • Applies specific augmentation techniques based on input type (images, text, etc.)
  3. Two-part training objective:
  • Unsupervised objective: minimizes the difference between model outputs for original and augmented unlabeled inputs
  • Supervised objective: minimizes the difference between model outputs and ground truth for labeled inputs
  4. Applications include:
  • Computer vision tasks (image classification)
  • Natural language processing
  • Document classification
  • Healthcare (patient diagnosis)
  5. Key advantages:
  • Improves model performance without requiring additional labeled data
  • More cost-effective than obtaining new labeled training data
  • Can leverage readily available unlabeled data
  • Enables effective semi-supervised learning
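The two-part objective above can be sketched in a few lines. This is a minimal illustration, not the patent's implementation: it assumes cross-entropy as the supervised loss and KL divergence as the consistency (unsupervised) loss, with a `weight` hyperparameter balancing the two.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(probs, labels):
    # Supervised objective: negative log-likelihood of the ground-truth class.
    return -np.log(probs[np.arange(len(labels)), labels]).mean()

def consistency_loss(p_orig, p_aug):
    # Unsupervised objective: KL divergence between the model's predictions
    # on the original unlabeled input and on its augmented version.
    return (p_orig * (np.log(p_orig) - np.log(p_aug))).sum(axis=-1).mean()

def combined_loss(labeled_logits, labels, unlabeled_logits, augmented_logits,
                  weight=1.0):
    # Total objective = supervised term + weighted unsupervised term.
    sup = cross_entropy(softmax(labeled_logits), labels)
    unsup = consistency_loss(softmax(unlabeled_logits), softmax(augmented_logits))
    return sup + weight * unsup
```

When the model gives identical outputs for an unlabeled input and its augmented version, the consistency term vanishes and only the supervised term remains, which is the intended equilibrium of the unsupervised objective.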

The document provides several real-world examples of applications and methodologies:

  1. Health prediction tasks using electronic health record data to predict:
  • Patient diagnoses
  • Possible future health events
  2. Speech processing tasks using audio data to determine:
  • Language classification
  • Hotword detection
  3. Image processing applications using convolutional neural networks for:
  • Image classification
  • Object detection/categorization
  • Visual data processing
  4. Text processing applications using:
  • Back translation techniques for text augmentation
  • TF-IDF based word replacement methods
  • Natural language processing tasks

How does the confidence threshold scheduling work?

The confidence threshold scheduling is a technique used during model training to prevent overfitting to limited labeled data while still learning from unlabeled data. Here’s how it works:

  1. The system uses a modified supervised objective that only includes labeled examples where the model’s confidence (probability) for the correct output is below a threshold.
  2. When the model’s confidence for a labeled example exceeds the threshold, that example is removed from the loss function (set to zero).
  3. The threshold increases gradually during training, typically starting from 1/K (where K is the number of output categories) and increasing to 1.
  4. The threshold can increase according to different schedules:
  • Logarithmic schedule
  • Linear schedule
  • Exponential schedule

The exponential schedule is particularly useful when the model is prone to overfitting, as it releases most of the supervised signal near the end of training.
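The threshold scheduling described above can be sketched as follows. The three schedule shapes are named in the patent; the specific scaling constant (5) and the exact curve formulas follow the common Unsupervised Data Augmentation (UDA) formulation and should be read as assumptions, not the patent's exact equations.

```python
import math

def tsa_threshold(step, total_steps, num_classes, schedule="linear"):
    """Confidence threshold: starts near 1/K and grows toward 1 over training."""
    t = step / total_steps
    if schedule == "linear":
        alpha = t
    elif schedule == "log":
        alpha = 1 - math.exp(-t * 5)   # grows fast early, releases signal early
    elif schedule == "exp":
        alpha = math.exp((t - 1) * 5)  # stays low, releases most signal at the end
    else:
        raise ValueError(f"unknown schedule: {schedule}")
    return alpha * (1 - 1 / num_classes) + 1 / num_classes

def masked_supervised_loss(confidences, losses, threshold):
    # Zero out labeled examples the model is already confident about:
    # only examples with confidence below the threshold contribute.
    kept = [l for c, l in zip(confidences, losses) if c < threshold]
    return sum(kept) / max(len(kept), 1)
```

With `schedule="exp"`, the threshold stays close to 1/K for most of training, so confidently predicted labeled examples are masked out longest, matching the description that this schedule suits models prone to overfitting.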

What data augmentation techniques are used for text inputs?

Based on the patent document, several text data augmentation techniques are described:

  1. Back translation technique – translating text from language A to language B and then back to A to obtain augmented examples. This can be applied to randomly selected words in the input text.
  2. TF-IDF based word replacement technique – replacing words in the text based on TF-IDF (Term Frequency-Inverse Document Frequency) analysis, so that informative, high-scoring words are kept while uninformative ones may be swapped.
  3. Simple augmentations – These are basic transformations that can be applied to improve training robustness, though they are different from the main augmentation policy.

The document emphasizes that the specific data augmentation technique used depends on the type of input the machine learning model operates on. For text inputs specifically, back translation and TF-IDF word replacement are highlighted as the main techniques.
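The TF-IDF replacement idea can be sketched as below. This is an illustrative toy, not the patent's procedure: the median cutoff, the `replace_prob` parameter, and drawing replacements uniformly from a vocabulary are all assumptions made for the example.

```python
import math
import random
from collections import Counter

def tfidf(docs):
    """Per-document TF-IDF scores for a small corpus (list of token lists)."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # document frequency per word
    n = len(docs)
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({w: (tf[w] / len(doc)) * math.log(n / df[w]) for w in tf})
    return scores

def augment(doc, doc_scores, vocab, replace_prob=0.3, rng=random):
    """Replace low TF-IDF words with random vocabulary words, keeping
    high-scoring (keyword-like) words intact."""
    cutoff = sorted(doc_scores.values())[len(doc_scores) // 2]  # median score
    out = []
    for w in doc:
        if doc_scores[w] < cutoff and rng.random() < replace_prob:
            out.append(rng.choice(vocab))   # uninformative word: swap it
        else:
            out.append(w)                   # informative word: preserve it
    return out
```

Words that appear in every document score zero (their inverse document frequency is log 1 = 0) and are the first candidates for replacement, while rare, distinctive words survive augmentation, which is exactly the property that makes the augmented text a plausible paraphrase for consistency training.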

Implications for SEO

The patent discusses TF-IDF based word replacement, indicating SEO strategies could focus on identifying and preserving high TF-IDF value keywords while allowing more flexibility with low-value terms in content optimization.
