Training machine learning models using unsupervised data augmentation
Topics: AI (Deep Learning), Document Classification, LLMO, Probably in use, Retrieval Augmented Generation (RAG)
The Google patent describes a machine learning model training system that uses both labeled and unlabeled training data through semi-supervised learning. The system generates augmented training data by applying data augmentation techniques to unlabeled inputs. It then trains the model using both supervised learning (with labeled data) and unsupervised learning (with unlabeled and augmented data). The key innovation is effectively incorporating readily available unlabeled data to improve model performance while requiring fewer labeled examples. The system can be applied to various tasks like image classification, natural language processing, and speech recognition, using different types of neural network architectures appropriate for each task type.
- Patent ID: US12118064B2
- Assignee: Google LLC
- Filing date: April 24, 2020
- Priority date: April 25, 2019
- Notice of Allowance date: June 13, 2024
- Inventors: Thang Minh Luong, Quoc V. Le, Qizhe Xie
- Countries covered: United States, Europe, Japan, China
Background
The background discusses training machine learning models through semi-supervised learning, using both labeled and unlabeled training data. The system trains models by receiving inputs and generating predicted outputs based on model parameters. Neural networks, including recurrent neural networks like LSTM, use multiple layers to process inputs and generate outputs. The background highlights the importance of training models effectively while dealing with both labeled data (with known ground truth outputs) and unlabeled data (without known outputs). This sets up the context for the invention’s approach to improving model training through data augmentation techniques.
Claims
The patent describes a machine learning model training system with the following key claims:
- Training data includes both labeled and unlabeled inputs
- Labeled inputs have ground truth outputs
- Unlabeled inputs don’t have associated ground truth outputs
- Data augmentation process:
  - Generates augmented training inputs from unlabeled inputs
  - Applies specific augmentation techniques based on input type (images, text, etc.)
- Two-part training objective (see the sketch after this list):
  - Unsupervised objective: Minimizes difference between model outputs for original and augmented unlabeled inputs
  - Supervised objective: Minimizes difference between model outputs and ground truth for labeled inputs
- Applications include:
  - Computer vision tasks (image classification)
  - Natural language processing
  - Document classification
  - Healthcare (patient diagnosis)
- Key advantages:
  - Improves model performance without requiring additional labeled data
  - More cost-effective than obtaining new labeled training data
  - Can leverage readily available unlabeled data
  - Enables effective semi-supervised learning
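A minimal sketch of how such a two-part objective can be combined, assuming a PyTorch classifier whose forward pass returns logits; the function name, the KL-divergence consistency term, and the `lambda_u` weighting are illustrative choices for this example, not the patent’s exact formulation:

```python
import torch
import torch.nn.functional as F

def semi_supervised_loss(model, labeled_x, labels, unlabeled_x, augmented_x, lambda_u=1.0):
    """Combined objective: supervised cross-entropy on labeled data plus an
    unsupervised consistency term between original and augmented unlabeled
    inputs (illustrative formulation)."""
    # Supervised objective: standard cross-entropy on labeled examples.
    sup_logits = model(labeled_x)
    sup_loss = F.cross_entropy(sup_logits, labels)

    # Unsupervised objective: predictions on augmented inputs should match
    # the (fixed) predictions on the corresponding original unlabeled inputs.
    with torch.no_grad():
        target_probs = F.softmax(model(unlabeled_x), dim=-1)
    aug_log_probs = F.log_softmax(model(augmented_x), dim=-1)
    unsup_loss = F.kl_div(aug_log_probs, target_probs, reduction="batchmean")

    return sup_loss + lambda_u * unsup_loss
```

In this setup the unsupervised term lets large amounts of unlabeled data regularize the model, while the supervised term anchors it to the available labels.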
The document provides several real-world examples of applications and methodologies:
- Health prediction tasks using electronic health record data to predict:
  - Patient diagnoses
  - Possible future health events
- Speech processing tasks using audio data to determine:
  - Language classification
  - Hotword detection
- Image processing applications using convolutional neural networks for:
  - Image classification
  - Object detection/categorization
  - Visual data processing
- Text processing applications using:
  - Back translation techniques for text augmentation
  - TF-IDF based word replacement methods
  - Natural language processing tasks
How does the confidence threshold scheduling work?
The confidence threshold scheduling is a technique used during model training to prevent overfitting to limited labeled data while still learning from unlabeled data. Here’s how it works:
- The system uses a modified supervised objective that only includes labeled examples where the model’s confidence (probability) for the correct output is below a threshold.
- When the model’s confidence for a labeled example exceeds the threshold, that example’s contribution to the loss function is set to zero.
- The threshold increases gradually during training, typically starting from 1/K (where K is the number of output categories) and increasing to 1.
- The threshold can increase according to different schedules:
- Logarithmic schedule
- Linear schedule
- Exponential schedule
The exponential schedule is particularly useful when the model is prone to overfitting, as it releases most of the supervised signal near the end of training.
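A rough sketch of what this kind of threshold annealing could look like in code, assuming a PyTorch classifier; the schedule shapes and constants below are illustrative interpretations of the logarithmic/linear/exponential options described above, not the patent’s exact formulas:

```python
import math
import torch
import torch.nn.functional as F

def confidence_threshold(step, total_steps, num_classes, schedule="linear"):
    """Anneal the confidence threshold from 1/K toward 1 over training."""
    t = step / max(total_steps, 1)
    if schedule == "linear":
        alpha = t
    elif schedule == "log":
        alpha = 1 - math.exp(-5 * t)      # releases most supervised signal early
    elif schedule == "exp":
        alpha = math.exp(5 * (t - 1))     # releases most supervised signal late
    else:
        raise ValueError(f"unknown schedule: {schedule}")
    return 1 / num_classes + alpha * (1 - 1 / num_classes)

def masked_supervised_loss(logits, labels, threshold):
    """Cross-entropy that zeroes out labeled examples the model already
    predicts with probability above the current threshold."""
    probs = F.softmax(logits, dim=-1)
    correct_prob = probs.gather(1, labels.unsqueeze(1)).squeeze(1)
    mask = (correct_prob < threshold).float()
    per_example = F.cross_entropy(logits, labels, reduction="none")
    return (per_example * mask).sum() / mask.sum().clamp(min=1.0)
```

With the exponential schedule, the threshold stays close to 1/K for most of training and only approaches 1 near the end, which matches the overfitting-prone case described above.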
What data augmentation techniques are used for text inputs?
Based on the patent document, several text data augmentation techniques are described:
- Back translation technique – translating text from language A to language B and then back to A to obtain augmented examples. This can be applied to randomly selected words in the input text.
- TF-IDF based word replacing technique – replacing words in the text based on TF-IDF (Term Frequency-Inverse Document Frequency) analysis.
- Simple augmentations – basic transformations that can be applied to improve training robustness, though they are distinct from the main augmentation policy.
The document emphasizes that the specific data augmentation technique used depends on the type of input the machine learning model operates on. For text inputs specifically, back translation and TF-IDF word replacement are highlighted as the main techniques.
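As an illustration of the TF-IDF based replacement idea, here is a small self-contained sketch that keeps high-TF-IDF (informative) words and swaps low-TF-IDF words for random vocabulary entries; the tokenization, scoring, and replacement fraction are assumptions made for this example, not the patent’s exact procedure:

```python
import math
import random
from collections import Counter

def tfidf_scores(doc_tokens, corpus):
    """Per-token TF-IDF for one tokenized document against a tokenized corpus
    (a list of token lists). Minimal illustration only."""
    tf = Counter(doc_tokens)
    n_docs = len(corpus)
    scores = {}
    for token, count in tf.items():
        df = sum(1 for doc in corpus if token in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        scores[token] = (count / len(doc_tokens)) * idf
    return scores

def tfidf_word_replacement(doc_tokens, corpus, vocab, replace_fraction=0.2):
    """Replace the lowest-TF-IDF (least informative) words with random
    vocabulary words, leaving high-TF-IDF keywords intact."""
    scores = tfidf_scores(doc_tokens, corpus)
    n_replace = max(1, int(len(doc_tokens) * replace_fraction))
    # Indices of the least informative tokens, sorted by ascending TF-IDF.
    candidates = sorted(range(len(doc_tokens)),
                        key=lambda i: scores[doc_tokens[i]])[:n_replace]
    augmented = list(doc_tokens)
    for i in candidates:
        augmented[i] = random.choice(vocab)
    return augmented
```

Back translation would typically be handled by a separate translation model (language A to B and back to A) rather than token-level substitution, so it is not sketched here.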
Implications for SEO
- The patent discusses TF-IDF based word replacement, suggesting SEO strategies could focus on identifying and preserving keywords with high TF-IDF values while allowing more flexibility with low-value terms during content optimization.