Analyzing Similarity Metrics for Data Selection for Language Model Pretraining
Topics: AI Mode, LLMO / GEO
This Google research paper introduces a framework for evaluating how well similarity metrics, specifically text embeddings, serve data curation for Language Model (LM) pretraining. The authors argue that generic, off-the-shelf embedding models, typically trained for tasks like retrieval, are often poorly suited to selecting high-quality and diverse pretraining data. They propose three evaluation criteria—correlation with pretraining loss (difficulty), utility in diversity-based data selection, and ability to distinguish between data sources—and demonstrate that simple, specialized embeddings derived from models trained on the same corpus significantly outperform more complex, general-purpose alternatives.
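To make the diversity criterion concrete, here is a minimal sketch of one common way embeddings can drive diversity-based selection: greedy farthest-point sampling under cosine similarity. This is an illustrative assumption, not the paper's actual selection algorithm; the function name and the toy data are invented for the example.

```python
import numpy as np

def select_diverse(embeddings: np.ndarray, k: int) -> list[int]:
    """Greedy farthest-point selection under cosine similarity:
    repeatedly pick the example least similar to its nearest
    already-selected neighbour, which discourages near-duplicates.
    Illustrative sketch only -- not the paper's exact method."""
    # Normalize rows so dot products are cosine similarities.
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    selected = [0]                    # seed with an arbitrary example
    max_sim = emb @ emb[0]            # similarity to nearest selected item
    max_sim[0] = np.inf               # never re-select chosen examples
    for _ in range(k - 1):
        nxt = int(np.argmin(max_sim))  # farthest from the selected set
        selected.append(nxt)
        max_sim = np.maximum(max_sim, emb @ emb[nxt])
        max_sim[selected] = np.inf
    return selected

# Toy corpus: the second vector nearly duplicates the first, so a
# diversity-aware selector should skip it.
docs = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [-1.0, 0.0]])
print(select_diverse(docs, 3))  # -> [0, 3, 2]; the near-duplicate (1) is skipped
```

Under this view, the paper's question becomes: which embedding space makes such a procedure pick genuinely useful, varied pretraining examples rather than superficially distinct ones?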
