What is BM25?
BM25 is a popular ranking function used in information retrieval systems to estimate the relevance of documents to a given search query. It belongs to a family of scoring functions known as probabilistic information retrieval models, which are based on the probabilistic relevance framework.
How BM25 Works:
BM25 calculates a score for each document relative to a specific query, where higher scores indicate a greater relevance of the document to the query. The score is based on the query terms appearing in each document, taking into account the frequency of each term in the document and across all documents in the collection. Here’s a breakdown of the main components of the BM25 formula:
- Term Frequency (TF): This reflects how often a query term appears in a document. More occurrences of the term usually suggest higher relevance.
- Inverse Document Frequency (IDF): This measures the informativeness of a term. If a term appears in many documents, it is less likely to be significant for determining relevance. The IDF component of BM25 penalizes terms that are too common across documents.
- Document Length Normalization: This aspect of BM25 adjusts for the length of the document. Longer documents may have higher term frequencies simply due to their length, so BM25 normalizes for this, preventing longer documents from inherently receiving higher scores unless they are more relevant.
The BM25 Formula:
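The formula for BM25 is as follows:
$$\text{score}(D, Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)}$$
where:
- $q_i$ is a query term,
- $f(q_i, D)$ is $q_i$'s term frequency in the document $D$,
- $|D|$ is the length of the document,
- avgdl is the average document length in the text collection,
- $k_1$ and $b$ are free parameters, usually chosen empirically (common values are $k_1$ between 1.2 and 2.0 and $b = 0.75$),
- $\text{IDF}(q_i)$ is the IDF for $q_i$.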
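The scoring can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function names `idf` and `bm25_score`, the whitespace tokenization, and the toy corpus are hypothetical, and the IDF variant used here, ln(1 + (N - n + 0.5) / (n + 0.5)) with N documents of which n contain the term, is one common choice that the text does not prescribe.

```python
import math
from collections import Counter


def idf(term, docs):
    """One common BM25 IDF variant (an assumption, not fixed by the article)."""
    n_docs = len(docs)
    # Number of documents that contain the term at least once.
    doc_freq = sum(1 for doc in docs if term in doc)
    return math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_score(query, doc, docs, k1=2.0, b=0.75):
    """Score one tokenized document against a tokenized query."""
    avgdl = sum(len(d) for d in docs) / len(docs)
    tf = Counter(doc)
    score = 0.0
    for term in query:
        f = tf[term]  # 0 if the term is absent from the document
        # Length-normalized, saturating term-frequency component.
        denom = f + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf(term, docs) * f * (k1 + 1) / denom
    return score


# Hypothetical toy corpus, tokenized by whitespace.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "dogs and cats are popular pets".split(),
]
query = "cat mat".split()

scores = [bm25_score(query, doc, corpus) for doc in corpus]
# Document indices ordered from most to least relevant.
ranking = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
```

In this toy example the first document matches both query terms and ranks first, the second matches only "cat", and the third matches neither term exactly and scores zero.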
Applications and Usage of BM25
BM25 is widely used in search engines and various information retrieval applications due to its effectiveness and efficiency. It is particularly well-regarded for its balance between simplicity and performance, making it a foundational component in many modern search systems, including those that use more complex machine learning models.
In summary, BM25 is a robust method for scoring documents based on their relevance to a query, efficiently balancing term frequency, document frequency, and document length.
Difference between BM25 and TF-IDF
The difference between BM25 (Best Matching 25) and TF-IDF (Term Frequency-Inverse Document Frequency) lies mainly in how they weight terms when scoring a document's relevance to a search query. Here are the main differences:
1. Calculation and Weighting of Terms
TF-IDF:
- Term Frequency (TF): Measures how often a term appears in a document. The more frequently a term appears, the higher its weighting.
- Inverse Document Frequency (IDF): Measures how rare a term is across the entire document collection. Rare terms have a higher weighting as they are considered more relevant.
The TF-IDF weighting of a term $t$ in a document $D$ is calculated as:
$$\text{TF-IDF}(t, D) = \text{TF}(t, D) \cdot \text{IDF}(t)$$
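A minimal sketch of a TF-IDF weight as the product of its two components follows. It assumes the raw term count for TF and ln(N / df) for IDF; many other variants exist, and the function name `tf_idf` and the toy corpus are hypothetical.

```python
import math


def tf_idf(term, doc, docs):
    """TF-IDF weight of a term in one tokenized document.

    Assumes raw counts for TF and ln(N / df) for IDF.
    """
    tf = doc.count(term)
    doc_freq = sum(1 for d in docs if term in d)
    if doc_freq == 0:
        return 0.0  # the term never occurs in the collection
    return tf * math.log(len(docs) / doc_freq)


# Hypothetical toy corpus, tokenized by whitespace.
corpus = [
    "search engines rank documents".split(),
    "search search search spam".split(),
    "documents about cooking".split(),
]
once = tf_idf("search", corpus[0], corpus)
thrice = tf_idf("search", corpus[1], corpus)
```

With raw counts the weight grows linearly: three occurrences of "search" give exactly three times the weight of one, which is the behavior BM25 dampens.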
BM25:
- BM25 is an extension of TF-IDF that introduces additional parameters to make the weighting more flexible and adaptive.
- BM25 uses a saturating function for Term Frequency (TF), reflecting that the relevance of a term does not increase linearly with its frequency: each additional occurrence contributes less than the previous one.
- BM25 also takes document length into account and normalizes for it, so that longer documents do not receive higher scores merely because of their length.
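The BM25 weighting of a single query term $q_i$ in a document $D$ is calculated as:
$$\text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)}$$
The document's score is the sum of these weights over all query terms.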
2. Adaptability and Relevance Scoring
TF-IDF:
- Relatively simple and straightforward.
- Suitable for smaller or less complex document collections.
- The weighting is based solely on term frequency and inverse document frequency.
BM25:
- More flexible and adaptive due to the use of hyperparameters $k_1$ and $b$, which control term frequency saturation and document length normalization.
- Generally provides better results for larger and more complex document collections, especially in information retrieval.
- Considers not only the frequency of a term but also the document length and term saturation.
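The effect of the two hyperparameters can be seen by isolating the term-frequency part of BM25, without the IDF factor. The sketch below is illustrative only; the helper name `tf_component` and the chosen lengths and counts are hypothetical.

```python
def tf_component(f, doc_len, avgdl, k1=2.0, b=0.75):
    """Term-frequency part of BM25, without the IDF factor."""
    norm = 1 - b + b * doc_len / avgdl
    return f * (k1 + 1) / (f + k1 * norm)


# Saturation: in an average-length document, each extra occurrence adds less,
# and the value stays below k1 + 1.
gains = [tf_component(f, doc_len=100, avgdl=100) for f in range(1, 6)]
steps = [later - earlier for earlier, later in zip(gains, gains[1:])]

# Length normalization: same term count, but the longer document scores lower.
short_doc = tf_component(3, doc_len=50, avgdl=100)
long_doc = tf_component(3, doc_len=200, avgdl=100)

# With b = 0, document length has no effect on the component.
no_norm_short = tf_component(3, doc_len=50, avgdl=100, b=0.0)
no_norm_long = tf_component(3, doc_len=200, avgdl=100, b=0.0)
```

A larger `k1` delays saturation, so repeated terms keep adding weight for longer, while `b` moves between no length normalization (`b = 0`) and full normalization (`b = 1`).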
Summary
While TF-IDF is a simple and intuitive method for weighting terms based on their frequency and rarity, BM25 offers an advanced and fine-tuned method that considers additional factors such as document length and frequency saturation. As a result, BM25 is often better suited for more complex applications in information retrieval.