Author: Olaf Kopp

What is BM25?

BM25 is a popular ranking function used in information retrieval systems to estimate the relevance of documents to a given search query. It belongs to a family of scoring functions known as probabilistic information retrieval models, which are based on the probabilistic relevance framework.

How BM25 Works:

BM25 calculates a score for each document relative to a specific query, where higher scores indicate a greater relevance of the document to the query. The score is based on the query terms appearing in each document, taking into account the frequency of each term in the document and across all documents in the collection. Here’s a breakdown of the main components of the BM25 formula:

  1. Term Frequency (TF): This reflects how often a query term appears in a document. More occurrences of the term usually suggest higher relevance.
  2. Inverse Document Frequency (IDF): This measures the informativeness of a term. If a term appears in many documents, it is less likely to be significant for determining relevance. The IDF component of BM25 penalizes terms that are too common across documents.
  3. Document Length Normalization: This aspect of BM25 adjusts for the length of the document. Longer documents may have higher term frequencies simply due to their length, so BM25 normalizes for this, preventing longer documents from inherently receiving higher scores unless they are more relevant.

The BM25 Formula:

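The formula for BM25, for a document D and a query Q containing the terms q_1, ..., q_n, is as follows:

$$\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}$$

where: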

  • π‘žπ‘– is a query term,
  • 𝑓(π‘žπ‘–,𝐷) is π‘žπ‘–‘s term frequency in the document 𝐷,
  • 𝐷 is the length of the document,
  • avgdl is the average document length in the text collection,
  • π‘˜1 and 𝑏 are free parameters, usually chosen empirically (common values are π‘˜1=2.0 and 𝑏=0.75),
  • IDF(π‘žπ‘–) is the IDF for π‘žπ‘–.

Applications and Usage of BM25

BM25 is widely used in search engines and various information retrieval applications due to its effectiveness and efficiency. It is particularly well-regarded for its balance between simplicity and performance, making it a foundational component in many modern search systems, including those that use more complex machine learning models.

In summary, BM25 is a robust method for scoring documents based on their relevance to a query, efficiently balancing term frequency, document frequency, and document length.

Difference between BM25 and TF-IDF

The difference between BM25 (Best Matching 25) and TF-IDF (Term Frequency-Inverse Document Frequency) lies mainly in how they evaluate the relevance of documents to a search query. Here are the main differences:

1. Calculation and Weighting of Terms

TF-IDF:

  • Term Frequency (TF): Measures how often a term appears in a document. The more frequently a term appears, the higher its weighting.
  • Inverse Document Frequency (IDF): Measures how rare a term is across the entire document collection. Rare terms have a higher weighting as they are considered more relevant.

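The TF-IDF weighting, in its most common form, is calculated as:

$$\mathrm{TF\text{-}IDF}(t, d) = \mathrm{TF}(t, d) \cdot \mathrm{IDF}(t), \qquad \mathrm{IDF}(t) = \log \frac{N}{\mathrm{df}(t)}$$

where N is the number of documents in the collection and df(t) is the number of documents that contain the term t.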
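
A short Python sketch of this weighting, assuming the raw term count as TF and log(N / df) as IDF (several other variants exist). The function name tf_idf and the sample documents are made up for this example.

```python
import math
from collections import Counter


def tf_idf(term, doc, docs):
    """TF-IDF weight of `term` in `doc` (hypothetical helper, raw-count TF)."""
    tf = Counter(doc)[term]
    df = sum(1 for d in docs if term in d)
    if df == 0:
        return 0.0  # the term does not occur anywhere in the collection
    idf = math.log(len(docs) / df)
    return tf * idf


tfidf_docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the markets fell today".split(),
]
w_the = tf_idf("the", tfidf_docs[0], tfidf_docs)
w_cat = tf_idf("cat", tfidf_docs[0], tfidf_docs)
w_mat = tf_idf("mat", tfidf_docs[0], tfidf_docs)
print(w_the, w_cat, w_mat)
```

Although "the" occurs twice in the first document, it appears in every document, so its IDF and therefore its weight is zero. The rarer "mat" receives the highest weight.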

BM25:

  • BM25 is an extension of TF-IDF that introduces additional parameters to make the weighting more flexible and adaptive.
  • BM25 uses a saturated frequency function for Term Frequency (TF), considering that the relevance of a term does not increase linearly with its frequency.
  • BM25 also takes into account the length of documents and normalizes term frequency by it, so that longer documents are not favored merely because of their length.

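The BM25 weighting is calculated as follows, using the same symbols as in the BM25 formula section above:

$$\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}$$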

2. Adaptability and Relevance Scoring

TF-IDF:

  • Relatively simple and straightforward.
  • Suitable for smaller or less complex document collections.
  • The weighting is based solely on term frequency and inverse document frequency.

BM25:

  • More flexible and adaptive due to the use of hyperparameters k_1 and b, which control term frequency saturation and document length normalization.
  • Generally provides better results for larger and more complex document collections, especially in information retrieval.
  • Considers not only the frequency of a term but also the document length and term saturation.
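
The effect of the two hyperparameters can be seen by looking at the term frequency part of BM25 in isolation. The following Python sketch is only an illustration: the function name bm25_tf and the chosen document lengths are assumptions for this example, and the IDF factor is left out on purpose.

```python
def bm25_tf(tf, doc_len, avgdl, k1=2.0, b=0.75):
    """Term frequency component of BM25 without IDF (hypothetical helper)."""
    norm = 1.0 - b + b * doc_len / avgdl
    return tf * (k1 + 1.0) / (tf + k1 * norm)


# Saturation: in a document of average length, the value grows ever more
# slowly with the term count and stays below k1 + 1.
for tf in (1, 2, 5, 20, 100):
    print("tf =", tf, "->", round(bm25_tf(tf, doc_len=100, avgdl=100), 3))

# Length normalization: the same term count is worth less in a longer document.
short_doc = bm25_tf(3, doc_len=50, avgdl=100)
average_doc = bm25_tf(3, doc_len=100, avgdl=100)
long_doc = bm25_tf(3, doc_len=200, avgdl=100)
print(round(short_doc, 3), round(average_doc, 3), round(long_doc, 3))

# With b = 0 the document length has no influence at all.
no_norm_short = bm25_tf(3, doc_len=50, avgdl=100, b=0.0)
no_norm_long = bm25_tf(3, doc_len=200, avgdl=100, b=0.0)
```

A larger k_1 lets repeated occurrences count for longer before the curve flattens, while b moves between no length normalization (b = 0) and full length normalization (b = 1).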

Summary

While TF-IDF is a simple and intuitive method for weighting terms based on their frequency and rarity, BM25 offers an advanced and fine-tuned method that considers additional factors such as document length and frequency saturation. As a result, BM25 is often better suited for more complex applications in information retrieval.
