Author: Olaf Kopp
Reading time: 2 Minutes

What is BM25?

BM25 is a popular ranking function used in information retrieval systems to estimate the relevance of documents to a given search query. It belongs to a family of scoring functions known as probabilistic information retrieval models, which are based on the probabilistic relevance framework.

How BM25 Works:

BM25 calculates a score for each document relative to a specific query, where higher scores indicate a greater relevance of the document to the query. The score is based on the query terms appearing in each document, taking into account the frequency of each term in the document and across all documents in the collection. Here’s a breakdown of the main components of the BM25 formula:

  1. Term Frequency (TF): This reflects how often a query term appears in a document. More occurrences of the term usually suggest higher relevance.
  2. Inverse Document Frequency (IDF): This measures the informativeness of a term. If a term appears in many documents, it is less likely to be significant for determining relevance. The IDF component of BM25 penalizes terms that are too common across documents.
  3. Document Length Normalization: This aspect of BM25 adjusts for the length of the document. Longer documents may have higher term frequencies simply due to their length, so BM25 normalizes for this, preventing longer documents from inherently receiving higher scores unless they are more relevant.
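To make the three components concrete, here is a minimal Python sketch that computes each of them for a single term. The toy corpus, the whitespace tokenization and the variable names are assumptions made for illustration. The article does not state which IDF variant is meant, so the sketch assumes the smoothed, non-negative form ln(1 + (N - n + 0.5) / (n + 0.5)), where N is the number of documents and n the number of documents containing the term.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration; tokens are plain whitespace splits.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat across the garden and the yard",
    "search engines rank documents",
]
tokenized = [d.split() for d in docs]
term = "cat"

# 1. Term frequency: how often the term occurs in each document.
tf = [Counter(tokens)[term] for tokens in tokenized]

# 2. Inverse document frequency: terms found in fewer documents get more weight.
#    Assumed variant: ln(1 + (N - n + 0.5) / (n + 0.5)).
n_docs = len(tokenized)
df = sum(1 for tokens in tokenized if term in tokens)
idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))

# 3. Length normalization: each document's length relative to the average.
avgdl = sum(len(tokens) for tokens in tokenized) / n_docs
length_ratio = [len(tokens) / avgdl for tokens in tokenized]

print("tf:", tf)
print("idf:", round(idf, 3))
print("length ratio:", [round(r, 2) for r in length_ratio])
```

In this corpus the first two documents contain "cat" once each, but the second is longer than average, so a length-normalized score would rank it below the first.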

The BM25 Formula:

The formula for BM25 is as follows:

$$\text{Score} = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)}$$

where:

  • $q_i$ is a query term,
  • $f(q_i, D)$ is the term frequency of $q_i$ in the document $D$,
  • $|D|$ is the length of the document,
  • avgdl is the average document length in the text collection,
  • $k_1$ and $b$ are free parameters, usually chosen empirically (common values are $k_1$ between 1.2 and 2.0 and $b = 0.75$),
  • $\text{IDF}(q_i)$ is the inverse document frequency of $q_i$.
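The formula can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not a reference implementation: the function name `bm25_scores`, the lowercase whitespace tokenization and the toy corpus are hypothetical, and the IDF variant ln(1 + (N - n + 0.5) / (n + 0.5)) is an assumption because the article does not define IDF. The defaults use the parameter values mentioned in the text.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=2.0, b=0.75):
    """Return one BM25 score per document for the given query (sketch)."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: number of documents that contain each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Length normalization factor: 1 - b + b * |D| / avgdl
        norm = 1 - b + b * len(tokens) / avgdl
        score = 0.0
        for q in query.lower().split():
            f = tf[q]
            if f == 0:
                continue  # a term absent from the document contributes nothing
            # Assumed IDF variant (smoothed, never negative).
            idf = math.log(1 + (n_docs - df[q] + 0.5) / (df[q] + 0.5))
            score += idf * f * (k1 + 1) / (f + k1 * norm)
        scores.append(score)
    return scores

# Usage on a toy corpus invented for illustration.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat across the garden and the yard",
    "search engines rank documents",
]
scores = bm25_scores("cat", docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(scores, ranking)
```

Both of the first two documents contain "cat" exactly once, yet the shorter one scores higher because of the length normalization term; the third document does not contain the term and scores zero.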

Applications and Usage:

BM25 is widely used in search engines and other information retrieval applications because it is effective and cheap to compute. For example, it is the default similarity function in Apache Lucene and Elasticsearch. Its balance between simplicity and ranking quality makes it a foundational component of many modern search systems, including those that also use more complex machine learning models.

In summary, BM25 is a robust method for scoring documents based on their relevance to a query, efficiently balancing term frequency, document frequency, and document length.

About Olaf Kopp

Olaf Kopp is Co-Founder, Chief Business Development Officer (CBDO) and Head of SEO & Content at Aufgesang GmbH. He is an internationally recognized industry expert in semantic SEO, E-E-A-T, modern search engine technology, content marketing and customer journey management. As an author, Olaf Kopp writes for national and international magazines such as Search Engine Land, t3n, Website Boosting, Hubspot, Sistrix, Oncrawl, Searchmetrics, Upload … In 2022 he was a top contributor for Search Engine Land. His blog is one of the best-known online marketing blogs in Germany. In addition, Olaf Kopp is a speaker on SEO and content marketing at SMX, CMCx, OMT, OMX, Campixx...
