Skip to content

[FEATURE REQUEST] BM25 Inverted Index Search Algorithm #5614

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
prayas7102 opened this issue Oct 6, 2024 · 2 comments
Closed

[FEATURE REQUEST] BM25 Inverted Index Search Algorithm #5614

prayas7102 opened this issue Oct 6, 2024 · 2 comments

Comments

@prayas7102
Copy link
Contributor

prayas7102 commented Oct 6, 2024

What would you like to Propose?

This PR introduces an InvertedIndex data structure that uses the BM25 scoring algorithm for ranking movie search results. The InvertedIndex class indexes movie documents based on their content (e.g., descriptions or scripts), allowing users to perform full-text searches on the movie database.

Issue details

Overview

The problem this PR addresses is the need for an efficient text search mechanism that ranks documents (in this case, movies) based on their relevance to a search term. The solution uses BM25 scoring, a popular information retrieval algorithm, to rank the search results based on term frequency, inverse document frequency (IDF), and document length.

Objectives

  • Implement an InvertedIndex for movie search.
  • Use BM25 scoring to rank search results based on term frequency, IDF, and document length.
  • Provide test coverage for functionality and edge cases.
  • Resolves the need for efficient document retrieval and ranking.

Key Features

  1. Inverted Index Structure:
    • The index maps terms (words) to document IDs and term frequencies in those documents.
    • Supports fast retrieval of documents that contain the search terms.

image

  1. BM25 Scoring Algorithm:

    • Ranks search results based on relevance.
    • Considers term frequency, document length, and average document length.
    • Two tuning parameters: k (controls term frequency saturation) and b (controls document length normalization).

    Key Components of BM25:
    1. Term Frequency (TF): TF measures how frequently a term appears in a document. Higher frequency terms are often more relevant to the document.
    2. Inverse Document Frequency (IDF): IDF is used to scale down the importance of common terms (e.g., "the", "is") and boost the importance of rare terms. The more documents a term appears in, the lower its IDF score, as it is less discriminative for determining relevance.
    image
    3. Document Length Normalization: BM25 normalizes the term frequency by the document length to prevent long documents from being unfairly penalized (since they contain more terms). The normalization is controlled by a parameter b, where 0≤𝑏≤1. If 𝑏=0, length normalization is not applied. If 𝑏=1 full length normalization is applied.
    4. Saturation of Term Frequency: BM25 applies a saturation function to the term frequency, meaning that after a certain point, the frequency of a term does not contribute linearly to the relevance score. This is controlled by the parameter 𝑘1
    , which adjusts the sensitivity to term frequency.
    image

  2. Movies:

    • Each movie has a unique docId, name, IMDb rating, release year, and content (description or script).
    • The search results are based on the content of these movies.
  3. Search Functionality:

    • Allows searching for a term, and returns a list of relevant documents (movies) with their BM25 scores.
    • The results are sorted by relevance score in descending order..

Implementation Details

  • InvertedIndex Class: Handles indexing and BM25 score calculation.
  • SearchResult Class: Stores docId and relevanceScore.
  • Movie Class: Represents a movie with relevant details and content.

Testing

  • Comprehensive test suite using JUnit 5:
    • Test for adding movies, search results, BM25 score calculation, case sensitivity, and edge cases.
    • Validate print functions for manual inspection

Additional Information

Role of Inverse Document Frequency (IDF):

The IDF component in BM25 plays a critical role in ranking because it adjusts the importance of terms based on their overall presence in the document collection. If a term appears in many documents, its IDF is low, as it is less useful for distinguishing between relevant and non-relevant documents. Conversely, rare terms get a higher IDF score, boosting the document's relevance score for those terms.

Why BM25 is Effective:

  1. Non-linear term frequency: Unlike plain TF-IDF, which increases the score linearly with term frequency, BM25 dampens this effect. This is more realistic, as repeating a word many times in a document does not infinitely increase its relevance.
  2. Document length normalization: BM25 adjusts for document length, preventing long documents from being unfairly penalized or short documents from being unfairly boosted.
  3. IDF component: It improves on raw term frequency by accounting for how informative a term is across the entire corpus, which improves the relevance of returned results.

Time and Space Complexity Analysis

1. Adding a Movie to the Index (addMovie)

When a movie is added to the index, the method processes the content of the movie and updates the inverted index.

  • Time Complexity:

    • Let N be the number of words in the movie content.
    • For each word in the movie's content:
      • Tokenization and normalization: O(N).
      • For each word, it either updates or inserts into the inverted index, which is a HashMap. Each update in the HashMap has an average time complexity of O(1) (since HashMap operations like put() are constant time on average).
      • Updating the average document length takes constant time: O(1).
    • Overall Time Complexity: O(N) per movie, where N is the number of words in the movie's content.
  • Space Complexity:

    • Each word in the movie's content is stored in the inverted index.
    • If there are M movies and each movie has an average of N words, the space required for the inverted index will be proportional to O(M * N) for storing the terms and their frequencies.
    • Each movie is also stored in a separate movies map, which takes O(M) space.
    • Overall Space Complexity: O(M * N).

2. Searching for a Term (search)

When searching for a term, the algorithm retrieves all documents that contain the term and computes the BM25 relevance score for each document.

  • Time Complexity:

    • Let D be the number of documents that contain the search term (i.e., the document frequency for the term).
    • For each document containing the term:
      • Compute the BM25 score, which involves calculating term frequency and document length (both are retrieved in O(1) time).
      • Sorting the results by relevance score takes O(D log D).
    • Overall Time Complexity: O(D log D), where D is the number of documents containing the term.
  • Space Complexity:

    • The space required to store the results is proportional to the number of documents containing the term.
    • Overall Space Complexity: O(D), where D is the number of documents that contain the search term.

3. BM25 Score Calculation (computeBM25Score)

The BM25 score calculation for a single document is a constant-time operation:

  • Time Complexity: O(1) per document.
  • Space Complexity: O(1) per document.

Overall Complexity Summary

  • Adding a Movie:

    • Time: O(N) (where N is the number of words in the movie's content).
    • Space: O(M * N) (where M is the number of movies and N is the average number of words in each movie).
  • Searching for a Term:

    • Time: O(D log D) (where D is the number of documents containing the term).
    • Space: O(D).

The performance of this implementation is efficient given that the HashMap operations for indexing and retrieving terms have average O(1) time complexity. The search is also optimized by focusing only on documents that contain the searched term.

@prayas7102
Copy link
Contributor Author

Required PR for this issue: #5615

@prayas7102
Copy link
Contributor Author

Since the required PR for this issue has been merged, I'm closing this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant