
[BUG] Negative or Zero IDF Scores in BM25 Due to Missing Smoothing Constant #5697


Closed
prayas7102 opened this issue Oct 10, 2024 · 3 comments

@prayas7102
Contributor

Description

In PR #5615, which I submitted, I mistakenly omitted the smoothing constant (+1) from the Inverse Document Frequency (IDF) calculation (line 218 in BM2InvertedIndex.java). Without the +1, the value inside the logarithm can be 1 or less for terms that appear in many documents of the corpus, so the IDF of such common terms becomes zero or negative. Zero or negative IDF scores distort relevance ranking: common terms either contribute nothing or actively lower a document's score, which produces inaccurate search results. Adding the +1 ensures that all terms, even frequent ones, contribute positively, which keeps relevance scoring balanced.
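To make the effect concrete, here is a minimal sketch that compares the two variants. It assumes the commonly used BM25 form with 0.5 offsets in the numerator and denominator; the class name, method names, and the corpus size of 10 are hypothetical and not taken from the project code.

```java
// Illustrative only: names and the 0.5 offsets are assumptions, not the project's code.
final class IdfSmoothingDemo {

    // IDF without the +1: the ratio drops to 1 or below once a term
    // appears in at least half of the documents, so the log is <= 0.
    static double idfWithoutSmoothing(int totalDocuments, int docFrequency) {
        return Math.log((totalDocuments - docFrequency + 0.5) / (docFrequency + 0.5));
    }

    // IDF with the +1 added inside the logarithm: the argument stays
    // above 1 for any docFrequency <= totalDocuments, so the log is > 0.
    static double idfWithSmoothing(int totalDocuments, int docFrequency) {
        return Math.log((totalDocuments - docFrequency + 0.5) / (docFrequency + 0.5) + 1.0);
    }

    public static void main(String[] args) {
        int totalDocuments = 10; // hypothetical corpus size
        for (int docFrequency : new int[] {1, 5, 8, 10}) {
            System.out.printf("docFrequency=%d  without +1: %.4f  with +1: %.4f%n",
                docFrequency,
                idfWithoutSmoothing(totalDocuments, docFrequency),
                idfWithSmoothing(totalDocuments, docFrequency));
        }
    }
}
```

With these inputs, the unsmoothed value is zero when the term appears in 5 of 10 documents and negative when it appears in 8 or 10, while the smoothed value stays positive in every case.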

Steps to reproduce

  1. Go to BM2InvertedIndexTest.java.
  2. Run the function testSearchRanking().
  3. The test passes with the movie It's a Wonderful Life (docId: 6) ranked first by relevance score, but according to the search algorithm The Shawshank Redemption (docId: 1) should rank first.
  4. The expected values in the test cases appear to have been written incorrectly as well.
  5. The same applies to the other movies in the search results.

Expected behavior

The movie The Shawshank Redemption (docId: 1) should rank first instead of It's a Wonderful Life (docId: 6).

Screenshots

Current Behaviour:
image

Expected Behaviour:
image

Additional context

Required PR: #5696

@prayas7102 prayas7102 added the bug label Oct 10, 2024
@shamsulalam1114

shamsulalam1114 commented Oct 10, 2024

1. In line 218, to prevent negative or zero IDF values:
double idf = Math.log((totalDocuments - docFrequency + 1) / (docFrequency + 1));
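As a quick check of the suggested expression, the sketch below evaluates it in two ways. The types of `totalDocuments` and `docFrequency` are not shown in this thread, so the `int` parameters here are an assumption, and the class and method names are hypothetical.

```java
// Illustrative only: parameter types and names are assumptions.
final class SuggestedIdfCheck {

    // The expression as posted, assuming both counts are ints:
    // the division is then integer division and truncates toward zero.
    static double suggestedWithInts(int totalDocuments, int docFrequency) {
        return Math.log((totalDocuments - docFrequency + 1) / (docFrequency + 1));
    }

    // The same expression with floating-point division.
    static double suggestedWithDoubles(int totalDocuments, int docFrequency) {
        return Math.log((totalDocuments - docFrequency + 1.0) / (docFrequency + 1.0));
    }

    public static void main(String[] args) {
        // Hypothetical corpus: a term found in 8 of 10 documents.
        System.out.println(suggestedWithInts(10, 8));    // 3 / 9 truncates to 0
        System.out.println(suggestedWithDoubles(10, 8)); // log of 3.0 / 9.0
    }
}
```

Under these assumptions, a term that appears in more than half of the documents still gets a negative value with floating-point division, and integer division makes it negative infinity, so it may be worth comparing this expression against the one merged in PR #5696.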

@SAIVARDHAN15
Contributor

I would like to solve this issue

@prayas7102
Contributor Author

prayas7102 commented Oct 10, 2024

1. In line 218, to prevent negative or zero IDF values: double idf = Math.log((totalDocuments - docFrequency + 1) / (docFrequency + 1));

The fix has already been merged, so I am closing this issue (see PR #5696).
