In PR #5615, which I submitted, I mistakenly left the smoothing constant (+1) out of the Inverse Document Frequency (IDF) calculation (line 218 in BM25InvertedIndex.java). Without the +1, the BM25 IDF formula can yield zero or negative scores for common terms, because the value inside the logarithm can drop to 1 or below for terms that appear in a large share of the documents in the corpus. Zero or negative IDF scores distort relevance ranking: common terms either contribute nothing or subtract from the final score, which produces inaccurate search results. Adding the +1 keeps the value inside the logarithm above 1, so every term, even a frequent one, contributes positively to the score.
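To illustrate the effect, here is a minimal sketch that compares the two IDF variants. It assumes the commonly used BM25 form `ln((N - n + 0.5) / (n + 0.5))`, where `N` is the total number of documents and `n` is the number of documents containing the term; the exact expression on line 218 may differ. The class and method names (`IdfSmoothingDemo`, `idfUnsmoothed`, `idfSmoothed`) are hypothetical and do not come from the PR.

```java
public class IdfSmoothingDemo {

    // Assumed unsmoothed form: ln((N - n + 0.5) / (n + 0.5)).
    // The ratio is <= 1 when the term occurs in at least half of the documents,
    // so the result is zero or negative for such terms.
    public static double idfUnsmoothed(int totalDocs, int docFreq) {
        return Math.log((totalDocs - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Assumed smoothed form: ln(1 + (N - n + 0.5) / (n + 0.5)).
    // The argument of the logarithm is always > 1, so the result is always positive.
    public static double idfSmoothed(int totalDocs, int docFreq) {
        return Math.log(1.0 + (totalDocs - docFreq + 0.5) / (docFreq + 0.5));
    }

    public static void main(String[] args) {
        int totalDocs = 10; // hypothetical corpus size
        // Rare term, term in exactly half of the documents, very common term.
        for (int docFreq : new int[] {1, 5, 9}) {
            System.out.printf("n=%d  unsmoothed=%.4f  smoothed=%.4f%n",
                    docFreq,
                    idfUnsmoothed(totalDocs, docFreq),
                    idfSmoothed(totalDocs, docFreq));
        }
    }
}
```

Under these assumptions, the unsmoothed value is positive for `n = 1`, exactly zero for `n = 5`, and negative for `n = 9`, while the smoothed value stays positive in all three cases.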
Steps to reproduce
Go to BM25InvertedIndexTest.java
Run testSearchRanking()
The test passes with the movie It's a Wonderful Life (docId: 6) ranked first by relevance score. According to the search algorithm, however, The Shawshank Redemption (docId: 1) should rank first.
Apparently the expected values in the test were written incorrectly too.
The same applies to the other movies in the search results.
Expected behavior
The movie The Shawshank Redemption (docId: 1) should rank first, ahead of It's a Wonderful Life (docId: 6).
Additional context
Required PR: #5696