-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
PERF: Increase threashold for using binary search in IndexEngine #54746
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PERF: Increase threashold for using binary search in IndexEngine #54746
Conversation
@@ -375,7 +375,7 @@ cdef class IndexEngine: | |||
# map each starget to its position in the index | |||
if ( | |||
stargets and | |||
len(stargets) < 5 and | |||
len(stargets) < (n / (2 * n.bit_length())) and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bit_lenght
is just a fast way of computing the ceiling of log2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice! I think this is good
doc/source/whatsnew/v2.2.0.rst
Outdated
@@ -105,7 +105,7 @@ Deprecations | |||
|
|||
Performance improvements | |||
~~~~~~~~~~~~~~~~~~~~~~~~ | |||
- | |||
- Increase the threshold to use binary search instead of a linear search in IndexEngine :issue:`54550` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Increase the threshold to use binary search instead of a linear search in IndexEngine :issue:`54550` | |
- Performance improvement when indexing with more than 4 keys (:issue:`54550`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, thanks a lot for the review.
done!
Thanks @l3robot |
Glad to contribute! |
…das-dev#54746) * Increase threashold for using binary search in IndexEngine * Add an entry to the latest whatsnew * Improve entry in the lastest whatsnew --------- Co-authored-by: Patrick Hoefler <[email protected]>
Checklist
.loc
with big non-unique index dataframe #54550 (Replace xxxx with the GitHub issue number)doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.Explanation
So after a bit of research for #54550, I found that the source of the behavior is coming from PR #22826 at this line and particularly this comment here.
In the comment, @rtlee9 makes an excellent explanation for the reason behind the threshold. The default behavior is a linear search (
O(n)
) in the pandas index and #22826 added a binary search optimization for monotonically increasing index which is inO(m*log(n))
wherem
is the number of target indexes.Proposed Solution
I'm proposing we increase the limit, which was set conservatively to 5. We could at least increase it to n / (2 * log2[n]) which is derivable from m*log2[n]<n and dividing per 2 for keeping a certain margin. I'll open a PR for that.
New results of benchmarking + homemade script
Here is a screenshot of
asv continuous -f 1.1 -E virtualenv upstream/main HEAD -b ^indexing.*Index*
output (done on Macbook Air m1, inside the docker container provided by pandas):Here is also a new output for the script use to demonstrate the problem in #54700:
Thanks in advance for the review!