Huge performance hit in get_indexer_non_unique #15364
Comments
Of course, you shouldn't use a non-unique MI.
Why not?
If you have a perf fix, please submit a pull request.
@horta well we use a hashtable to look up values in a unique MI, so this is quite performant. Non-unique indexes are very uncommon, and hence don't get much attention. Indexing is hard / non-performant.
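To make the contrast concrete, here is a minimal Python sketch (not pandas internals; the function names are hypothetical) of why lookups against a unique index are fast: the index can be inverted into a hash table once, after which each target lookup is O(1) on average.

```python
def build_lookup(index):
    """Invert a unique index into a hash table: label -> position.

    This is only possible because each label occurs once; with
    duplicates, one key would need to map to several positions.
    """
    return {label: pos for pos, label in enumerate(index)}


def get_indexer_unique(index, targets):
    """O(n + m): build the table once, then one O(1) probe per target."""
    table = build_lookup(index)
    return [table.get(t, -1) for t in targets]  # -1 marks a missing label
```

With duplicates, the simple label-to-position mapping above no longer works, which is why the non-unique path needs a different (and historically slower) strategy.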
To be honest this has not come up before, and is not benchmarked at all. If you have a small example, that would be great. Better yet, a solution!
So I would buy that this is inefficient (it doesn't matter if it's a MI or not, FYI).
omg that was quick o.o Thanks for the feedback! I actually have an example that is taking like 30 minutes to finish, and I'm working with not-so-big indices. I will clean that example up and post it here ASAP.
Yeah, this certainly can be fixed, but when I wrote this a while back it was quick and dirty. Welcome to have you work on it! That Cython code is pretty inefficient actually. Need a better structure that can be completely in-lined.
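For illustration, a hedged sketch of the kind of per-target rescan that produces the superlinear cost being discussed here (this is plain Python, not the actual Cython code): every target triggers a fresh search over the whole index, so total work grows with `len(values) * len(targets)`.

```python
def get_indexer_naive(values, targets):
    """O(n * m) lookup: rescan all of `values` for every target.

    Matches are the positions of every equal element; a target with
    no match contributes a single -1, mirroring the usual indexer
    convention.
    """
    result = []
    for t in targets:
        found = [i for i, v in enumerate(values) if v == t]  # full rescan
        result.extend(found if found else [-1])
    return result
```

Even replacing the linear scan with a binary search per target still re-searches for every target, which is the inefficiency the thread is pointing at.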
Let `df0` and `df1` be indexed data frames. Assume that at least one of them has a non-unique index. If I'm not mistaken, the function at #L309 will always be called upon the operation. Now imagine that both indexes are reasonably large (let's say at least one million records each). The operations at those lines will be of order `O(n^2*log(n)^2)` even when both indices are sorted.

Am I missing something? Let me know if the above reasoning is right so I can help with it. If both indexes are sorted, it looks like that function can be as fast as `O(n)` simply by looping over the `values` and `targets` arrays simultaneously (assuming that the index duplication is very small compared to `n`).
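The proposed linear pass can be sketched as a two-pointer merge over the two sorted arrays. This is a minimal Python illustration of the idea, not pandas' actual implementation; the function name is made up for this sketch.

```python
def get_indexer_sorted(values, targets):
    """For each target, collect the positions of equal elements in
    `values` (-1 if absent). Both arrays must be sorted ascending.

    Each array is traversed once, so the cost is O(n + m + k) where
    k is the number of matches, rather than a fresh search per target.
    """
    result = []
    i = 0
    n = len(values)
    for t in targets:
        # advance past values smaller than the current target
        while i < n and values[i] < t:
            i += 1
        j = i
        matched = False
        # emit every duplicate position equal to the target
        while j < n and values[j] == t:
            result.append(j)
            matched = True
            j += 1
        if not matched:
            result.append(-1)
        # note: `i` is left at the start of the matching block, so a
        # repeated target sees the same block of matches again
    return result
```

When duplication is small relative to `n`, the match count `k` is O(n), so the whole pass stays linear, matching the `O(n)` claim above for sorted inputs.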