BUG: GH4017, efficiently support non-unique indicies with iloc #4018
Thanks for getting to this so quickly. A quick comment - in the example above, because of the 2* and 3* in line 5, the contents of the two dataframes are actually different (even ignoring the index), so I wouldn't expect the results to be the same, just to take the same amount of time to calculate. |
if you were using |
Oh yes - I meant to put df2 inside the concat, not df... |
one more question: in 0.11.1 non-unique indexing was changed to guarantee ordering, IOW, you get rows back in the same order as the labels you passed in. However, this comes at a speed penalty
In [18]: %timeit df2.loc[idx]
With no ordering guarantee
In [17]: %timeit df2.loc[df2.index.isin(idx)]
should prob just document this? what do you think (this only really matters when you have lots of indexers) |
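To make the trade-off above concrete, here is a minimal sketch of the two lookups on a tiny frame. The data and the names `ordered` and `masked` are made up for illustration (they are not from this thread), and the behaviour shown is that of a recent pandas release rather than 0.11.1. No timings are claimed.

```python
import pandas as pd

# Hypothetical small frame with a non-unique index
# (labels "a" and "b" each appear twice).
df2 = pd.DataFrame({"A": [1, 2, 3, 4, 5]}, index=["a", "b", "a", "c", "b"])
idx = ["b", "a"]

# Label-based lookup: rows come back grouped in the order
# the labels were requested ("b" rows first, then "a" rows).
ordered = df2.loc[idx]

# Boolean-mask lookup: same rows, but in the frame's own order.
masked = df2.loc[df2.index.isin(idx)]

print(ordered)
print(masked)
```

Both selections contain the same rows; only the row order differs, which is why the mask form is an option when ordering does not matter.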
Are those timings with a newer version than what is currently on GitHub? Assuming this is the same df2 and idx as in the example from earlier, df2.loc[idx] takes forever on my machine (about 5 minutes so far and it still hasn't given me a result, and that's without using %timeit!) |
from this PR, clone this branch and give a try (master very slow on |
Maybe this should be an option? The time difference really is huge. Though I guess the user can always use the second version (df2.loc[df2.index.isin(idx)]) manually, as long as it's documented in very large letters somewhere (e.g., when you type df.loc( and press TAB). |
@rhstanton ok...give this a try, it's a bit faster....was doing stupid things; this is an easy problem once you figure it out..... notice the different results FYI between using
Here's an example of using repeated entries; ordering is preserved
|
Just tried it. It's much faster than before. |
gr8...thanks for your help...merging soon |
PERF: getting an indexer with a non_unique index, now MUCH faster
PERF: vbench for loc/iloc with dups
BUG: sparse reindex needed takeable arg
TST
BUG: correctly interpret tuple/list in non_unique indexers
BUG: df.loc[idx] with out-of-bounds indexers not correctly interpreted
PERF: df.loc with non-unique index not blazing fast!
A (very) minor follow-up: Here are some new test results, where the input dataframe is identical apart from one having a unique index and one having a repeated index. In both cases, iloc now runs very fast, but why does it take over 4x as long with the unique index? I'd have assumed that the timing of iloc should be independent of the index.
df = DataFrame({'A' : [0.1] * 30000000, 'B' : [1] * 30000000})
1 loops, best of 3: 4.81 ms per loop
df2 = DataFrame({'A' : [0.1] * 10000000, 'B' : [1] * 10000000})
1 loops, best of 3: 1.15 ms per loop |
good point.... unique case was converting to labels then back to indexers so doing some extra work, fixed in #4070 |
closes #4017
This was a bug because iloc, when dealing with a non-unique index, was
reindexing, which is not correct in this situation; it can instead
effectively do a take.
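As an illustration of the "effectively take" idea, the sketch below checks that positional selection with iloc on a non-unique index matches a plain positional take. The frame and the names `by_iloc` and `by_take` are hypothetical, and this shows the user-facing behaviour in a recent pandas release, not the internal code path changed in this PR.

```python
import pandas as pd

# Hypothetical frame whose index repeats the label "x".
df = pd.DataFrame({"A": [10, 20, 30, 40]}, index=["x", "x", "y", "x"])
positions = [3, 0, 3]

# Positional selection: labels play no part, so duplicates in the
# index (or in the requested positions) are not a problem.
by_iloc = df.iloc[positions]

# The same rows obtained with an explicit positional take.
by_take = df.take(positions)

print(by_iloc)
print(by_iloc.equals(by_take))
```

Because the selection is purely positional, a label-based reindex is unnecessary here, and with a non-unique index it would be ambiguous.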