Skip to content

Huge performance hit in get_indexer_non_unique #15364

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
horta opened this issue Feb 10, 2017 · 8 comments · Fixed by #55816
Closed

Huge performance hit in get_indexer_non_unique #15364

horta opened this issue Feb 10, 2017 · 8 comments · Fixed by #55816
Labels
Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex Performance Memory or execution speed performance

Comments

@horta
Copy link
Contributor

horta commented Feb 10, 2017

Let df0 and df1 be indexed data frames. Assume that at least one of them has a non-unique index. If I'm not mistaken, the function #L309 will always be called upon the operation:

df0.loc[df1.index]

Now imagine that both of indexes are reasonably large (lets say at least one millions of records). The operations at lines

will be of order O(n^2*log(n)^2) even when both indices are sorted.

Am I missing something? Let me know if the above reasoning is right so I can help with it. If both indexes are sorted, It looks like that function can be as fast as O(n) simply by looping over the values and targets arrays simultaneously. (Assuming that the index duplication is very small compared to n.

@jreback
Copy link
Contributor

jreback commented Feb 10, 2017

of course, you shouldn't use a non-unique MI.

@horta
Copy link
Contributor Author

horta commented Feb 10, 2017

Why not?

@jreback
Copy link
Contributor

jreback commented Feb 10, 2017

if you have a perf fix, please submit a pull-request.

@jreback
Copy link
Contributor

jreback commented Feb 10, 2017

@horta well we use a hashtable to look up values in a unique MI. so this is quite performant. Non-unique are very uncommon , and hence don't get much attention. Indexing is hard / non-performant.

@jreback jreback added MultiIndex Performance Memory or execution speed performance labels Feb 10, 2017
@jreback
Copy link
Contributor

jreback commented Feb 10, 2017

To be honest this has not come up before, and is not benchmarked at all. If you have a small example would be great. Better yet a soln!

@jreback
Copy link
Contributor

jreback commented Feb 10, 2017

So I would buy that this is inefficient (it doesn't matter if its a MI or not, FYI)

In [6]: Index(Index(np.arange(1000000)).tolist() + Index(np.arange(1000000)).tolist())
Out[6]: 
Int64Index([     0,      1,      2,      3,      4,      5,      6,      7,      8,      9,
            ...
            999990, 999991, 999992, 999993, 999994, 999995, 999996, 999997, 999998, 999999],
           dtype='int64', length=2000000)

In [7]: i = Index(Index(np.arange(1000000)).tolist() + Index(np.arange(1000000)).tolist())

In [8]: len(i)
Out[8]: 2000000

In [9]: i.is_unique
Out[9]: False

In [10]: %time i.get_indexer_non_unique(i[0:10000])
CPU times: user 216 ms, sys: 10.1 ms, total: 226 ms
Wall time: 226 ms
Out[10]: 
(Int64Index([      0, 1000000,       1, 1000001,       2, 1000002,       3, 1000003,       4, 1000004,
             ...
                9995, 1009995,    9996, 1009996,    9997, 1009997,    9998, 1009998,    9999, 1009999],
            dtype='int64', length=20000), array([], dtype=int64))

In [11]: i = Index(Index(np.arange(1000000)).tolist() + Index(np.arange(1000000)).tolist())

In [12]: %time i.get_indexer_non_unique(i[0:100000])
CPU times: user 372 ms, sys: 33.3 ms, total: 406 ms
Wall time: 406 ms
Out[12]: 
(Int64Index([      0, 1000000,       1, 1000001,       2, 1000002,       3, 1000003,       4, 1000004,
             ...
               99995, 1099995,   99996, 1099996,   99997, 1099997,   99998, 1099998,   99999, 1099999],
            dtype='int64', length=200000), array([], dtype=int64))

In [13]: i = Index(Index(np.arange(1000000)).tolist() + Index(np.arange(1000000)).tolist())

In [14]: %time i.get_indexer_non_unique(i[0:1000000])
CPU times: user 2.7 s, sys: 1.35 s, total: 4.05 s
Wall time: 4.06 s
Out[14]: 
(Int64Index([      0, 1000000,       1, 1000001,       2, 1000002,       3, 1000003,       4, 1000004,
             ...
              999995, 1999995,  999996, 1999996,  999997, 1999997,  999998, 1999998,  999999, 1999999],
            dtype='int64', length=2000000), array([], dtype=int64))

@jreback jreback added this to the Next Major Release milestone Feb 10, 2017
@horta
Copy link
Contributor Author

horta commented Feb 10, 2017

omg that was quick o.o

Thanks for the feedback! I actually have an example that is taking like 30 minutes to finish, and I'm working with not so big indices. I will clean that example up and post here asap.

@jreback
Copy link
Contributor

jreback commented Feb 10, 2017

yeah this certainly can be fixed, but when I wrote this a while back it was quick and dirtly. Welcome to have you work on it!

That cython code is pretty inefficient actually. Need a better structure that can be completely in-lined.

@jreback jreback mentioned this issue Sep 25, 2018
4 tasks
@jbrockmendel jbrockmendel added the Indexing Related to indexing on series/frames, not to indexes themselves label Sep 21, 2020
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants