Huge performance hit in get_indexer_non_unique #15364

horta · 2017-02-10T19:45:49Z

Let df0 and df1 be indexed data frames. Assume that at least one of them has a non-unique index. If I'm not mistaken, the function #L309 will always be called upon the operation:

df0.loc[df1.index]

Now imagine that both of indexes are reasonably large (lets say at least one millions of records). The operations at lines

will be of order O(n^2*log(n)^2) even when both indices are sorted.

Am I missing something? Let me know if the above reasoning is right so I can help with it. If both indexes are sorted, It looks like that function can be as fast as O(n) simply by looping over the values and targets arrays simultaneously. (Assuming that the index duplication is very small compared to n.

The text was updated successfully, but these errors were encountered:

jreback · 2017-02-10T19:53:32Z

of course, you shouldn't use a non-unique MI.

horta · 2017-02-10T19:54:16Z

Why not?

jreback · 2017-02-10T19:54:22Z

if you have a perf fix, please submit a pull-request.

jreback · 2017-02-10T19:55:08Z

@horta well we use a hashtable to look up values in a unique MI. so this is quite performant. Non-unique are very uncommon , and hence don't get much attention. Indexing is hard / non-performant.

jreback · 2017-02-10T19:57:32Z

To be honest this has not come up before, and is not benchmarked at all. If you have a small example would be great. Better yet a soln!

jreback · 2017-02-10T20:07:33Z

So I would buy that this is inefficient (it doesn't matter if its a MI or not, FYI)

In [6]: Index(Index(np.arange(1000000)).tolist() + Index(np.arange(1000000)).tolist())
Out[6]: 
Int64Index([     0,      1,      2,      3,      4,      5,      6,      7,      8,      9,
            ...
            999990, 999991, 999992, 999993, 999994, 999995, 999996, 999997, 999998, 999999],
           dtype='int64', length=2000000)

In [7]: i = Index(Index(np.arange(1000000)).tolist() + Index(np.arange(1000000)).tolist())

In [8]: len(i)
Out[8]: 2000000

In [9]: i.is_unique
Out[9]: False

In [10]: %time i.get_indexer_non_unique(i[0:10000])
CPU times: user 216 ms, sys: 10.1 ms, total: 226 ms
Wall time: 226 ms
Out[10]: 
(Int64Index([      0, 1000000,       1, 1000001,       2, 1000002,       3, 1000003,       4, 1000004,
             ...
                9995, 1009995,    9996, 1009996,    9997, 1009997,    9998, 1009998,    9999, 1009999],
            dtype='int64', length=20000), array([], dtype=int64))

In [11]: i = Index(Index(np.arange(1000000)).tolist() + Index(np.arange(1000000)).tolist())

In [12]: %time i.get_indexer_non_unique(i[0:100000])
CPU times: user 372 ms, sys: 33.3 ms, total: 406 ms
Wall time: 406 ms
Out[12]: 
(Int64Index([      0, 1000000,       1, 1000001,       2, 1000002,       3, 1000003,       4, 1000004,
             ...
               99995, 1099995,   99996, 1099996,   99997, 1099997,   99998, 1099998,   99999, 1099999],
            dtype='int64', length=200000), array([], dtype=int64))

In [13]: i = Index(Index(np.arange(1000000)).tolist() + Index(np.arange(1000000)).tolist())

In [14]: %time i.get_indexer_non_unique(i[0:1000000])
CPU times: user 2.7 s, sys: 1.35 s, total: 4.05 s
Wall time: 4.06 s
Out[14]: 
(Int64Index([      0, 1000000,       1, 1000001,       2, 1000002,       3, 1000003,       4, 1000004,
             ...
              999995, 1999995,  999996, 1999996,  999997, 1999997,  999998, 1999998,  999999, 1999999],
            dtype='int64', length=2000000), array([], dtype=int64))

horta · 2017-02-10T20:10:45Z

omg that was quick o.o

Thanks for the feedback! I actually have an example that is taking like 30 minutes to finish, and I'm working with not so big indices. I will clean that example up and post here asap.

jreback · 2017-02-10T20:19:55Z

yeah this certainly can be fixed, but when I wrote this a while back it was quick and dirtly. Welcome to have you work on it!

That cython code is pretty inefficient actually. Need a better structure that can be completely in-lined.

jreback added MultiIndex Performance Memory or execution speed performance labels Feb 10, 2017

jreback added this to the Next Major Release milestone Feb 10, 2017

jreback added Difficulty Advanced labels Feb 10, 2017

horta mentioned this issue Feb 12, 2017

get_indexer_non_unique for orderable indexes #15372

Closed

horta mentioned this issue Feb 21, 2017

PERF: fast non-unique indexing #15468

Closed

jreback mentioned this issue Sep 25, 2018

Loc enhancements #22826

Merged

4 tasks

jbrockmendel removed Effort Medium labels Oct 21, 2019

jbrockmendel added the Indexing Related to indexing on series/frames, not to indexes themselves label Sep 21, 2020

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022

lukemanley mentioned this issue Nov 3, 2023

PERF: IndexEngine.get_indexer_non_unique #55816

Merged

5 tasks

mroeschke closed this as completed in #55816 Nov 6, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Huge performance hit in get_indexer_non_unique #15364

Huge performance hit in get_indexer_non_unique #15364

horta commented Feb 10, 2017

jreback commented Feb 10, 2017

horta commented Feb 10, 2017

jreback commented Feb 10, 2017

jreback commented Feb 10, 2017

jreback commented Feb 10, 2017

jreback commented Feb 10, 2017

horta commented Feb 10, 2017

jreback commented Feb 10, 2017 •

edited

Loading

Huge performance hit in get_indexer_non_unique #15364

Huge performance hit in get_indexer_non_unique #15364

Comments

horta commented Feb 10, 2017

jreback commented Feb 10, 2017

horta commented Feb 10, 2017

jreback commented Feb 10, 2017

jreback commented Feb 10, 2017

jreback commented Feb 10, 2017

jreback commented Feb 10, 2017

horta commented Feb 10, 2017

jreback commented Feb 10, 2017 • edited Loading

jreback commented Feb 10, 2017 •

edited

Loading