
BUG: GH4017, efficiently support non-unique indices with iloc #4018


Merged
merged 1 commit into from
Jun 26, 2013

Conversation

@jreback (Contributor) commented Jun 25, 2013

closes #4017

This was a bug: iloc was dealing with a non-unique index by reindexing, which
is not correct in this situation; it can instead effectively take by position.
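The distinction can be sketched on a toy frame (hypothetical labels, not taken from the sessions below): a positional take grabs exactly one row per requested position, while label-based lookup on a duplicated label matches every row carrying that label, which is why routing iloc through label machinery was wrong.

```python
import pandas as pd

# Toy frame with a duplicated label in the index.
df = pd.DataFrame({"A": [10, 20, 30]}, index=[0, 0, 1])

# A positional take: one row per requested position.
assert list(df.take([0, 2])["A"]) == [10, 30]

# Label-based lookup on the duplicated label 0 returns *both* rows,
# so it is not a substitute for positional selection.
assert len(df.loc[[0]]) == 2
```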

In [1]: df = DataFrame({'A' : [0.1] * 300000, 'B' : [1] * 300000})

In [2]: idx = np.array(range(3000)) * 99

In [3]: expected = df.iloc[idx]

In [4]: df2 = DataFrame({'A' : [0.1] * 100000, 'B' : [1] * 100000})

In [5]: df2 = pd.concat([df2, 2*df2, 3*df2])

In [6]: %timeit df2.iloc[idx]
1000 loops, best of 3: 221 us per loop

In [7]: %timeit df2.loc[idx]
10 loops, best of 3: 25.6 ms per loop
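As a sanity check, here is a scaled-down sketch (toy sizes, not the 300k-row frames above) showing that integer-array iloc on a frame with a non-unique index is equivalent to a plain positional take:

```python
import numpy as np
import pandas as pd

# Smaller analogue of the frames in the session above.
df2 = pd.concat([pd.DataFrame({"A": [0.1] * 100, "B": [1] * 100})] * 3)
idx = np.arange(30) * 9  # positions 0, 9, ..., 261, all < len(df2)

# On a non-unique index, positional selection is just a take.
assert df2.iloc[idx].equals(df2.take(idx))
```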

@rhstanton (Contributor)

Thanks for getting to this so quickly. A quick comment: because of the 2* and 3* in line 5 of the example above, the contents of the two dataframes are actually different (even ignoring the index), so I wouldn't expect the results to be the same, just to take the same amount of time to calculate.

@jreback (Contributor, Author) commented Jun 25, 2013

If you were using loc that would be true, but remember iloc is based on locations, and it happens that all of the locations are in the first part, so they index the same. IOW, all of the elements in idx are < 300000 (which happens to be the len of df); df2 is 900k elements long.
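A scaled-down sketch of this point (toy sizes, and no 2*/3* scaling, so the blocks really are identical): when every position in idx falls inside the first block of the concatenated frame, the positional lookup lands entirely in the first copy and matches the original frame exactly.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [0.1] * 100, "B": [1] * 100})
df2 = pd.concat([df, df, df])     # 300 rows, labels 0..99 repeated 3x

idx = np.arange(30) * 3           # all positions fall in the first block

# Because every position in idx is < len(df), iloc on df2 selects the
# same rows (values and labels) as iloc on df.
assert df.iloc[idx].equals(df2.iloc[idx])
```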

@rhstanton (Contributor)

Oh yes - I meant to put df2 inside the concat, not df...

@jreback (Contributor, Author) commented Jun 25, 2013

one more question.....

In 0.11.1, non-unique indexing was changed to guarantee ordering, IOW you get rows back in the same order as you put them in; however, this comes at a speed penalty.

In [18]: %timeit df2.loc[idx]
1 loops, best of 3: 1.5 s per loop

With no ordering guarantee

In [17]: %timeit df2.loc[df2.index.isin(idx)]
10 loops, best of 3: 31.9 ms per loop

should prob just document this? what do you think? (this only really matters when you have lots of indexers)
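A toy sketch of the ordering difference between the two forms (small sizes, and only labels that are actually present, so it also runs on current pandas):

```python
import pandas as pd

# 10 rows whose labels 0..4 each appear twice.
df2 = pd.concat([pd.DataFrame({"A": [0.1] * 5, "B": [1] * 5})] * 2)

idx = [3, 1]

# Label lookup honours the order of idx: all rows labelled 3 first,
# then all rows labelled 1.
assert list(df2.loc[idx].index) == [3, 3, 1, 1]

# The boolean-mask route keeps the frame's own row order instead,
# which is part of what makes it so much cheaper.
assert list(df2.loc[df2.index.isin(idx)].index) == [1, 3, 1, 3]
```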

@rhstanton (Contributor)

Are those timings with a newer version than what's currently on github? Assuming this is the same df2 and idx as in the earlier example, df2.loc[idx] takes forever on my machine (about 5 minutes so far and it still hasn't returned a result, and that's without using %timeit!)


@jreback (Contributor, Author) commented Jun 25, 2013

From this PR: clone this branch and give it a try.

(master is very slow on loc with dup selections)

@rhstanton (Contributor)

Maybe this should be an option? The time difference really is huge.

Though I guess the user can always use the second version manually, as long as it's documented in very large letters somewhere (e.g., when you type df.loc( and press TAB)


@jreback (Contributor, Author) commented Jun 26, 2013

@rhstanton ok...give this a try, it's a bit faster....was doing stupid things; this is an easy problem once you figure it out.....

FYI, notice the different results between the two: isin doesn't care about ordering, nor about repeated elements in the indexer (in this case there aren't any). Also note that the index in the df2.loc[idx] case includes all of the elements you asked for, whether they have values or not (missing ones are NaN).

In [8]: df2.loc[idx]
Out[8]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5022 entries, 0 to 296901
Data columns (total 2 columns):
A    3033  non-null values
B    3033  non-null values
dtypes: float64(2)

In [9]: df2.loc[df2.index.isin(idx)]
Out[9]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3033 entries, 0 to 99990
Data columns (total 2 columns):
A    3033  non-null values
B    3033  non-null values
dtypes: float64(1), int64(1)

Here's an example using repeated entries; ordering is preserved:

In [14]: df2.loc[np.concatenate([idx,np.array([1,2,3,0])])]
Out[14]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5034 entries, 0 to 0
Data columns (total 2 columns):
A    3045  non-null values
B    3045  non-null values
dtypes: float64(2)
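The same behaviour on a toy frame. (One caveat: the NaN rows above come from passing labels that are missing from the index to .loc, which raises KeyError in later pandas versions, so this sketch uses only labels that are present.)

```python
import pandas as pd

df2 = pd.concat([pd.DataFrame({"A": [0.1] * 4, "B": [1] * 4})] * 2)
# Index: 0, 1, 2, 3, 0, 1, 2, 3

# Repeats in the indexer are honoured and the requested order is
# preserved: both rows labelled 1, both labelled 2, both labelled 1.
result = df2.loc[[1, 2, 1]]
assert list(result.index) == [1, 1, 2, 2, 1, 1]
```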

@rhstanton (Contributor)

Just tried it. It's much faster than before.

@jreback (Contributor, Author) commented Jun 26, 2013

gr8...thanks for your help...merging soon

- PERF: getting an indexer with a non_unique index, now MUCH faster
- PERF: vbench for loc/iloc with dups
- BUG: sparse reindex needed takeable arg
- TST
- BUG: correctly interpret tuple/list in non_unique indexers
- BUG: df.loc[idx] with out-of-bounds indexers not correctly interpreted
- PERF: df.loc with non-unique index now blazing fast!
jreback added a commit that referenced this pull request Jun 26, 2013
BUG: GH4017, efficiently support non-unique indices with iloc
@jreback jreback merged commit 3b28ece into pandas-dev:master Jun 26, 2013
@rhstanton (Contributor)

A (very) minor follow-up:

Here are some new test results, where the input dataframe is identical apart from one having a unique index and one having a repeated index. In both cases, iloc now runs very fast, but why does it take over 4x as long with the unique index? I'd have assumed that the timing of iloc should be independent of the index.

df = DataFrame({'A' : [0.1] * 30000000, 'B' : [1] * 30000000})
idx = array(range(30000)) * 99
%timeit a = df.iloc[idx]

1 loops, best of 3: 4.81 ms per loop

df2 = DataFrame({'A' : [0.1] * 10000000, 'B' : [1] * 10000000})
df2 = concat([df2, df2, df2])
%timeit a = df2.iloc[idx]

1 loops, best of 3: 1.15 ms per loop

@jreback (Contributor, Author) commented Jun 28, 2013

good point....

The unique case was converting positions to labels and then back to indexers, so it was doing some extra work; fixed in #4070.

Successfully merging this pull request may close these issues.

Odd behavior from df.iloc