
BUG: GH4017, efficiently support non-unique indices with iloc #4018


Merged
merged 1 commit into from
Jun 26, 2013

Conversation

@jreback (Contributor) commented Jun 25, 2013

closes #4017

This was a bug: iloc was dealing with a non-unique index by reindexing, which
is not correct in this situation; it can instead effectively take by position.
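The distinction can be sketched on a toy frame (hypothetical labels, not taken from the sessions below): a positional take grabs exactly one row per requested position, while label-based lookup on a duplicated label matches every row carrying that label, which is why routing iloc through label machinery was wrong.

```python
import pandas as pd

# Toy frame with a duplicated label in the index.
df = pd.DataFrame({"A": [10, 20, 30]}, index=[0, 0, 1])

# A positional take: one row per requested position.
assert list(df.take([0, 2])["A"]) == [10, 30]

# Label-based lookup on the duplicated label 0 returns *both* rows,
# so it is not a substitute for positional selection.
assert len(df.loc[[0]]) == 2
```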

In [1]: df = DataFrame({'A' : [0.1] * 300000, 'B' : [1] * 300000})

In [2]: idx = np.array(range(3000)) * 99

In [3]: expected = df.iloc[idx]

In [4]: df2 = DataFrame({'A' : [0.1] * 100000, 'B' : [1] * 100000})

In [5]: df2 = pd.concat([df2, 2*df2, 3*df2])

In [6]: %timeit df2.iloc[idx]
1000 loops, best of 3: 221 us per loop

In [7]: %timeit df2.loc[idx]
10 loops, best of 3: 25.6 ms per loop
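As a sanity check, here is a scaled-down sketch (toy sizes, not the 300k-row frames above) showing that integer-array iloc on a frame with a non-unique index is equivalent to a plain positional take:

```python
import numpy as np
import pandas as pd

# Smaller analogue of the frames in the session above.
df2 = pd.concat([pd.DataFrame({"A": [0.1] * 100, "B": [1] * 100})] * 3)
idx = np.arange(30) * 9  # positions 0, 9, ..., 261, all < len(df2)

# On a non-unique index, positional selection is just a take.
assert df2.iloc[idx].equals(df2.take(idx))
```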

@rhstanton (Contributor)

Thanks for getting to this so quickly. A quick comment: because of the 2* and 3* in line 5 of the example above, the contents of the two dataframes are actually different (even ignoring the index), so I wouldn't expect the results to be the same, just to take the same amount of time to calculate.

@jreback (Contributor, Author) commented Jun 25, 2013

If you were using loc that would be true, but remember iloc is based on locations, and it happens that all of the locations are in the first part, so they index the same. IOW, all of the elements in idx are < 300000 (which happens to be the len of df); df2 is 900k elements long.
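A scaled-down sketch of this point (toy sizes, and no 2*/3* scaling, so the blocks really are identical): when every position in idx falls inside the first block of the concatenated frame, the positional lookup lands entirely in the first copy and matches the original frame exactly.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [0.1] * 100, "B": [1] * 100})
df2 = pd.concat([df, df, df])     # 300 rows, labels 0..99 repeated 3x

idx = np.arange(30) * 3           # all positions fall in the first block

# Because every position in idx is < len(df), iloc on df2 selects the
# same rows (values and labels) as iloc on df.
assert df.iloc[idx].equals(df2.iloc[idx])
```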

@rhstanton (Contributor)

Oh yes - I meant to put df2 inside the concat, not df...

@jreback (Contributor, Author) commented Jun 25, 2013

one more question.....

In 0.11.1, non-unique indexing was changed to guarantee ordering, IOW you get rows back in the same order as you put them in; however, this comes at a speed penalty.

In [18]: %timeit df2.loc[idx]
1 loops, best of 3: 1.5 s per loop

With no ordering guarantee

In [17]: %timeit df2.loc[df2.index.isin(idx)]
10 loops, best of 3: 31.9 ms per loop

should prob just document this? what do you think? (this only really matters when you have lots of indexers)
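A toy sketch of the ordering difference between the two forms (small sizes, and only labels that are actually present, so it also runs on current pandas):

```python
import pandas as pd

# 10 rows whose labels 0..4 each appear twice.
df2 = pd.concat([pd.DataFrame({"A": [0.1] * 5, "B": [1] * 5})] * 2)

idx = [3, 1]

# Label lookup honours the order of idx: all rows labelled 3 first,
# then all rows labelled 1.
assert list(df2.loc[idx].index) == [3, 3, 1, 1]

# The boolean-mask route keeps the frame's own row order instead,
# which is part of what makes it so much cheaper.
assert list(df2.loc[df2.index.isin(idx)].index) == [1, 3, 1, 3]
```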

@rhstanton (Contributor)

Are those timings with a newer version than what's currently on github? Assuming this is the same df2 and idx as in the earlier example, df2.loc[idx] takes forever on my machine (about 5 minutes so far and it still hasn't returned a result, and that's without using %timeit!)


@jreback (Contributor, Author) commented Jun 25, 2013

From this PR: clone this branch and give it a try.

(master is very slow on loc with dup selections)

@rhstanton (Contributor)

Maybe this should be an option? The time difference really is huge.

Though I guess the user can always use the second version manually, as long as it's documented in very large letters somewhere (e.g., when you type df.loc( and press TAB)


@jreback (Contributor, Author) commented Jun 26, 2013

@rhstanton ok...give this a try, it's a bit faster....was doing stupid things; this is an easy problem once you figure it out.....

FYI, notice the different results between the two: isin doesn't care about ordering, nor about repeated elements in the indexer (in this case there aren't any). Also note that the index in the df2.loc[idx] case includes all of the elements you asked for, whether they have values or not (missing ones are NaN).

In [8]: df2.loc[idx]
Out[8]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5022 entries, 0 to 296901
Data columns (total 2 columns):
A    3033  non-null values
B    3033  non-null values
dtypes: float64(2)

In [9]: df2.loc[df2.index.isin(idx)]
Out[9]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3033 entries, 0 to 99990
Data columns (total 2 columns):
A    3033  non-null values
B    3033  non-null values
dtypes: float64(1), int64(1)

Here's an example using repeated entries; ordering is preserved:

In [14]: df2.loc[np.concatenate([idx,np.array([1,2,3,0])])]
Out[14]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5034 entries, 0 to 0
Data columns (total 2 columns):
A    3045  non-null values
B    3045  non-null values
dtypes: float64(2)
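The same behaviour on a toy frame. (One caveat: the NaN rows above come from passing labels that are missing from the index to .loc, which raises KeyError in later pandas versions, so this sketch uses only labels that are present.)

```python
import pandas as pd

df2 = pd.concat([pd.DataFrame({"A": [0.1] * 4, "B": [1] * 4})] * 2)
# Index: 0, 1, 2, 3, 0, 1, 2, 3

# Repeats in the indexer are honoured and the requested order is
# preserved: both rows labelled 1, both labelled 2, both labelled 1.
result = df2.loc[[1, 2, 1]]
assert list(result.index) == [1, 1, 2, 2, 1, 1]
```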

@rhstanton (Contributor)

Just tried it. It's much faster than before.

@jreback (Contributor, Author) commented Jun 26, 2013

gr8...thanks for your help...merging soon

- PERF: getting an indexer with a non_unique index, now MUCH faster
- PERF: vbench for loc/iloc with dups
- BUG: sparse reindex needed takeable arg
- TST
- BUG: correctly interpret tuple/list in non_unique indexers
- BUG: df.loc[idx] with out-of-bounds indexers not correctly interpreted
- PERF: df.loc with non-unique index now blazing fast!
jreback added a commit that referenced this pull request Jun 26, 2013
BUG: GH4017, efficiently support non-unique indices with iloc
@jreback jreback merged commit 3b28ece into pandas-dev:master Jun 26, 2013
@rhstanton (Contributor)

A (very) minor follow-up:

Here are some new test results, where the input dataframe is identical apart from one having a unique index and one having a repeated index. In both cases, iloc now runs very fast, but why does it take over 4x as long with the unique index? I'd have assumed that the timing of iloc should be independent of the index.

df = DataFrame({'A' : [0.1] * 30000000, 'B' : [1] * 30000000})
idx = array(range(30000)) * 99
%timeit a = df.iloc[idx]

1 loops, best of 3: 4.81 ms per loop

df2 = DataFrame({'A' : [0.1] * 10000000, 'B' : [1] * 10000000})
df2 = concat([df2, df2, df2])
%timeit a = df2.iloc[idx]

1 loops, best of 3: 1.15 ms per loop

@jreback (Contributor, Author) commented Jun 28, 2013

good point....

The unique case was converting positions to labels and then back to indexers, so it was doing some extra work; fixed in #4070.

Successfully merging this pull request may close these issues.

Odd behavior from df.iloc