reindex from a duplicate axis: inconsistent behaviour #8849

Open
urraca opened this issue Nov 18, 2014 · 4 comments

@urraca

urraca commented Nov 18, 2014

The behaviour below occurs in version '0.15.1'.

When a series has a duplicate index, the method reindex will raise an exception, unless the index passed to reindex is identical to the series' index.

I propose that when a series has a duplicate index, the method reindex should always raise an exception, because when a series with a duplicate index is to be conformed to a new index, the intended behaviour is always ambiguous.

This issue applies to the methods reindex_like and reindex_axis too.

Examples of current behaviour:

(a)

>>> pd.Series([1, 2, 3], index=['a', 'b', 'b']).reindex(['a', 'b'])
ValueError: cannot reindex from a duplicate axis

(b)

>>> pd.Series([1, 2, 3], index=['a', 'b', 'b']).reindex(['a', 'b', 'b'])
a    1
b    2
b    3
dtype: int64

The exception message in (a) implies that (b) should raise; but it doesn't.
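
The proposal above can be approximated today without changing pandas. The following is a minimal sketch, assuming a wrapper named `strict_reindex` (a hypothetical helper, not part of pandas) that checks `Index.is_unique` before delegating to `Series.reindex`:

```python
import pandas as pd


def strict_reindex(series, new_index):
    """Hypothetical helper: reindex, but refuse any duplicate source index."""
    if not series.index.is_unique:
        # Raise even when new_index equals the existing index, as in case (b).
        raise ValueError("cannot reindex from a duplicate axis")
    return series.reindex(new_index)


dup = pd.Series([1, 2, 3], index=["a", "b", "b"])
for target in (["a", "b"], ["a", "b", "b"]):
    try:
        strict_reindex(dup, target)
    except ValueError as exc:
        print(target, "->", exc)
```

With this wrapper, both (a) and (b) raise, while a series with a unique index is reindexed as usual.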

@jreback
Contributor

jreback commented Nov 18, 2014

Can you try changing it and see what breaks in the test suite?

The comparison of index objects on a reindex is really for efficiency: since they are equal (or identical), no reindexing is necessary.

Why are you suggesting that this should raise? (meaning what is the use case)
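
The short-circuit described in this comment can be modelled in plain Python. This is a toy sketch, not pandas' actual code: `reindex_sketch` is a hypothetical function operating on lists, and it uses `None` where pandas would use NaN.

```python
def reindex_sketch(labels, values, target):
    """Toy model of the equality short-circuit, not pandas' implementation."""
    labels, target = list(labels), list(target)
    if labels == target:
        # Equal indexes: nothing to conform, so return the data unchanged.
        return list(zip(labels, values))
    if len(set(labels)) != len(labels):
        # Only reached when real reindexing would be needed.
        raise ValueError("cannot reindex from a duplicate axis")
    lookup = dict(zip(labels, values))
    # Labels missing from the source map to None here.
    return [(label, lookup.get(label)) for label in target]


# Case (b): identical index, so the duplicate check is never reached.
print(reindex_sketch(["a", "b", "b"], [1, 2, 3], ["a", "b", "b"]))
```

Because the equality test comes first, the duplicate check is skipped in case (b) and only fires in case (a).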

@urraca
Author

urraca commented Nov 18, 2014

In (b), why is what is returned the right answer? Why should it not be:

a    1
b    2
b    2
dtype: int64

It seems to me that this sort of ambiguity is the justification for (a) raising.

I'll have a think about use cases and I'll look at the test suite.

@jreback
Contributor

jreback commented Nov 18, 2014

@urraca ok, have a look, but I don't think what you just showed, e.g.

In [2]: Series([1,2,2],['a','b','b'])
Out[2]: 
a    1
b    2
b    2
dtype: int64

would be correct (e.g. why would it arbitrarily take the 2 and not the 3?)

It returns it unchanged, and doesn't do anything. You are suggesting it should raise.
So have a look at the test suite and see if this would impact anything.

This is what it would 'reindex' to if the indexes were not exactly the same:

In [7]: s.take(s.index.get_indexer_non_unique(s.index)[0])
Out[7]: 
a    1
b    2
b    3
b    2
b    3
dtype: int64
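
To see where the five rows come from, here is a rough pure-Python sketch of the matching step, assuming `s` is the series from example (b). `positions_non_unique` is a hypothetical stand-in that mimics only the first element returned by `get_indexer_non_unique`; it is not how pandas implements it.

```python
def positions_non_unique(source, target):
    """For each target label, list every position in source that carries it."""
    positions = []
    for label in target:
        matches = [i for i, src in enumerate(source) if src == label]
        # Assumption: -1 marks a label absent from source, as pandas indexers do.
        positions.extend(matches or [-1])
    return positions


index = ["a", "b", "b"]
values = [1, 2, 3]
taken = positions_non_unique(index, index)
print(taken)                       # [0, 1, 2, 1, 2]
print([values[i] for i in taken])  # [1, 2, 3, 2, 3]
```

Each of the two `'b'` labels in the target matches both `'b'` rows in the source, so the result has one `'a'` row and four `'b'` rows.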

@mrocklin
Contributor

I ran into something like this in 0.16.1

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'a': [1, 2, 3]}, index=[0, 1, 0])

In [3]: df.groupby(level=0).apply(lambda x: x)
ValueError: cannot reindex from a duplicate axis

I expected something like the same dataframe out again
