DataFrame.set_index when setting a duplicate name now raises #30965

Open
TomAugspurger opened this issue Jan 13, 2020 · 1 comment
Labels
API Design MultiIndex Needs Discussion Requires discussion from core team before further action

Comments

@TomAugspurger
Contributor

As part of #30588, we now raise when trying to create a 2D index. This introduces a behavior change when you call DataFrame.set_index with duplicate data.

Code Sample, a copy-pastable example if possible

In [1]: import pandas as pd

In [2]: df = pd.DataFrame([[1, 2, 3]], columns=['a', 'a', 'b'])

In [3]: result = df.set_index('a')

On pandas 0.25.3, that gives back a DataFrame with a broken Index. Some DataFrame operations will work, but even basic things like printing the repr will fail:

# 0.25.3
In [17]: type(result)
Out[17]: pandas.core.frame.DataFrame

In [18]: result.shape
Out[18]: (1, 1)

With 1.0.0rc0, that raises:

~/sandbox/pandas/pandas/core/indexes/numeric.py in __new__(cls, data, dtype, copy, name)
     76         if subarr.ndim > 1:
     77             # GH#13601, GH#20285, GH#27125
---> 78             raise ValueError("Index data must be 1-dimensional")
     79
     80         name = maybe_extract_name(name, data, cls)

ValueError: Index data must be 1-dimensional
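A likely explanation for the 2D data, offered as a sketch and not as a trace through the pandas source: selecting a duplicated column label returns a DataFrame, not a Series, so the values that end up being passed to the Index constructor have two dimensions.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3]], columns=["a", "a", "b"])

# With a duplicated label, column selection yields a DataFrame, not a Series.
selected = df["a"]
print(type(selected).__name__)
print(selected.shape)

# The underlying values are therefore 2-dimensional.
print(selected.to_numpy().ndim)
```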

Problem description

The old output is clearly broken, so I wouldn't consider this a (major) regression. And I don't think people should be doing this in the first place. But I wanted to ask, should DataFrame.set_index(scalar) return a MultiIndex when scalar is a duplicate label?
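For discussion, here is a minimal sketch of what a MultiIndex result could look like, built by hand with public APIs. This is not what set_index does today. The level names `a_0` and `a_1` are made up for illustration, and whether the levels should instead share the name `a` is part of the open question.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3]], columns=["a", "a", "b"])

# Boolean mask over the columns that share the duplicated label.
mask = df.columns == "a"
dup = df.loc[:, mask]

# One index level per duplicated column, selected by position.
levels = [dup.iloc[:, i].to_numpy() for i in range(dup.shape[1])]
index = pd.MultiIndex.from_arrays(levels, names=["a_0", "a_1"])  # hypothetical names

# Keep the remaining columns and attach the new index.
result = df.loc[:, ~mask].copy()
result.index = index
print(result)
```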

TomAugspurger added a commit to TomAugspurger/dask that referenced this issue Jan 13, 2020
TomAugspurger added a commit to dask/dask that referenced this issue Jan 13, 2020
* Use pytest warns
* Fixed duplicate index: xref pandas-dev/pandas#30965
@mroeschke mroeschke added the Needs Discussion Requires discussion from core team before further action label Jul 27, 2021
@sfc-gh-mvashishtha

should DataFrame.set_index(scalar) return a MultiIndex when scalar is a duplicate label?

I think so.

Also, the behavior of set_index(duplicate_column_name) seems to depend on the types of the columns that share a name. For timedelta + int, we get an Index of tuples, whereas for 2 int columns, we get a ValueError:

import pandas as pd

df = pd.DataFrame([[pd.Timedelta(1), 1, 2]], columns=['a', 'a', 'b'])

# this works and sets the index to the single tuple (0 days 00:00:00.000000001, 1)
print(df.set_index('a'))

df = pd.DataFrame([[1, 1, 2]], columns=['a', 'a', 'b'])

# this raises `ValueError: Index data must be 1-dimensional`
print(df.set_index('a'))
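Since the outcome varies with both the pandas version and the column dtypes, user code can sidestep the ambiguity by checking that the label is unique before calling set_index. The helper below is a hypothetical sketch, not a pandas API, and it assumes the label is a scalar.

```python
import pandas as pd


def set_index_unique(df, label):
    """Hypothetical helper: call set_index only if `label` names exactly one column."""
    matches = int((df.columns == label).sum())
    if matches != 1:
        raise ValueError(
            f"expected exactly one column named {label!r}, found {matches}"
        )
    return df.set_index(label)


df = pd.DataFrame([[1, 1, 2]], columns=["a", "a", "b"])
try:
    set_index_unique(df, "a")
except ValueError as exc:
    # The duplicate label is rejected before pandas tries to build the index.
    print(exc)
```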
