DataFrame.set_index when setting a duplicate name now raises #30965

TomAugspurger · 2020-01-13T13:58:13Z

As part of #30588, we now raise when trying to create a 2D index. This introduces a behavior change when you call DataFrame.set_index with a label that appears more than once in the columns.
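For context, here is a minimal sketch of the underlying check, assuming pandas 1.0 or later: passing 2D data straight to `pd.Index` is rejected. The variable names are illustrative only, and the exact message may differ between versions, so the sketch only records whether a `ValueError` was raised.

```python
import numpy as np
import pandas as pd

# A 2D array with shape (1, 2); an Index is expected to be 1-dimensional.
two_dim = np.array([[1, 2]])

try:
    pd.Index(two_dim)
except ValueError as exc:
    # On pandas >= 1.0 this branch is expected to run.
    message = str(exc)
else:
    # Older versions may silently build a broken Index instead.
    message = None

print(message)
```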

Code Sample, a copy-pastable example if possible

In [1]: import pandas as pd

In [2]: df = pd.DataFrame([[1, 2, 3]], columns=['a', 'a', 'b'])

In [3]: result = df.set_index('a')

On pandas 0.25.3, that gives back a DataFrame with a broken Index. Some DataFrame operations will work, but even things like printing the repr will fail.

# 0.25.3
In [17]: type(result)
Out[17]: pandas.core.frame.DataFrame

In [18]: result.shape
Out[18]: (1, 1)

With 1.0.0rc0, that raises:

~/sandbox/pandas/pandas/core/indexes/numeric.py in __new__(cls, data, dtype, copy, name)
     76         if subarr.ndim > 1:
     77             # GH#13601, GH#20285, GH#27125
---> 78             raise ValueError("Index data must be 1-dimensional")
     79
     80         name = maybe_extract_name(name, data, cls)

ValueError: Index data must be 1-dimensional

Problem description

The old output is clearly broken, so I wouldn't consider this a (major) regression. And I don't think people should be doing this in the first place. But I wanted to ask, should DataFrame.set_index(scalar) return a MultiIndex when scalar is a duplicate label?
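To make the question concrete, here is a rough sketch of what "return a MultiIndex" could look like if done by hand today. This is not what pandas does; `set_index_on_duplicate_label` and the generated level names (`a_0`, `a_1`) are hypothetical, chosen only to keep the level names distinct.

```python
import pandas as pd


def set_index_on_duplicate_label(frame, label):
    """Hypothetical helper: build a MultiIndex from every column named `label`."""
    # Boolean mask over the columns; True wherever the label matches.
    mask = frame.columns == label
    positions = [i for i, matched in enumerate(mask) if matched]

    # Select by position, since selecting by a duplicated label is ambiguous.
    arrays = [frame.iloc[:, i] for i in positions]
    # Assumed naming scheme for the levels; pandas prescribes nothing here.
    names = [f"{label}_{n}" for n in range(len(positions))]
    index = pd.MultiIndex.from_arrays(arrays, names=names)

    # Keep only the columns that were not moved into the index.
    result = frame.loc[:, ~mask].copy()
    result.index = index
    return result


df = pd.DataFrame([[1, 2, 3]], columns=["a", "a", "b"])
result = set_index_on_duplicate_label(df, "a")
print(result)
```

With the frame from the code sample, the result keeps only column `b` and has a two-level index holding the values of both `a` columns.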


sfc-gh-mvashishtha · 2024-10-10T21:44:24Z

should DataFrame.set_index(scalar) return a MultiIndex when scalar is a duplicate label?

I think so.

Also, the behavior of set_index(duplicate_column_name) seems to depend on the dtypes of the columns that share a name. For timedelta + int, we get an Index of tuples, whereas for two int columns, we get a ValueError:

import pandas as pd

df = pd.DataFrame([[pd.Timedelta(1), 1, 2]], columns=['a', 'a', 'b'])

# this works and sets the index to the single tuple (0 days 00:00:00.000000001, 1)
print(df.set_index('a'))

df = pd.DataFrame([[1, 1, 2]], columns=['a', 'a', 'b'])

# this raises `ValueError: Index data must be 1-dimensional`
print(df.set_index('a'))
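Until the behavior is settled, one way to get a consistent outcome regardless of dtype is to check for a duplicated label before calling `set_index`. The sketch below is only a workaround idea; `set_index_unique_label` is a hypothetical wrapper, not a pandas API.

```python
import pandas as pd


def set_index_unique_label(frame, label):
    """Hypothetical wrapper: refuse to set the index from a duplicated label."""
    # Count how many columns carry this label.
    matches = int((frame.columns == label).sum())
    if matches > 1:
        raise ValueError(
            f"label {label!r} matches {matches} columns; "
            "set_index needs a unique label"
        )
    return frame.set_index(label)


mixed = pd.DataFrame([[pd.Timedelta(1), 1, 2]], columns=["a", "a", "b"])
ints = pd.DataFrame([[1, 1, 2]], columns=["a", "a", "b"])

# Both frames now fail the same way, independent of the column dtypes.
outcomes = []
for frame in (mixed, ints):
    try:
        set_index_unique_label(frame, "a")
    except ValueError:
        outcomes.append("raised")
    else:
        outcomes.append("ok")

print(outcomes)
```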

TomAugspurger added the API Design and MultiIndex labels Jan 13, 2020

TomAugspurger added a commit to TomAugspurger/dask that referenced this issue Jan 13, 2020

Fixed duplicate index

e931248

xref pandas-dev/pandas#30965

TomAugspurger added a commit to dask/dask that referenced this issue Jan 13, 2020

Pandas 1.0 compat (#5782)

0b9a62b

* Use pytest warns
* Fixed duplicate index: xref pandas-dev/pandas#30965

mroeschke added the Needs Discussion label Jul 27, 2021

TomAugspurger mentioned this issue Oct 11, 2022

Forbid columns with duplicate names dask/dask#9422

