BUG: reset_index on a MultiIndex with duplicate levels raises a ValueError #44410

pymrkc · 2021-11-12T15:25:52Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the master branch of pandas.

Reproducible Example

>>> import pandas
>>> from pandas import Timestamp
>>> df = pandas.DataFrame({'label': {1000: 'apple',
  1001: 'carrot',
  1002: 'carrot',
  1003: 'apple',
  1004: 'apple',
  1005: 'carrot'},
 'date': {1000: Timestamp('2021-10-27 00:00:00'),
  1001: Timestamp('2021-10-27 00:00:00'),
  1002: Timestamp('2021-10-28 00:00:00'),
  1003: Timestamp('2021-10-28 00:00:00'),
  1004: Timestamp('2021-10-29 00:00:00'),
  1005: Timestamp('2021-10-29 00:00:00')},
 'stock': {1000: 100,
  1001: 150,
  1002: 75,
  1003: 50,
  1004: 200,
  1005: 20}})
>>> df_rolling = df.set_index(['label', 'date']).groupby(level='label').rolling(window=7, min_periods=1).sum()
>>> df_rolling
                          stock
label  label  date             
apple  apple  2021-10-27  100.0
              2021-10-28  150.0
              2021-10-29  350.0
carrot carrot 2021-10-27  150.0
              2021-10-28  225.0
              2021-10-29  245.0
>>> df_rolling.index
MultiIndex([( 'apple',  'apple', '2021-10-27'),
            ( 'apple',  'apple', '2021-10-28'),
            ( 'apple',  'apple', '2021-10-29'),
            ('carrot', 'carrot', '2021-10-27'),
            ('carrot', 'carrot', '2021-10-28'),
            ('carrot', 'carrot', '2021-10-29')],
           names=['label', 'label', 'date'])
>>> df_rolling = df_rolling.reset_index()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-53-8b81c1e32ea2> in <module>
----> 1 df_rolling = df_rolling.reset_index()

/usr/local/lib/python3.8/dist-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    309                     stacklevel=stacklevel,
    310                 )
--> 311             return func(*args, **kwargs)
    312 
    313         return wrapper

/usr/local/lib/python3.8/dist-packages/pandas/core/frame.py in reset_index(self, level, drop, inplace, col_level, col_fill)
   5797                     )
   5798 
-> 5799                 new_obj.insert(0, name, level_values)
   5800 
   5801         new_obj.index = new_index

/usr/local/lib/python3.8/dist-packages/pandas/core/frame.py in insert(self, loc, column, value, allow_duplicates)
   4412         if not allow_duplicates and column in self.columns:
   4413             # Should this be a different kind of error??
-> 4414             raise ValueError(f"cannot insert {column}, already exists")
   4415         if not isinstance(loc, int):
   4416             raise TypeError("loc must be int")

ValueError: cannot insert label, already exists

Issue Description

In pandas 1.1.5 this code works: the resulting index has the single level label, and the reset_index call succeeds. In 1.3.4 the same code raises the ValueError shown above.

Expected Behavior

>>> import pandas
>>> from pandas import Timestamp
>>> df = pandas.DataFrame({'label': {1000: 'apple',
  1001: 'carrot',
  1002: 'carrot',
  1003: 'apple',
  1004: 'apple',
  1005: 'carrot'},
 'date': {1000: Timestamp('2021-10-27 00:00:00'),
  1001: Timestamp('2021-10-27 00:00:00'),
  1002: Timestamp('2021-10-28 00:00:00'),
  1003: Timestamp('2021-10-28 00:00:00'),
  1004: Timestamp('2021-10-29 00:00:00'),
  1005: Timestamp('2021-10-29 00:00:00')},
 'stock': {1000: 100,
  1001: 150,
  1002: 75,
  1003: 50,
  1004: 200,
  1005: 20}})
>>> df_rolling = df.set_index(['label', 'date']).groupby(level='label').rolling(window=7, min_periods=1).sum()
>>> df_rolling
        stock
label        
apple   100.0
apple   150.0
apple   350.0
carrot  150.0
carrot  225.0
carrot  245.0
>>> df_rolling.index
MultiIndex([( 'apple',),
            ( 'apple',),
            ( 'apple',),
            ('carrot',),
            ('carrot',),
            ('carrot',)],
           names=['label'])
>>> df_rolling = df_rolling.reset_index()
>>> df_rolling.index
RangeIndex(start=0, stop=6, step=1)

Installed Versions

In [60]: pandas.show_versions()

INSTALLED VERSIONS

commit : 945c9ed
python : 3.8.10.final.0
python-bits : 64
OS : Linux
OS-release : 5.11.0-37-generic
Version : #41~20.04.2-Ubuntu SMP Fri Sep 24 09:06:38 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.3.4
numpy : 1.21.4
pytz : 2019.3
dateutil : 2.7.3
pip : 20.0.2
setuptools : 45.2.0
Cython : None
pytest : 6.2.4
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : 7.13.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None


mroeschke · 2021-11-14T05:18:17Z

The MultiIndex result was intentionally changed as described in the release notes: https://pandas.pydata.org/docs/whatsnew/v1.3.0.html#groupby-rolling-with-multiindex-no-longer-drops-levels-in-the-result

Therefore, the potential bug is that calling reset_index on a MultiIndex with duplicate level names raises a ValueError.
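
The failure does not depend on groupby or rolling. The following is a minimal sketch, assuming only that a frame's index carries two levels with the same name. The frame is built by hand for illustration, and the droplevel workaround is one possible option that nobody in this thread proposed:

```python
import pandas as pd

# Hand-built index that mimics the shape produced by groupby(...).rolling(...):
# the name "label" is used for two different levels.
index = pd.MultiIndex.from_arrays(
    [
        ["apple", "apple", "carrot"],
        ["apple", "apple", "carrot"],
        pd.to_datetime(["2021-10-27", "2021-10-28", "2021-10-27"]),
    ],
    names=["label", "label", "date"],
)
frame = pd.DataFrame({"stock": [100.0, 150.0, 150.0]}, index=index)

# With default arguments, reset_index refuses to create two "label" columns.
try:
    frame.reset_index()
    raised = False
except ValueError:
    raised = True

# Possible workaround: drop the redundant outer level by position
# (position 0, because the name "label" is ambiguous), then reset.
flat = frame.droplevel(0).reset_index()
```

Dropping by position avoids referring to the ambiguous name, and the remaining levels become ordinary columns.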

phofl · 2021-11-26T14:36:57Z

@mroeschke

We currently have one test covering this behavior: reset_index should raise a ValueError when either a name from the index already exists among the DataFrame's columns or the index has duplicated names.

So we can deprecate this behavior and change it in 2.0, or introduce a new keyword, something like allow_duplicates, which would be backwards compatible.
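
For illustration, here is how such a keyword could be used. This sketch assumes a pandas version whose DataFrame.reset_index accepts allow_duplicates (1.5 and later, as far as I know). It checks the signature first, so on older versions it simply skips the call. The frame is hand-built and hypothetical:

```python
import inspect

import pandas as pd

index = pd.MultiIndex.from_arrays(
    [
        ["apple", "carrot"],
        ["apple", "carrot"],
        pd.to_datetime(["2021-10-27", "2021-10-27"]),
    ],
    names=["label", "label", "date"],
)
frame = pd.DataFrame({"stock": [100.0, 150.0]}, index=index)

# Only pass the keyword if this pandas version knows about it.
params = inspect.signature(pd.DataFrame.reset_index).parameters
if "allow_duplicates" in params:
    flat = frame.reset_index(allow_duplicates=True)
    columns = list(flat.columns)  # both "label" levels become columns
else:
    columns = None  # keyword unavailable on this version
```

Because the keyword is opt-in, existing code that relies on the ValueError keeps its current behavior.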

Edit: Or warn about this in the rolling docs. IMO accidentally ending up with duplicates in the columns is not desirable.

mroeschke · 2021-11-26T19:59:15Z

I see, so the reset_index behavior on a MultiIndex is tested and expected.

I don't have any strong opinions on whether to change that behavior (I'm okay with leaving it alone since that's easier), so for the rolling behavior I would be in favor of just documenting it.
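
As a sketch of what such documentation might show (an assumption on my part, not text from the pandas docs): on pandas 1.3 and later, grouping by an index level keeps that level and also prepends the group key, while grouping by a column leaves the name in the index only once. The data is the reporter's example in a condensed form:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "label": ["apple", "carrot", "carrot", "apple", "apple", "carrot"],
        "date": pd.to_datetime(
            [
                "2021-10-27", "2021-10-27", "2021-10-28",
                "2021-10-28", "2021-10-29", "2021-10-29",
            ]
        ),
        "stock": [100, 150, 75, 50, 200, 20],
    }
)

# Grouping by the index level: on pandas >= 1.3 the result index has the
# group key prepended to the original levels, so "label" appears twice.
by_level = (
    df.set_index(["label", "date"])
    .groupby(level="label")
    .rolling(window=7, min_periods=1)
    .sum()
)
level_names = list(by_level.index.names)

# Grouping by the column instead: only "date" is in the index beforehand,
# so the result index is (label, date) and reset_index has no name clash.
by_column = (
    df.set_index("date")
    .groupby("label")
    .rolling(window=7, min_periods=1)
    .sum()
)
flat = by_column.reset_index()
```

The second form yields the same rolling sums as the reproducible example while avoiding the duplicated level name.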

johnzangwill · 2021-12-05T15:48:14Z

take

pymrkc added Bug, Needs Triage labels Nov 12, 2021

phofl added Groupby, MultiIndex, Window and removed Needs Triage labels Nov 12, 2021

mroeschke removed Groupby, Window labels Nov 14, 2021

mroeschke changed the title ~~BUG: groupby and rolling on a df with a MultiIndex messes up the MultiIndex~~ BUG: reset_index on a MultiIndex with duplicate levels raises a ValueError Nov 14, 2021

mroeschke added Docs, Window and removed Bug labels Nov 26, 2021

phofl mentioned this issue Dec 1, 2021

BUG: Apply ewm to GroupBy object results in duplicate index columns #44696

Closed

3 tasks

This was referenced Dec 3, 2021

Some refinements johnzangwill/pandas#5

Merged

ENH: reset_index on a MultiIndex with duplicate levels raises a ValueError #44755

Merged

github-actions bot assigned johnzangwill Dec 5, 2021

johnzangwill mentioned this issue Dec 29, 2021

ENH: Use flags.allows_duplicate_labels to define default insert behavior #45109

Closed

4 tasks

jreback added this to the 1.5 milestone Jan 30, 2022

jreback closed this as completed in #44755 Jan 30, 2022
