BUG: reset_index on a MultiIndex with duplicate levels raises a ValueError #44410

Closed
3 tasks done
pymrkc opened this issue Nov 12, 2021 · 4 comments · Fixed by #44755
Labels: Docs, MultiIndex, Window (rolling, ewma, expanding)


@pymrkc

pymrkc commented Nov 12, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the master branch of pandas.

Reproducible Example

>>> import pandas
>>> from pandas import Timestamp
>>> df = pandas.DataFrame({'label': {1000: 'apple',
  1001: 'carrot',
  1002: 'carrot',
  1003: 'apple',
  1004: 'apple',
  1005: 'carrot'},
 'date': {1000: Timestamp('2021-10-27 00:00:00'),
  1001: Timestamp('2021-10-27 00:00:00'),
  1002: Timestamp('2021-10-28 00:00:00'),
  1003: Timestamp('2021-10-28 00:00:00'),
  1004: Timestamp('2021-10-29 00:00:00'),
  1005: Timestamp('2021-10-29 00:00:00')},
 'stock': {1000: 100,
  1001: 150,
  1002: 75,
  1003: 50,
  1004: 200,
  1005: 20}})
>>> df_rolling = df.set_index(['label', 'date']).groupby(level='label').rolling(window=7, min_periods=1).sum()
>>> df_rolling                                                                                                                                                                                     
                          stock
label  label  date             
apple  apple  2021-10-27  100.0
              2021-10-28  150.0
              2021-10-29  350.0
carrot carrot 2021-10-27  150.0
              2021-10-28  225.0
              2021-10-29  245.0
>>> df_rolling.index                                                                                                                                                                               
MultiIndex([( 'apple',  'apple', '2021-10-27'),
            ( 'apple',  'apple', '2021-10-28'),
            ( 'apple',  'apple', '2021-10-29'),
            ('carrot', 'carrot', '2021-10-27'),
            ('carrot', 'carrot', '2021-10-28'),
            ('carrot', 'carrot', '2021-10-29')],
           names=['label', 'label', 'date'])
>>> df_rolling = df_rolling.reset_index()                                                                                                                                                          
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-53-8b81c1e32ea2> in <module>
----> 1 df_rolling = df_rolling.reset_index()

/usr/local/lib/python3.8/dist-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    309                     stacklevel=stacklevel,
    310                 )
--> 311             return func(*args, **kwargs)
    312 
    313         return wrapper

/usr/local/lib/python3.8/dist-packages/pandas/core/frame.py in reset_index(self, level, drop, inplace, col_level, col_fill)
   5797                     )
   5798 
-> 5799                 new_obj.insert(0, name, level_values)
   5800 
   5801         new_obj.index = new_index

/usr/local/lib/python3.8/dist-packages/pandas/core/frame.py in insert(self, loc, column, value, allow_duplicates)
   4412         if not allow_duplicates and column in self.columns:
   4413             # Should this be a different kind of error??
-> 4414             raise ValueError(f"cannot insert {column}, already exists")
   4415         if not isinstance(loc, int):
   4416             raise TypeError("loc must be int")

ValueError: cannot insert label, already exists

Issue Description

In pandas 1.1.5, this code works fine; the MultiIndex looks as you'd expect it to, and the reset_index call works fine. This code breaks in 1.3.4.

Expected Behavior

>>> import pandas
>>> from pandas import Timestamp
>>> df = pandas.DataFrame({'label': {1000: 'apple',
  1001: 'carrot',
  1002: 'carrot',
  1003: 'apple',
  1004: 'apple',
  1005: 'carrot'},
 'date': {1000: Timestamp('2021-10-27 00:00:00'),
  1001: Timestamp('2021-10-27 00:00:00'),
  1002: Timestamp('2021-10-28 00:00:00'),
  1003: Timestamp('2021-10-28 00:00:00'),
  1004: Timestamp('2021-10-29 00:00:00'),
  1005: Timestamp('2021-10-29 00:00:00')},
 'stock': {1000: 100,
  1001: 150,
  1002: 75,
  1003: 50,
  1004: 200,
  1005: 20}})
>>> df_rolling = df.set_index(['label', 'date']).groupby(level='label').rolling(window=7, min_periods=1).sum()
>>> df_rolling
        stock
label        
apple   100.0
apple   150.0
apple   350.0
carrot  150.0
carrot  225.0
carrot  245.0
>>> df_rolling.index
MultiIndex([( 'apple',),
            ( 'apple',),
            ( 'apple',),
            ('carrot',),
            ('carrot',),
            ('carrot',)],
           names=['label'])
>>> df_rolling = df_rolling.reset_index()
>>> df_rolling.index
RangeIndex(start=0, stop=6, step=1)

Installed Versions

In [60]: pandas.show_versions()

INSTALLED VERSIONS

commit : 945c9ed
python : 3.8.10.final.0
python-bits : 64
OS : Linux
OS-release : 5.11.0-37-generic
Version : #41~20.04.2-Ubuntu SMP Fri Sep 24 09:06:38 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.3.4
numpy : 1.21.4
pytz : 2019.3
dateutil : 2.7.3
pip : 20.0.2
setuptools : 45.2.0
Cython : None
pytest : 6.2.4
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : 7.13.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

pymrkc added the Bug and Needs Triage labels on Nov 12, 2021
phofl added the Groupby, MultiIndex, and Window labels and removed the Needs Triage label on Nov 12, 2021
@mroeschke
Member

The MultiIndex result was intentionally changed as described in the release notes: https://pandas.pydata.org/docs/whatsnew/v1.3.0.html#groupby-rolling-with-multiindex-no-longer-drops-levels-in-the-result

Therefore, the potential bug is that calling reset_index on a MultiIndex with duplicate level names raises a ValueError.
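
A minimal sketch of the narrowed-down problem, independent of groupby or rolling. The index and the `stock` values are made up for illustration; only the duplicated level name `label` matters.

```python
import pandas as pd

# Hypothetical data: a MultiIndex whose first two levels share the name
# "label", mirroring the shape shown in the report.
index = pd.MultiIndex.from_arrays(
    [["apple", "carrot"], ["apple", "carrot"], ["2021-10-27", "2021-10-27"]],
    names=["label", "label", "date"],
)
df = pd.DataFrame({"stock": [100.0, 150.0]}, index=index)

try:
    df.reset_index()
    error_message = None
except ValueError as exc:
    # reset_index inserts each index level as a column and refuses to
    # insert a column name that is already present.
    error_message = str(exc)

print(error_message)
```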

mroeschke removed the Groupby and Window labels on Nov 14, 2021
mroeschke changed the title from "BUG: groupby and rolling on a df with a MultiIndex messes up the MultiIndex" to "BUG: reset_index on a MultiIndex with duplicate levels raises a ValueError" on Nov 14, 2021
@phofl
Member

phofl commented Nov 26, 2021

@mroeschke

We currently have one test covering this behavior: reset_index should raise a ValueError when names from the index already appear in the DataFrame's columns or when the index has duplicated names.

So we can either deprecate this behavior and change it in 2.0, or introduce a new keyword, something like allow_duplicates, which would be backwards compatible.

Edit: Or warn about this in the rolling docs. IMO, accidentally ending up with duplicates in the columns is not desirable.
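
To illustrate what an `allow_duplicates`-style keyword would mean, the sketch below uses the flag that `DataFrame.insert` already accepts (it is visible in the traceback above). The frame and values are made up; whether `reset_index` should expose the same flag is the open question in this thread.

```python
import pandas as pd

# Hypothetical frame that already has a "label" column.
df = pd.DataFrame({"label": ["apple", "carrot"], "stock": [100.0, 150.0]})

# Default behaviour: inserting an existing column name raises ValueError
# and leaves the frame unchanged.
try:
    df.insert(0, "label", ["apple", "carrot"])
    raised = False
except ValueError:
    raised = True

# Opting in explicitly: the frame ends up with two columns named "label".
df.insert(0, "label", ["apple", "carrot"], allow_duplicates=True)
columns = list(df.columns)
print(raised, columns)
```

With duplicate column names, `df["label"]` returns a DataFrame rather than a Series, which is one reason accidental duplicates are easy to trip over.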

@mroeschke
Member

I see, so the reset_index behavior on a MultiIndex is tested and expected behavior.

I don't have any strong opinions on whether to change that behavior (I'm okay with leaving it alone since it's easier), so in relation to the rolling behavior, I would be in favor of just documenting it.
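
For readers who hit this after `groupby(...).rolling(...)`, one possible workaround is to drop the redundant group level before calling `reset_index`. This is a sketch that assumes pandas 1.3 or later, where the group key is prepended as an extra index level; the data are abbreviated from the report.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "label": ["apple", "carrot", "carrot", "apple"],
        "date": pd.to_datetime(
            ["2021-10-27", "2021-10-27", "2021-10-28", "2021-10-28"]
        ),
        "stock": [100, 150, 75, 50],
    }
)

rolled = (
    df.set_index(["label", "date"])
    .groupby(level="label")
    .rolling(window=7, min_periods=1)
    .sum()
)
# Assumption: the index names are now ["label", "label", "date"],
# with level 0 being the prepended group key.

# Drop the outer (group key) level by position, then reset as usual.
flat = rolled.droplevel(0).reset_index()
print(flat)
```

If the group key should survive as its own column, renaming the outer level instead (for example with `rolled.rename_axis(index=["group", "label", "date"])`, where `group` is an arbitrary name) is another option.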

@johnzangwill
Contributor

take
