BUG: surprising and possibly erroneous behavior of GroupBy.apply with an indexed series (index winds up duplicated) #35670

Closed
2 tasks done
aecay opened this issue Aug 11, 2020 · 2 comments
Labels
Apply, Bug, Groupby, Needs Triage

aecay commented Aug 11, 2020

  • I have checked that this issue has not already been reported.
    • I can't find anything in the bug tracker that matches the symptoms I'm reporting here, although it's a bit difficult to search (I'm not totally sure how to describe it)
  • I have confirmed this bug exists on the latest version of pandas.

Code Sample, a copy-pastable example

import pandas as pd

data = [{"label": label, "x": x, "y": x + 1} for label in ("foo", "bar") for x in range(5)]
df = pd.DataFrame(data)
df = df.set_index(["label", "x"])
series = df["y"]
series2 = series.groupby(["label"]).apply(lambda s: s[2:])
print(series2.index)

# Output:

MultiIndex([('bar', 'bar', 2),
            ('bar', 'bar', 3),
            ('bar', 'bar', 4),
            ('foo', 'foo', 2),
            ('foo', 'foo', 3),
            ('foo', 'foo', 4)],
           names=['label', 'label', 'x'])

Problem description

The "label" level is duplicated in the index of the result.

Expected Output

I expect the index after the apply to have the same levels as before, i.e. to contain "label" only once.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : d9fff27
python : 3.8.0.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-42-generic
Version : #46-Ubuntu SMP Fri Jul 10 00:24:02 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.1.0
numpy : 1.19.1
pytz : 2020.1
dateutil : 2.8.1
pip : 19.3.1
setuptools : 42.0.1
Cython : 0.29.15
pytest : 5.4.0
hypothesis : None
sphinx : 2.2.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.4.2
html5lib : None
pymysql : None
psycopg2 : 2.8.4 (dt dec pq3 ext lo64)
jinja2 : 2.10.3
IPython : 7.11.1
pandas_datareader: None
bs4 : 4.8.1
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.3.0
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.5.2
sqlalchemy : None
tables : None
tabulate : 0.8.6
xarray : None
xlrd : None
xlwt : None
numba : None

aecay added the Bug and Needs Triage labels Aug 11, 2020
@rhshadrach
Member

Thanks for reporting this. You can drop "label" to get the desired output:

series2 = series.groupby(["label"]).apply(lambda s: s.droplevel('label')[2:])

producing

label  x
bar    2    3
       3    4
       4    5
foo    2    3
       3    4
       4    5
Name: y, dtype: int64

The "label" level is already part of the index of the argument s, and apply then adds the group key to the front of the result's index again, which is why you're getting it twice.
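To make the mechanism above concrete, here is a minimal sketch of two other ways to end up with a single "label" level. It assumes a pandas version in which `Series.groupby` accepts `group_keys`; the helper name `tail` is made up for illustration, and `.iloc[2:]` is used in place of `s[2:]` only to make the positional slice explicit. This is not necessarily how the maintainers would recommend handling it.

```python
import pandas as pd

# Same data as in the report.
data = [{"label": label, "x": x, "y": x + 1} for label in ("foo", "bar") for x in range(5)]
series = pd.DataFrame(data).set_index(["label", "x"])["y"]


def tail(s):
    # Hypothetical helper: drop the first two rows of each group (positional slice).
    return s.iloc[2:]


# Default behaviour: the group key is added in front of an index
# that already contains "label", so the level shows up twice.
with_keys = series.groupby(level="label").apply(tail)
print(list(with_keys.index.names))

# Option 1: ask groupby not to add the group keys to the result index.
without_keys = series.groupby(level="label", group_keys=False).apply(tail)
print(list(without_keys.index.names))

# Option 2: drop the extra level afterwards. The position (0) is used
# because the name "label" is ambiguous once it appears twice.
fixed = with_keys.droplevel(0)
print(list(fixed.index.names))
```

Both options leave the function passed to apply untouched, which may be convenient when that function is shared with other code.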

jbrockmendel added the Apply and Groupby labels Sep 3, 2020
@TomAugspurger
Contributor

I think this is effectively a duplicate of #34809. #34998 will help give more deterministic outputs.
