Groupby on empty frame different results based on dtypes. #20888

Closed
amanhanda opened this issue Apr 30, 2018 · 3 comments · Fixed by #29455
Labels
good first issue Needs Tests Unit test(s) needed to prevent regressions

@amanhanda

Code Sample, a copy-pastable example if possible

In [117]: import pandas as pd

In [118]: df = pd.DataFrame({"a":[1], "b":[2], "c":[3], "d":[4]})

In [119]: group_keys = ["a", "b", "c"]

In [120]: g = df[df.a==2].groupby(group_keys)

In [121]: g.first().index
Out[121]:
MultiIndex(levels=[[], [], []],
           labels=[[], [], []],
           names=[u'a', u'b', u'c'])
# Change the dtype for "d"
In [122]: df = pd.DataFrame({"a":[1], "b":[2], "c":[3], "d":["d"]})

In [123]: g = df[df.a==2].groupby(group_keys)

In [124]: g.first().index
Out[124]: Index([], dtype='object')

# Version 0.18
In [36]: g.first().index
Out[36]:
MultiIndex(levels=[[], [], []],
           labels=[[], [], []],
           names=[u'a', u'b', u'c'])

Problem description

groupby should return a MultiIndex in both cases. When all columns are int64, grouping an empty frame by several keys returns a MultiIndex. When column "d" has str (object) dtype, the same groupby returns an empty Index instead. This is inconsistent. The previous version, 0.18.0, returned a MultiIndex in both cases.

Expected Output

In [36]: g.first().index
Out[36]:
MultiIndex(levels=[[], [], []],
           labels=[[], [], []],
           names=[u'a', u'b', u'c'])

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.14.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-327.36.3.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.22.0
pytest: 3.5.0
pip: 9.0.3
setuptools: 39.0.1
Cython: 0.28.2
numpy: 1.14.2
scipy: 1.0.1
pyarrow: 0.9.0
xarray: 0.10.2
IPython: 5.6.0
sphinx: 1.7.2
patsy: 0.5.0
dateutil: 2.7.2
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.2
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.2.1
bs4: 4.3.2
html5lib: 0.999
sqlalchemy: 1.2.6
pymysql: None
psycopg2: 2.7.4 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@WillAyd
Member

WillAyd commented Apr 30, 2018

Can confirm this was happening in 0.22 but looks to be fine on master

@amanhanda
Author

What would be the prescribed workaround in 0.22?
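
No workaround is given in this thread. One possible sketch, assuming the only symptom is the flat empty Index shown above, is to rebuild an empty MultiIndex after the groupby. The helper name `first_with_multiindex` is hypothetical, and this has not been verified against 0.22 itself:

```python
import pandas as pd


def first_with_multiindex(frame, keys):
    """Hypothetical helper: groupby(keys).first(), forcing an empty
    MultiIndex when an empty result comes back with a flat Index."""
    result = frame.groupby(keys).first()
    needs_fix = (
        len(keys) > 1
        and len(result) == 0
        and not isinstance(result.index, pd.MultiIndex)
    )
    if needs_fix:
        # Build an empty MultiIndex with one (empty) level per group key.
        result.index = pd.MultiIndex.from_arrays([[] for _ in keys], names=keys)
    return result


df = pd.DataFrame({"a": [1], "b": [2], "c": [3], "d": ["d"]})
group_keys = ["a", "b", "c"]
fixed = first_with_multiindex(df[df.a == 2], group_keys)
print(type(fixed.index).__name__, list(fixed.index.names))
```

On versions where the bug is already fixed, the helper leaves the result untouched, so it should be safe to keep in place across upgrades.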

@mroeschke
Member

Yup looks to work on master. Could use a test.

In [181]: import pandas as pd
     ...:
     ...: df = pd.DataFrame({"a":[1], "b":[2], "c":[3], "d":[4]})
     ...:
     ...: group_keys = ["a", "b", "c"]
     ...:
     ...: g = df[df.a==2].groupby(group_keys)
     ...:
     ...: g.first().index
Out[181]: MultiIndex([], names=['a', 'b', 'c'])

In [182]: df = pd.DataFrame({"a":[1], "b":[2], "c":[3], "d":["d"]})

In [183]: g = df[df.a==2].groupby(group_keys)

In [184]: g.first().index
Out[184]: MultiIndex([], names=['a', 'b', 'c'])

In [185]: g.first().index
Out[185]: MultiIndex([], names=['a', 'b', 'c'])

In [186]: pd.__version__
Out[186]: '0.26.0.dev0+682.g08ab156eb'
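
A regression test for this could look roughly like the sketch below. It only checks what the issue asks for (a MultiIndex with the group key names, regardless of the dtype of "d"). The function name `check_empty_groupby_keeps_multiindex` is hypothetical, and this is not necessarily the test that ended up in the fixing PR:

```python
import pandas as pd


def check_empty_groupby_keeps_multiindex(d_value):
    """Hypothetical check: an empty multi-key groupby keeps a MultiIndex
    whatever the dtype of the non-key column "d" is."""
    group_keys = ["a", "b", "c"]
    df = pd.DataFrame({"a": [1], "b": [2], "c": [3], "d": [d_value]})
    # df.a == 2 matches no rows, so the groupby runs on an empty frame.
    result = df[df.a == 2].groupby(group_keys).first()
    assert isinstance(result.index, pd.MultiIndex)
    assert list(result.index.names) == group_keys
    assert len(result) == 0
    return result


# Exercise both the int64 case and the object-dtype case from the report.
results = [check_empty_groupby_keeps_multiindex(value) for value in (4, "d")]
```

In the pandas test suite this would more naturally be written with `pytest.mark.parametrize` over the two values of "d".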

5 participants