BUG: Pandas groupby indices behaving differently with 2 and 3 rows #18451

Closed
@mcdallas

Code Sample, a copy-pastable example if possible

import pandas as pd

df1 = pd.DataFrame({
    'Company': ['Foo Inc.', 'Foo Inc.', 'Foo Inc.'],
    'ID': ['123456', '123456', '123456'],
    'Employee': ['John Doe', 'Richard Roe', 'Jane Doe'],
    'Position': ['Executive Director', 'Director', 'Company Secretary']
})
    
df2 = pd.DataFrame({
    'Company': ['Bar Inc.', 'Bar Inc.'],
    'ID': ['56789', '56789'],
    'Employee': ['Mark Moe', 'Larry Loe'],
    'Position': ['Tax Consultant', 'Company Secretary']
})

print(df1)
    Company     Employee      ID            Position
0  Foo Inc.     John Doe  123456  Executive Director
1  Foo Inc.  Richard Roe  123456            Director
2  Foo Inc.     Jane Doe  123456   Company Secretary

print(df2)
    Company   Employee     ID           Position
0  Bar Inc.   Mark Moe  56789     Tax Consultant
1  Bar Inc.  Larry Loe  56789  Company Secretary

gb1 = df1.set_index(['Company', 'ID', 'Employee']).groupby(['Company', 'ID'])
gb2 = df2.set_index(['Company', 'ID', 'Employee']).groupby(['Company', 'ID'])
    
for (name, id), new_df in gb1:
    print(name)
    print(id)
    
for (name, id), new_df in gb2:
    print(name)
    print(id)

Foo Inc.
123456

      3     print(id)
      4
----> 5 for (name, id), new_df in gb2:
      6     print(name)
      7     print(id)

ValueError: too many values to unpack (expected 2)

Problem description

I have two DataFrames, df1 and df2. They have the same format; the only difference is that the first has 3 rows and the second has 2.

When I group each one and run the loops above, the first works but the second raises the ValueError shown.
This is because their indices are different:

gb1.indices
>>> {('Foo Inc.', '123456'): array([0, 1, 2], dtype=int64)}

gb2.indices
>>> {'Company': array([0], dtype=int64), 'ID': array([1], dtype=int64)}

The code above works if I replace the groupby line with:

gb2 = df2.set_index(['Company', 'ID', 'Employee']).groupby(level=['Company', 'ID'])
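For reference, here is the workaround as a self-contained sketch. It only restates the level= call above on df2; the names indexed, company_id and group_keys are introduced here for illustration and are not part of the original snippet.

```python
import pandas as pd

df2 = pd.DataFrame({
    'Company': ['Bar Inc.', 'Bar Inc.'],
    'ID': ['56789', '56789'],
    'Employee': ['Mark Moe', 'Larry Loe'],
    'Position': ['Tax Consultant', 'Company Secretary'],
})

# Move the three columns into a MultiIndex, as in the report.
indexed = df2.set_index(['Company', 'ID', 'Employee'])

# Passing the names through level= states explicitly that they are
# index levels, so each group key is a (Company, ID) tuple.
gb2 = indexed.groupby(level=['Company', 'ID'])

group_keys = []  # illustrative: collected so the keys can be inspected
for (name, company_id), new_df in gb2:
    group_keys.append((name, company_id))
    print(name)
    print(company_id)
```

Using company_id instead of id as the loop variable also avoids shadowing the Python builtin, though that is unrelated to the error.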

Expected Output

The output should be consistent in both cases.
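A rough sketch of how that expectation could be checked, assuming a pandas version in which grouping by index level names behaves the same regardless of row count. The helper group_keys is hypothetical and exists only for this example. On 0.20.1, going by the gb2.indices output above, the second frame would not produce a single (Company, ID) key.

```python
import pandas as pd


def group_keys(df):
    """Hypothetical helper: group by the index level names used in the
    report (without level=) and return the list of group keys."""
    indexed = df.set_index(['Company', 'ID', 'Employee'])
    return [key for key, _ in indexed.groupby(['Company', 'ID'])]


df1 = pd.DataFrame({
    'Company': ['Foo Inc.', 'Foo Inc.', 'Foo Inc.'],
    'ID': ['123456', '123456', '123456'],
    'Employee': ['John Doe', 'Richard Roe', 'Jane Doe'],
    'Position': ['Executive Director', 'Director', 'Company Secretary'],
})

df2 = pd.DataFrame({
    'Company': ['Bar Inc.', 'Bar Inc.'],
    'ID': ['56789', '56789'],
    'Employee': ['Mark Moe', 'Larry Loe'],
    'Position': ['Tax Consultant', 'Company Secretary'],
})

keys1 = group_keys(df1)
keys2 = group_keys(df2)

# Consistent behaviour means one (Company, ID) tuple per frame,
# whether the frame has 3 rows or 2.
print(keys1)
print(keys2)
```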

Output of pd.show_versions()

pandas: 0.20.1
pytest: 3.2.3
pip: 9.0.1
setuptools: 36.7.2
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 5.3.0
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.2.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.3
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

Labels

Groupby; Needs Tests (unit test(s) needed to prevent regressions)