Grouping-aggregation with first() discards categoricals column #22512

Closed
zoquda opened this issue Aug 26, 2018 · 9 comments · Fixed by #41712
Labels
Categorical (Categorical Data Type), good first issue, Needs Tests (Unit test(s) needed to prevent regressions), Nuisance Columns (Identifying/Dropping nuisance columns in reductions, groupby.add, DataFrame.apply)
Comments

@zoquda

zoquda commented Aug 26, 2018

Code Sample

import pandas as pd

# Two dataframes that are identical, except that one has a categorical column
df1 = pd.DataFrame({
    'A': [1, 1, 1, 2, 2],
    'B': [100, 100, 200, 100, 100],
    'C': ['apple', 'orange', 'mango', 'mango', 'orange'],
    'D': ['jupiter', 'mercury', 'mars', 'venus', 'venus'],
})
df2 = df1.astype({'D': 'category'})

# These groupby-aggregations all give results as expected
expected_result_1 = df1.groupby(by='A').first()
expected_result_2 = df2.groupby(by='A').first()
expected_result_3 = df1.groupby(by=['A', 'B']).first()
expected_result_4 = df1.groupby(by=['A', 'B']).head(1)
expected_result_5 = df2.groupby(by=['A', 'B']).head(1)

# This groupby-aggregation gives an unexpected result
unexpected_result = df2.groupby(by=['A', 'B']).first()

with the result dataframes looking as follows:

In [1]: expected_result_1
Out[1]:
     B      C        D
A
1  100  apple  jupiter
2  100  mango    venus

In [2]: expected_result_2
Out[2]:
     B      C        D
A
1  100  apple  jupiter
2  100  mango    venus

In [3]: expected_result_3
Out[3]:
           C        D
A B
1 100  apple  jupiter
  200  mango     mars
2 100  mango    venus

In [4]: expected_result_4
Out[4]:
   A    B      C        D
0  1  100  apple  jupiter
2  1  200  mango     mars
3  2  100  mango    venus

In [5]: expected_result_5
Out[5]:
   A    B      C        D
0  1  100  apple  jupiter
2  1  200  mango     mars
3  2  100  mango    venus

In [6]: unexpected_result
Out[6]:
           C
A B
1 100  apple
  200  mango
2 100  mango

Problem description

A grouping-aggregation operation with first() as the aggregation function and multiple columns passed as by silently discards the categorical column (see the unexpected_result dataframe in the code sample above). Besides being unexpected, this behaviour is inconsistent with the other, similar grouping-aggregation operations above, all of which keep column D.

See also this stackoverflow page.
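Since head(1) keeps column D in the sample above, one possible stopgap on affected versions is to build the first()-shaped result from head(1). This is only a sketch: the names keys and workaround are made up for illustration, and head(1) takes the first row of each group, whereas first() takes the first non-null value per column, so the two can differ when the data contains missing values.

```python
import pandas as pd

df1 = pd.DataFrame({
    'A': [1, 1, 1, 2, 2],
    'B': [100, 100, 200, 100, 100],
    'C': ['apple', 'orange', 'mango', 'mango', 'orange'],
    'D': ['jupiter', 'mercury', 'mars', 'venus', 'venus'],
})
df2 = df1.astype({'D': 'category'})

# Hypothetical stopgap: take the first row of each group, then move the
# grouping keys into the index so the shape matches what first() returns.
keys = ['A', 'B']
workaround = df2.groupby(by=keys).head(1).set_index(keys).sort_index()

print(workaround)
```

The sort_index() call is there because head(1) keeps the original row order, while first() orders the result by the group keys.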

Expected Output

The result of df2.groupby(by=['A', 'B']).first() would be expected as:

           C        D
A B
1 100  apple  jupiter
  200  mango     mars
2 100  mango    venus

The older pandas version 0.21.0 reportedly does produce this expected output.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 158 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.4
pytest: 3.7.2
pip: 10.0.1
setuptools: 40.0.0
Cython: 0.28.5
numpy: 1.15.0
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.5.0
sphinx: 1.7.7
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.7
feather: None
matplotlib: 2.2.3
openpyxl: 2.5.5
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.7
lxml: 4.2.4
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.10
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@gfyoung
Member

gfyoung commented Aug 27, 2018

The older pandas version 0.21.0 reportedly does produce this expected output.

@zoquda : Thanks for reporting this! When you say, "reportedly", can you confirm if it works in 0.21.0, and if so, can you confirm that that is the last version where your examples work?

@gfyoung gfyoung added Groupby Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff labels Aug 27, 2018
@zoquda
Author

zoquda commented Aug 27, 2018

@gfyoung Certainly. I've repeated the test with different pandas versions. The issue occurs from pandas version 0.23.0 onwards. In pandas version 0.22.0 all results are still as expected.

@gfyoung
Member

gfyoung commented Aug 27, 2018

Ah, so this looks like a regression then. Patch and PR are welcome then!

@gfyoung gfyoung added the Regression Functionality that used to work in a prior pandas version label Aug 27, 2018
@gfyoung gfyoung added this to the 0.23.5 milestone Aug 27, 2018
@elmq0022
Contributor

I'll take a look at this one. Thanks!

@MarvinT

MarvinT commented Sep 18, 2018

another similar example:

import pandas as pd
print(pd.__version__)
df = pd.DataFrame({'a': ['x','x','y'], 'b': [0,1,0], 'c': [7,8,9], 'd': ['test', 'test', 'not test']})
print(df.groupby(['a','b'], observed=True).agg(lambda x: x.iloc[0]))
df['d'] = df['d'].astype('category')
print(df.groupby(['a','b'], observed=True).agg(lambda x: x.iloc[0]))

yields

0.23.4
     c         d
a b             
x 0  7      test
  1  8      test
y 0  9  not test
     c
a b   
x 0  7
  1  8
y 0  9
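The same comparison can be made programmatically instead of by eye. A rough sketch follows; the helper name columns_lost_as_categorical is invented here and is not part of pandas.

```python
import pandas as pd


def take_first(x):
    # Same aggregation as the lambda in the example above.
    return x.iloc[0]


def columns_lost_as_categorical(df, keys, column):
    """List value columns that disappear once `column` is cast to category."""
    plain = df.groupby(keys, observed=True).agg(take_first)
    as_cat = (
        df.astype({column: 'category'})
        .groupby(keys, observed=True)
        .agg(take_first)
    )
    return sorted(set(plain.columns) - set(as_cat.columns))


df = pd.DataFrame({
    'a': ['x', 'x', 'y'],
    'b': [0, 1, 0],
    'c': [7, 8, 9],
    'd': ['test', 'test', 'not test'],
})
lost = columns_lost_as_categorical(df, ['a', 'b'], 'd')
print(lost)
```

Going by the printed output above, 0.23.4 would report 'd' as lost; a version without the problem should print an empty list.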

@jreback jreback modified the milestones: 0.23.5, 0.24.0 Oct 23, 2018
@elmq0022
Contributor

I haven't had much time to work on extra stuff lately. I will try to swing back to this soon.

@jreback jreback modified the milestones: 0.24.0, Contributions Welcome Dec 2, 2018
@mroeschke
Member

This appears to work on master now. Could use a test

In [9]: df1 = pd.DataFrame({
   ...:     'A': [1, 1, 1, 2, 2],
   ...:     'B': [100, 100, 200, 100, 100],
   ...:     'C': ['apple', 'orange', 'mango', 'mango', 'orange'],
   ...:     'D': ['jupiter', 'mercury', 'mars', 'venus', 'venus'],
   ...: })
   ...: df2 = df1.astype({'D': 'category'})

In [10]: unexpected_result = df2.groupby(by=['A', 'B']).first()
    ...:

In [11]: unexpected_result
Out[11]:
           C        D
A B
1 100  apple  jupiter
  200  mango     mars
2 100  mango    venus
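A regression test for this could look roughly like the following. It is a sketch only, assuming a pandas build that already shows the behaviour above; the test name is hypothetical and it uses plain asserts rather than whatever helpers the pandas test suite would prefer.

```python
import pandas as pd


def test_groupby_first_keeps_categorical_column():
    # Frame from the original report, with D cast to category.
    df = pd.DataFrame({
        'A': [1, 1, 1, 2, 2],
        'B': [100, 100, 200, 100, 100],
        'C': ['apple', 'orange', 'mango', 'mango', 'orange'],
        'D': ['jupiter', 'mercury', 'mars', 'venus', 'venus'],
    }).astype({'D': 'category'})

    result = df.groupby(by=['A', 'B']).first()

    # The categorical column must survive the aggregation ...
    assert list(result.columns) == ['C', 'D']
    # ... and hold the first value of each (A, B) group.
    assert result['D'].astype(str).tolist() == ['jupiter', 'mars', 'venus']
    return result


result = test_groupby_first_keeps_categorical_column()
```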

@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Groupby Regression Functionality that used to work in a prior pandas version labels May 11, 2020
@jbrockmendel jbrockmendel added Nuisance Columns Identifying/Dropping nuisance columns in reductions, groupby.add, DataFrame.apply Categorical Categorical Data Type labels Sep 21, 2020
@diptiw

diptiw commented Nov 10, 2020

I'll take!

@diptiw

diptiw commented Dec 8, 2020

Hi, I've started writing tests for the first() aggregation using DataFrames with different column types. Do additional tests also need to be written for other aggregation functions with categorical columns as part of this issue?
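If other aggregations were to be covered, one way to sketch it is to loop over method names and check that column D is kept each time. The choice of first and last below is an assumption made for illustration, not a list agreed in this thread, and grouped_columns is a made-up helper name.

```python
import pandas as pd


def grouped_columns(method_name):
    """Return the value columns left after calling the named groupby method."""
    df = pd.DataFrame({
        'A': [1, 1, 1, 2, 2],
        'B': [100, 100, 200, 100, 100],
        'C': ['apple', 'orange', 'mango', 'mango', 'orange'],
        'D': ['jupiter', 'mercury', 'mars', 'venus', 'venus'],
    }).astype({'D': 'category'})
    grouped = df.groupby(by=['A', 'B'])
    return list(getattr(grouped, method_name)().columns)


# Assumed set of methods to check; extend as needed.
results = {name: grouped_columns(name) for name in ('first', 'last')}
print(results)
```

In a real test suite the loop would more naturally be a parametrized test, one case per method name.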

@mroeschke mroeschke mentioned this issue May 29, 2021
15 tasks
@mroeschke mroeschke modified the milestones: Contributions Welcome, 1.3 May 29, 2021
8 participants