Grouping-aggregation with first() discards categoricals column #22512

zoquda · 2018-08-26T16:57:02Z

Code Sample

import pandas as pd

# Two dataframes that are identical, except that one has a categorical column
df1 = pd.DataFrame({
    'A': [1, 1, 1, 2, 2],
    'B': [100, 100, 200, 100, 100],
    'C': ['apple', 'orange', 'mango', 'mango', 'orange'],
    'D': ['jupiter', 'mercury', 'mars', 'venus', 'venus'],
})
df2 = df1.astype({'D': 'category'})

# These groupby-aggregations all give results as expected
expected_result_1 = df1.groupby(by='A').first()
expected_result_2 = df2.groupby(by='A').first()
expected_result_3 = df1.groupby(by=['A', 'B']).first()
expected_result_4 = df1.groupby(by=['A', 'B']).head(1)
expected_result_5 = df2.groupby(by=['A', 'B']).head(1)

# This groupby-aggregation gives an unexpected result
unexpected_result = df2.groupby(by=['A', 'B']).first()

with the result dataframes looking as follows:

In [1]: expected_result_1
Out[1]:
     B      C        D
A
1  100  apple  jupiter
2  100  mango    venus

In [2]: expected_result_2
Out[2]:
     B      C        D
A
1  100  apple  jupiter
2  100  mango    venus

In [3]: expected_result_3
Out[3]:
           C        D
A B
1 100  apple  jupiter
  200  mango     mars
2 100  mango    venus

In [4]: expected_result_4
Out[4]:
   A    B      C        D
0  1  100  apple  jupiter
2  1  200  mango     mars
3  2  100  mango    venus

In [5]: expected_result_5
Out[5]:
   A    B      C        D
0  1  100  apple  jupiter
2  1  200  mango     mars
3  2  100  mango    venus

In [6]: unexpected_result
Out[6]:
           C
A B
1 100  apple
  200  mango
2 100  mango

Problem description

A grouping-aggregation operation that uses first() as the aggregation function and multiple columns as by seems to discard categorical columns (see the unexpected_result dataframe in the code example above). Besides being unexpected, this behaviour is inconsistent with the other, similar grouping-aggregation operations above, all of which keep column D.

Expected Output

The expected result of df2.groupby(by=['A', 'B']).first() is:

           C        D
A B
1 100  apple  jupiter
  200  mango     mars
2 100  mango    venus

The older pandas version 0.21.0 reportedly does produce this expected output.
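A quick way to check whether a given pandas installation is affected is to compare the result columns with and without the categorical dtype. This is only a sketch built on the data from the report; the variable names `plain`, `categorical`, and `dropped` are illustrative and not part of the original code.

```python
import pandas as pd

# Same data as in the report above.
df1 = pd.DataFrame({
    'A': [1, 1, 1, 2, 2],
    'B': [100, 100, 200, 100, 100],
    'C': ['apple', 'orange', 'mango', 'mango', 'orange'],
    'D': ['jupiter', 'mercury', 'mars', 'venus', 'venus'],
})
df2 = df1.astype({'D': 'category'})

plain = df1.groupby(by=['A', 'B']).first()
categorical = df2.groupby(by=['A', 'B']).first()

# Columns that survive without the categorical dtype but vanish with it.
dropped = plain.columns.difference(categorical.columns)
print(list(dropped))
```

Based on the outputs shown in this report, an affected version such as 0.23.4 should list D here, while a version without the problem should print an empty list.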

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 158 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.4
pytest: 3.7.2
pip: 10.0.1
setuptools: 40.0.0
Cython: 0.28.5
numpy: 1.15.0
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.5.0
sphinx: 1.7.7
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.7
feather: None
matplotlib: 2.2.3
openpyxl: 2.5.5
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.7
lxml: 4.2.4
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.10
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None


gfyoung · 2018-08-27T09:52:00Z

> The older pandas version 0.21.0 reportedly does produce this expected output.

@zoquda : Thanks for reporting this! When you say, "reportedly", can you confirm if it works in 0.21.0, and if so, can you confirm that that is the last version where your examples work?

zoquda · 2018-08-27T12:38:49Z

@gfyoung Certainly. I've repeated the test with different pandas versions. The issue occurs from pandas version 0.23.0 onwards. In pandas version 0.22.0 all results are still as expected.

gfyoung · 2018-08-27T13:03:15Z

Ah, so this looks like a regression then. Patch and PR are welcome then!

elmq0022 · 2018-09-18T01:28:23Z

I'll take a look at this one. Thanks!

MarvinT · 2018-09-18T14:44:08Z

Another similar example:

import pandas as pd
print(pd.__version__)
df = pd.DataFrame({'a': ['x','x','y'], 'b': [0,1,0], 'c': [7,8,9], 'd': ['test', 'test', 'not test']})
print(df.groupby(['a','b'], observed=True).agg(lambda x: x.iloc[0]))
df['d'] = df['d'].astype('category')
print(df.groupby(['a','b'], observed=True).agg(lambda x: x.iloc[0]))

yields

0.23.4
     c         d
a b             
x 0  7      test
  1  8      test
y 0  9  not test
     c
a b   
x 0  7
  1  8
y 0  9
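Since both examples show that object-dtype columns survive the aggregation, one possible workaround on affected versions is to cast the categorical value columns to object before grouping and restore the dtype afterwards. This is a sketch that nobody in this thread proposed, and it has not been verified against 0.23.x; the helper name `first_keeping_categoricals` is hypothetical.

```python
import pandas as pd


def first_keeping_categoricals(frame, keys):
    """Group by `keys`, take the first row per group, keep categorical columns.

    Hypothetical helper: categorical value columns are temporarily cast to
    object so the aggregation treats them like ordinary string columns.
    """
    cat_cols = [
        col for col in frame.select_dtypes(include='category').columns
        if col not in keys
    ]
    original_dtypes = {col: frame[col].dtype for col in cat_cols}

    as_object = frame.astype({col: object for col in cat_cols})
    result = as_object.groupby(keys).first()

    # Restore the original categorical dtypes on the aggregated result.
    return result.astype(original_dtypes)


df = pd.DataFrame({
    'a': ['x', 'x', 'y'],
    'b': [0, 1, 0],
    'c': [7, 8, 9],
    'd': ['test', 'test', 'not test'],
})
df['d'] = df['d'].astype('category')

result = first_keeping_categoricals(df, ['a', 'b'])
print(result)
```

The round trip through object costs an extra copy of each categorical column, so this is better suited as a stopgap than as a permanent replacement for a direct first() call.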

elmq0022 · 2018-10-29T02:59:27Z

I haven't had much time to work on extra stuff lately. I will try to swing back to this soon.

mroeschke · 2020-05-11T01:04:10Z

This appears to work on master now. Could use a test

In [9]: df1 = pd.DataFrame({
   ...:     'A': [1, 1, 1, 2, 2],
   ...:     'B': [100, 100, 200, 100, 100],
   ...:     'C': ['apple', 'orange', 'mango', 'mango', 'orange'],
   ...:     'D': ['jupiter', 'mercury', 'mars', 'venus', 'venus'],
   ...: })
   ...: df2 = df1.astype({'D': 'category'})

In [10]: unexpected_result = df2.groupby(by=['A', 'B']).first()
    ...:

In [11]: unexpected_result
Out[11]:
           C        D
A B
1 100  apple  jupiter
  200  mango     mars
2 100  mango    venus
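A regression test for this could look roughly like the sketch below. It is not the test that was eventually merged; the function name is made up, and it assumes that first() keeps the full set of categories on column D, which is why the expected categories are taken from the input frame.

```python
import pandas as pd
import pandas.testing as tm


def test_groupby_first_keeps_categorical_column():
    # Hypothetical test name; data taken from the original report (GH 22512).
    df = pd.DataFrame({
        'A': [1, 1, 1, 2, 2],
        'B': [100, 100, 200, 100, 100],
        'C': ['apple', 'orange', 'mango', 'mango', 'orange'],
        'D': ['jupiter', 'mercury', 'mars', 'venus', 'venus'],
    }).astype({'D': 'category'})

    result = df.groupby(by=['A', 'B']).first()

    expected = pd.DataFrame(
        {
            'C': ['apple', 'mango', 'mango'],
            # Assumption: the aggregation preserves the original categories.
            'D': pd.Categorical(
                ['jupiter', 'mars', 'venus'],
                categories=df['D'].cat.categories,
            ),
        },
        index=pd.MultiIndex.from_tuples(
            [(1, 100), (1, 200), (2, 100)], names=['A', 'B']
        ),
    )
    tm.assert_frame_equal(result, expected)


test_groupby_first_keeps_categorical_column()
```

The final line only makes the snippet runnable as a plain script; under pytest the function would be collected and run automatically.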

diptiw · 2020-11-10T03:24:16Z

I'll take!

diptiw · 2020-12-08T03:17:36Z

Hi, I've started writing tests for the first() aggregate function using DataFrames with different column types. I was wondering whether additional tests need to be written for other aggregate functions with categorical columns for this issue?
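The thread does not answer this question, but a broader check is easy to sketch. The choice of aggregations below (first(), last(), and the agg(lambda x: x.iloc[0]) form from the earlier comment) is an assumption for illustration, not a list agreed on by the maintainers.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 1, 1, 2, 2],
    'B': [100, 100, 200, 100, 100],
    'C': ['apple', 'orange', 'mango', 'mango', 'orange'],
    'D': ['jupiter', 'mercury', 'mars', 'venus', 'venus'],
}).astype({'D': 'category'})

# Illustrative selection of aggregations that pick one existing row per group.
aggregations = {
    'first': lambda grouped: grouped.first(),
    'last': lambda grouped: grouped.last(),
    'agg_iloc0': lambda grouped: grouped.agg(lambda x: x.iloc[0]),
}

# Record, per aggregation, whether the categorical column D is still present.
retained = {
    name: 'D' in aggregate(df.groupby(by=['A', 'B'])).columns
    for name, aggregate in aggregations.items()
}
print(retained)
```

Reductions such as min() or max() are left out on purpose, because they are not defined for unordered categoricals and would need separate handling.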

gfyoung added the Groupby and Algos labels Aug 27, 2018

gfyoung added the Regression label Aug 27, 2018

gfyoung added this to the 0.23.5 milestone Aug 27, 2018

jreback modified the milestones: 0.23.5, 0.24.0 Oct 23, 2018

jreback modified the milestones: 0.24.0, Contributions Welcome Dec 2, 2018

TomAugspurger mentioned this issue Jan 3, 2019

DEPR: __array__ for tz-aware Series/Index #24596

Merged

mroeschke added the good first issue and Needs Tests labels and removed the Algos, Groupby, and Regression labels May 11, 2020

jbrockmendel added the Nuisance Columns and Categorical labels Sep 21, 2020

mroeschke mentioned this issue May 29, 2021

TST: More old issues #41712

Merged


mroeschke modified the milestones: Contributions Welcome, 1.3 May 29, 2021

jreback closed this as completed in #41712 May 31, 2021
