Skip to content

REGR: Bug fix for ExtensionArray groupby aggregation on non-numeric types #38982

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Jan 13, 2021

Conversation

BryanCutler
Copy link
Contributor

@BryanCutler BryanCutler commented Jan 5, 2021

@BryanCutler
Copy link
Contributor Author

Exception handling that checks for this particular message in order to fallback to non-cython impl is here https://github.com/pandas-dev/pandas/blob/master/pandas/core/groupby/groupby.py#L1047

I can work on adding a unit test for this

@jbrockmendel
Copy link
Member

need tests

@jorisvandenbossche jorisvandenbossche changed the title [#38980] Bug fix for ExtensionArray groupby aggregation on non-numeric types REGR: Bug fix for ExtensionArray groupby aggregation on non-numeric types Jan 6, 2021
@jorisvandenbossche jorisvandenbossche added Groupby Regression Functionality that used to work in a prior pandas version ExtensionArray Extending pandas with custom dtypes or arrays. labels Jan 6, 2021
@jorisvandenbossche jorisvandenbossche added this to the 1.2.1 milestone Jan 6, 2021
@jorisvandenbossche
Copy link
Member

@BryanCutler this is a small example that I was testing that worked before and is failing now, so you can use that as a base for writing tests if you like

In [1]: df = pd.DataFrame({'key': ['a', 'a', 'b', 'b'], 'val': ['a', 'b', 'c', 'd']}, dtype="string")

In [2]: df.groupby("key").agg({'val': 'first'})  # or 'min'
Out[2]: 
    val
key    
a     a
b     c

Specifying 'sum' doesn't work in case of the "string" dtype, since it raises an error (because for now we don't allow "sum" on a string type, something we probably should). But "sum" apparently takes a slightly different route and doesn't try the cython_agg.

Also with the decimal test extension type that is included in our tests for such testing purposes, it can be reproduced:

from pandas.tests.extension.decimal import DecimalArray, make_data
df = pd.DataFrame({'key': ['a', 'a', 'b', 'b'], 'val': DecimalArray(make_data()[:4])})
df.groupby("key").agg({'val': 'min'})

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep pls add tests

@BryanCutler
Copy link
Contributor Author

BryanCutler commented Jan 6, 2021

Thanks for the code samples @jorisvandenbossche. I tried to add a base extension test to cover more types and looks good for all except decimal and json, which give the following error:

E           AssertionError: ExtensionArray are different
E           
E           Attribute "dtype" are different
E           [left]:  object
E           [right]: decimal

https://dev.azure.com/pandas-dev/pandas/_build/results?buildId=51924&view=logs&j=2d7fb38a-2053-50f3-a67c-09f6e91d3121&t=449937cc-3d50-56b5-5662-e489f41f1268&l=177

Not sure if it's a problem with wrapping back up the extension array, or maybe these tests need a specialization. I'll look into it a little later..

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like failing on 32-bit

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Two small suggestions

Comment on lines 39 to 43
result = df.groupby("A").agg({"B": "first"}).B.array

expected = df["B"].iloc[[0, 2, 4, 7]].array

self.assert_extension_array_equal(result, expected)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
result = df.groupby("A").agg({"B": "first"}).B.array
expected = df["B"].iloc[[0, 2, 4, 7]].array
self.assert_extension_array_equal(result, expected)
expected = df["B"].iloc[[0, 2, 4, 7]].array
result = df.groupby("A").agg({"B": "first"}).B.array
self.assert_extension_array_equal(result, expected)
result = df.groupby("A").agg("first").B.array
self.assert_extension_array_equal(result, expected)
result = df.groupby("A").first().B.array
self.assert_extension_array_equal(result, expected)

Those different ways apparently take a somewhat different code path, given that the two added ones actually still worked (but so worth testing those different ways)

@BryanCutler BryanCutler force-pushed the ea-agg-fallback-fix-38980 branch from dd77772 to 6ba43dc Compare January 12, 2021 06:05
@BryanCutler
Copy link
Contributor Author

@jreback I addressed the feeback and tests have passed. Please take another look when you can, thanks!

@jreback jreback merged commit 396131a into pandas-dev:master Jan 13, 2021
@jreback
Copy link
Contributor

jreback commented Jan 13, 2021

thanks @BryanCutler very nice!

@jreback
Copy link
Contributor

jreback commented Jan 13, 2021

@meeseeksdev backport 1.2.x

@jorisvandenbossche
Copy link
Member

Thanks @BryanCutler !

jreback pushed a commit that referenced this pull request Jan 13, 2021
@BryanCutler
Copy link
Contributor Author

Thank you all for reviewing!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExtensionArray Extending pandas with custom dtypes or arrays. Groupby Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging this pull request may close these issues.

REGR: ExtensionArray aggregation on non-numeric types fails
4 participants