REGR: Bug fix for ExtensionArray groupby aggregation on non-numeric types #38982

BryanCutler · 2021-01-05T23:02:23Z

closes REGR: ExtensionArray aggregation on non-numeric types fails #38980
tests added / passed
Ensure all linting tests pass, see here for how to run them
whatsnew entry

BryanCutler · 2021-01-05T23:05:58Z

The exception handling that checks for this particular message in order to fall back to the non-Cython implementation is here: https://github.com/pandas-dev/pandas/blob/master/pandas/core/groupby/groupby.py#L1047

I can work on adding a unit test for this
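The fallback described above can be pictured with a minimal, pandas-free sketch. Everything here is an illustrative assumption rather than the pandas source: `FALLBACK_MARKER`, `fast_min`, `slow_min`, and `aggregate_min` are hypothetical names, and the marker text is a stand-in for whatever string the linked code actually checks.

```python
# Hypothetical marker text; the real message lives in the pandas source linked above.
FALLBACK_MARKER = "function is not implemented for this dtype"


def fast_min(values, message):
    """Stand-in for a compiled fast path that only understands numbers."""
    if not all(isinstance(v, (int, float)) for v in values):
        raise NotImplementedError(message)
    return min(values)


def slow_min(values):
    """Generic pure-Python path that works for any orderable values."""
    smallest = values[0]
    for value in values[1:]:
        if value < smallest:
            smallest = value
    return smallest


def aggregate_min(values, message=FALLBACK_MARKER):
    """Try the fast path; fall back only when the error carries the marker."""
    try:
        return fast_min(values, message)
    except NotImplementedError as err:
        if FALLBACK_MARKER not in str(err):
            raise  # unrecognized message: the fallback is never reached
    return slow_min(values)
```

In this sketch, a fast path that raises `NotImplementedError` with any other message propagates the error instead of falling back, which is the kind of failure a message-based check is exposed to.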

jbrockmendel · 2021-01-05T23:32:56Z

need tests

jorisvandenbossche · 2021-01-06T12:40:56Z

@BryanCutler this is a small example I was testing that worked before and is failing now, so you can use it as a base for writing tests if you like:

In [1]: df = pd.DataFrame({'key': ['a', 'a', 'b', 'b'], 'val': ['a', 'b', 'c', 'd']}, dtype="string")

In [2]: df.groupby("key").agg({'val': 'first'})  # or 'min'
Out[2]: 
    val
key    
a     a
b     c

Specifying 'sum' doesn't work for the "string" dtype, since it raises an error (for now we don't allow "sum" on a string type, though we probably should). But "sum" apparently takes a slightly different route and doesn't try the cython_agg path.

It can also be reproduced with the decimal test extension type that is included in our test suite for this purpose:

import pandas as pd
from pandas.tests.extension.decimal import DecimalArray, make_data

# Reproduces the regression with the decimal test extension type
df = pd.DataFrame({'key': ['a', 'a', 'b', 'b'], 'val': DecimalArray(make_data()[:4])})
df.groupby("key").agg({'val': 'min'})
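The "string" dtype example above could be turned into a small check along these lines. This is only a sketch: `first_and_min_by_key` is a hypothetical helper name, and it assumes a pandas version in which the regression is fixed.

```python
import pandas as pd


def first_and_min_by_key():
    """Aggregate a "string"-dtype column with 'first' and 'min' per group."""
    df = pd.DataFrame(
        {"key": ["a", "a", "b", "b"], "val": ["a", "b", "c", "d"]},
        dtype="string",
    )
    first = df.groupby("key").agg({"val": "first"})
    smallest = df.groupby("key").agg({"val": "min"})
    return first, smallest


first, smallest = first_and_min_by_key()
```

Both results should contain one row per key, with the first (and here also the smallest) value of each group.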

jreback

yep pls add tests

BryanCutler · 2021-01-06T19:53:04Z

Thanks for the code samples @jorisvandenbossche. I tried adding a base extension test to cover more types, and it looks good for all except decimal and json, which give the following error:

E           AssertionError: ExtensionArray are different
E           
E           Attribute "dtype" are different
E           [left]:  object
E           [right]: decimal

https://dev.azure.com/pandas-dev/pandas/_build/results?buildId=51924&view=logs&j=2d7fb38a-2053-50f3-a67c-09f6e91d3121&t=449937cc-3d50-56b5-5662-e489f41f1268&l=177

Not sure if it's a problem with wrapping the result back up into the extension array, or maybe these tests need a specialization. I'll look into it a little later.
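To make the dtype mismatch in the error concrete, here is a rough illustration of values that have lost their extension dtype and are then wrapped back up. It is an assumption-laden sketch: it uses the "string" dtype instead of the decimal or json test arrays, and it does not claim to show the actual cause of the failure above.

```python
import numpy as np
import pandas as pd

# An extension array and a plain object array holding two of its values,
# as a generic Python fallback might produce.
original = pd.array(["a", "b", "c", "d"], dtype="string")
fallback_values = np.array([original[0], original[2]], dtype=object)

# Wrapping the values back up with the original dtype restores the extension type.
restored = pd.array(fallback_values, dtype=original.dtype)
```

Comparing `fallback_values.dtype` with `restored.dtype` shows the same kind of object-versus-extension difference that the assertion reports.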

pandas/tests/extension/decimal/array.py

pandas/tests/extension/json/array.py

jreback

looks like it's failing on 32-bit

pandas/tests/extension/decimal/array.py

jorisvandenbossche

Thanks! Two small suggestions

doc/source/whatsnew/v1.2.1.rst

jorisvandenbossche · 2021-01-11T08:15:10Z

pandas/tests/extension/base/groupby.py

+        result = df.groupby("A").agg({"B": "first"}).B.array
+
+        expected = df["B"].iloc[[0, 2, 4, 7]].array
+
+        self.assert_extension_array_equal(result, expected)


Suggested change

-        result = df.groupby("A").agg({"B": "first"}).B.array
-        expected = df["B"].iloc[[0, 2, 4, 7]].array
-        self.assert_extension_array_equal(result, expected)
+        expected = df["B"].iloc[[0, 2, 4, 7]].array
+
+        result = df.groupby("A").agg({"B": "first"}).B.array
+        self.assert_extension_array_equal(result, expected)
+
+        result = df.groupby("A").agg("first").B.array
+        self.assert_extension_array_equal(result, expected)
+
+        result = df.groupby("A").first().B.array
+        self.assert_extension_array_equal(result, expected)

Those different spellings apparently take somewhat different code paths, given that the two added ones actually still worked, so it is worth testing all of them.
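Outside the test suite, the three spellings can be compared with a standalone sketch like the one below. The frame is made up for illustration (a "string"-dtype column `B` and grouping keys chosen so that rows 0, 2, 4, and 7 are the first of each group), and the result assumes a pandas version with the fix applied.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": [1, 1, 2, 2, 3, 3, 1, 4],
        "B": pd.array(list("abcdefgh"), dtype="string"),
    }
)

# Rows 0, 2, 4 and 7 hold the first value of groups 1, 2, 3 and 4.
expected = df["B"].iloc[[0, 2, 4, 7]].array

results = {
    "agg with dict": df.groupby("A").agg({"B": "first"}).B.array,
    "agg with name": df.groupby("A").agg("first").B.array,
    "first method": df.groupby("A").first().B.array,
}
```

Each entry in `results` should match `expected` in both values and dtype.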

pandas/tests/extension/base/groupby.py

BryanCutler · 2021-01-12T20:30:53Z

@jreback I addressed the feedback and tests have passed. Please take another look when you can, thanks!

jreback · 2021-01-13T13:18:45Z

thanks @BryanCutler very nice!

jreback · 2021-01-13T13:18:54Z

@meeseeksdev backport 1.2.x

jorisvandenbossche · 2021-01-13T13:21:39Z

Thanks @BryanCutler !

BryanCutler · 2021-01-14T19:08:51Z

Thank you all for reviewing!

BryanCutler mentioned this pull request Jan 5, 2021

Fix test failures with Pandas 1.2.0 CODAIT/text-extensions-for-pandas#157

Merged

jorisvandenbossche changed the title ~~[#38980] Bug fix for ExtensionArray groupby aggregation on non-numeric types~~ REGR: Bug fix for ExtensionArray groupby aggregation on non-numeric types Jan 6, 2021

jorisvandenbossche added the Groupby, Regression (functionality that used to work in a prior pandas version), and ExtensionArray (extending pandas with custom dtypes or arrays) labels Jan 6, 2021

jorisvandenbossche added this to the 1.2.1 milestone Jan 6, 2021

jorisvandenbossche mentioned this pull request Jan 6, 2021

REGR: ExtensionArray aggregation on non-numeric types fails #38980

Closed

3 tasks

jreback requested changes Jan 6, 2021

BryanCutler commented Jan 9, 2021

pandas/tests/extension/decimal/array.py (outdated)

BryanCutler commented Jan 9, 2021

pandas/tests/extension/json/array.py (outdated)

jreback requested changes Jan 9, 2021

pandas/tests/extension/decimal/array.py (outdated)

BryanCutler mentioned this pull request Jan 11, 2021

BUG: Groupby agg on decimal,json extension arrays changes dtype to object #39098

Closed

3 tasks

BryanCutler force-pushed the ea-agg-fallback-fix-38980 branch 2 times, most recently from e3e02a4 to ec4bdc3 Compare January 11, 2021 05:41

jorisvandenbossche reviewed Jan 11, 2021

jreback requested changes Jan 11, 2021

pandas/tests/extension/base/groupby.py (outdated)

BryanCutler added 5 commits January 11, 2021 22:01

Add error message including expected string to properly fallback

a2d5f95

Added base test for extension array groupby agg

b826e7d

Marked tests for decimal,json as xfail

579d8e2

Compare resulting DataFrame in tests

00181c9

Added whatsnew entry

6ba43dc

BryanCutler force-pushed the ea-agg-fallback-fix-38980 branch from dd77772 to 6ba43dc Compare January 12, 2021 06:05

jreback approved these changes Jan 13, 2021

jreback merged commit 396131a into pandas-dev:master Jan 13, 2021

meeseeksmachine mentioned this pull request Jan 13, 2021

Backport PR #38982 on branch 1.2.x (REGR: Bug fix for ExtensionArray groupby aggregation on non-numeric types) #39145

Merged

meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Jan 13, 2021

Backport PR pandas-dev#38982: REGR: Bug fix for ExtensionArray groupby aggregation on non-numeric types

5553c47

jreback pushed a commit that referenced this pull request Jan 13, 2021

Backport PR #38982: REGR: Bug fix for ExtensionArray groupby aggregation on non-numeric types (#39145) Co-authored-by: Bryan Cutler <[email protected]>

673b333

BryanCutler deleted the ea-agg-fallback-fix-38980 branch January 14, 2021 19:08

BryanCutler mentioned this pull request Jan 14, 2021

bert.align_bert_tokens_to_corpus_tokens() fails with Pandas 1.2.0 CODAIT/text-extensions-for-pandas#164

Closed

luckyvs1 pushed a commit to luckyvs1/pandas that referenced this pull request Jan 20, 2021

REGR: Bug fix for ExtensionArray groupby aggregation on non-numeric types (pandas-dev#38982)

7e0eb0d

BryanCutler mentioned this pull request Mar 19, 2021

bert align_bert_tokens_to_corpus_tokens fails with Pandas 1.2.0 CODAIT/text-extensions-for-pandas#163

Closed
