BUG: groupby agg fails silently with mixed dtypes #43213

debnathshoham · 2021-08-25T18:49:07Z

closes BUG: groupby aggregation methods produce empty DataFrame with mixed DataTypes #43209
tests added / passed
Ensure all linting tests pass, see here for how to run them
whatsnew entry

simonjayhawkins · 2021-08-25T19:01:39Z

pandas/tests/groupby/test_groupby.py

+    ).astype({("a", "j"): dtype, ("b", "j"): dtype})
+    result = df.groupby(level=1, axis=1).sum()
+    expected = DataFrame(
+        [[5, 7, 9], [5, 7, 9], [5, 7, 9]], columns=["i", "j", "k"], dtype=dtype


is the expected also a mixed dtype dataframe?

here the expected would be either entirely Int64 or int64

is this test representative of the reported issue?

yes, it's the exact same example (mixed int one)

pandas/tests/groupby/test_groupby.py

pandas/core/groupby/generic.py

pandas/tests/groupby/test_groupby.py

jreback · 2021-08-31T14:57:55Z

cc @rhshadrach @jbrockmendel for a look

doc/source/whatsnew/v1.3.3.rst

pandas/core/groupby/generic.py

jreback · 2021-08-31T22:42:07Z

ok 2 comments @debnathshoham pls ping when resolved & green

debnathshoham · 2021-09-01T07:18:09Z

@jreback added comments to both places + Green

simonjayhawkins

Thanks @debnathshoham I'm not sure that this should be backported as I don't understand the reasoning for the change in behavior. see #43213 (comment)

simplified code sample

import pandas as pd

print(pd.__version__)
df = pd.DataFrame([[1, 2, 3, 4, 5, 6]] * 3)
df.columns = pd.MultiIndex.from_product([["a", "b"], ["i", "j", "k"]])
for col in df.columns:
    if col[-1] == "j":
        df[col] = df[col].astype("Int64")
print("mixed int:")
result = df.groupby(level=1, axis=1).sum()
print(result)
print(result.dtypes)

on 1.2.5

1.2.5
mixed int:
   i    j  k
0  5  7.0  9
1  5  7.0  9
2  5  7.0  9
i      int64
j    float64
k      int64
dtype: object

this PR

1.4.0.dev0+537.g8d1bfb1cb9
mixed int:
   i  j  k
0  5  7  9
1  5  7  9
2  5  7  9
i    Int64
j    Int64
k    Int64
dtype: object

Can you explain why the dtypes of columns i and k are now Int64?

debnathshoham · 2021-09-01T12:38:13Z

sorry, I didn't follow it in the first comment.

Actually in master, I don't think dtype of the output is being maintained in any of these funcs
'count', 'sum', 'std', 'var', 'sem', 'mean', 'median', 'prod', 'min', 'max'

Is the expectation that the output will have the original dtypes, i.e. int64, Int64, int64 ?

simonjayhawkins · 2021-09-01T13:11:03Z

Is the expectation that the output will have the original dtypes, i.e. int64, Int64, int64 ?

for the code sample, that would be reasonable, although returning float64 for the Int64 columns would probably also be acceptable as that was the return type before the regression.

How do the changes in this PR relate to the changes in #41706, the PR that caused the regression? (Or, to put it a different way: which change in #41706 caused the regression, and how does this PR rectify that?)
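As a point of reference for the dtype question above, here is a minimal sketch that computes the same per-label sums without groupby, by adding the matching columns one Series at a time. It is an illustration only and not what this PR implements; `sum_by_inner_level` is a hypothetical helper name. Since int64 + int64 stays int64 and Int64 + Int64 stays Int64, the result carries the "original dtypes" (int64, Int64, int64) mentioned in the discussion.

```python
import pandas as pd

# Same frame as the simplified code sample: top-level groups "a" and "b",
# each with columns "i", "j", "k"; only the "j" columns are nullable Int64.
df = pd.DataFrame([[1, 2, 3, 4, 5, 6]] * 3)
df.columns = pd.MultiIndex.from_product([["a", "b"], ["i", "j", "k"]])
for col in df.columns:
    if col[-1] == "j":
        df[col] = df[col].astype("Int64")


def sum_by_inner_level(frame):
    """Hypothetical helper: add the columns that share a level-1 label."""
    out = {}
    for label in frame.columns.get_level_values(1).unique():
        cols = [c for c in frame.columns if c[1] == label]
        total = frame[cols[0]]
        for c in cols[1:]:
            total = total + frame[c]  # Series addition keeps the shared dtype
        out[label] = total
    return pd.DataFrame(out)


reference = sum_by_inner_level(df)
print(reference)
print(reference.dtypes)
```

Comparing a groupby result against `reference.dtypes` is one way to state the expectation precisely when reviewing a change like this.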

debnathshoham · 2021-09-01T15:59:29Z

Before #41706, the actual calculation was happening in agg_general in result = self.aggregate(lambda x: npfunc(x, axis=self.axis)).
Honestly, I didn't try to undo what that patch did, I went on to fix the inconsistency in the result coming post that patch.
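The pre-#41706 pattern described above, aggregating with a lambda that calls a NumPy function, can be sketched with public API on an ordinary row-wise groupby. This is only an illustration under assumptions: the frame and its column names are made up, and the snippet does not reproduce the internal agg_general code path.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one plain int64 column and one nullable Int64 column.
df = pd.DataFrame(
    {
        "key": ["x", "y", "x"],
        "a": [1, 2, 3],
        "b": pd.array([4, 5, 6], dtype="Int64"),
    }
)

# Each group's column is handed to the lambda as a Series;
# np.sum defers to Series.sum for pandas objects.
result = df.groupby("key").aggregate(lambda x: np.sum(x, axis=0))
print(result)
print(result.dtypes)
```

The printed dtypes are deliberately not stated here, since they are exactly the version-dependent behaviour this thread is about.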

rhshadrach · 2021-09-02T01:23:07Z

Agreed @debnathshoham - on master, the dtypes in @simonjayhawkins' example come out as all float. What you have here is certainly an improvement over that, but it doesn't fully fix the regression. It's unclear to me if this patch is a good step in fully resolving the regression, or if fully solving it would make this patch obsolete. Will need to investigate more.

jbrockmendel · 2021-09-14T18:48:38Z

Doesn't the PR that changes DataFrame.transpose fix this? That seems like a much better option than reverting #41706, which fixed a bunch of bugs.

Again shout it from the rooftops: this is caused by the lack of 2D EAs.
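To make the transpose point concrete, the sketch below contrasts a frame whose columns all share one nullable dtype with a frame that mixes int64 and Int64. It is a hedged illustration with made-up frames: what the mixed case prints depends on the pandas version (see the transpose issue #43337 referenced in this thread), so no output is claimed for it.

```python
import pandas as pd

# All columns share one nullable dtype: transpose can rebuild each row
# as a 1D Int64 array, so the dtype survives.
homogeneous = pd.DataFrame(
    {"a": pd.array([1, 2], dtype="Int64"), "b": pd.array([3, 4], dtype="Int64")}
)
print(homogeneous.T.dtypes)

# Mixed int64 / Int64 columns: the dtypes after transposing are
# version-dependent, so print them rather than assume.
mixed = pd.DataFrame({"a": [1, 2], "b": pd.array([3, 4], dtype="Int64")})
transposed = mixed.T
print(transposed.dtypes)

# Transposing back restores the values but not necessarily the dtypes.
round_trip = transposed.T
print(round_trip.dtypes)
```

Any column-wise (axis=1) aggregation that is implemented by transposing first inherits whatever dtype the transpose produced, which is why the two topics are linked in this discussion.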

debnathshoham · 2021-09-14T20:20:36Z

yes, but that was casting the result df to the common_dtype (which wasn't matching 1.2.5 behaviour)

jreback · 2021-09-16T18:57:08Z

can you show what these test functions return in 1.2.5 vs what this PR gives. and show any difference in a comment.

debnathshoham · 2021-09-16T19:23:20Z

added comments in the tests.
To summarise, for df with mixed dtypes like below.
For sum - Columns with Int64 were cast to float64 in 1.2.5, whereas in this PR they are cast to int64.
For std - ValueError: Length mismatch in 1.2.5. In this PR they come out as float64.
For var - DataError: No numeric types to aggregate in 1.2.5. In this PR they come out as float64.

In [32]:     df = DataFrame(
    ...:         np.arange(12).reshape(3, 4),
    ...:         index=Index([0, 1, 0], name="y"),
    ...:         columns=Index([10, 20, 10, 20], name="x"),
    ...:         dtype="int64",
    ...:     ).astype({10: "Int64"})

In [33]: df.dtypes
Out[33]: 
x
10    Int64
20    int64
10    Int64
20    int64
dtype: object
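A comparison like the one summarised above can be collected mechanically. The sketch below runs sum, std and var on the mixed-dtype frame from the earlier code sample and records either the result dtypes or the exception raised. `build_mixed_frame` and `report_dtypes` are hypothetical helper names, and the `axis=1` groupby argument is an assumption about the installed version: it is deprecated or unavailable in newer pandas, in which case the error is simply recorded.

```python
import warnings

import pandas as pd


def build_mixed_frame():
    """Frame from the simplified code sample: "j" columns are Int64."""
    df = pd.DataFrame([[1, 2, 3, 4, 5, 6]] * 3)
    df.columns = pd.MultiIndex.from_product([["a", "b"], ["i", "j", "k"]])
    for col in df.columns:
        if col[-1] == "j":
            df[col] = df[col].astype("Int64")
    return df


def report_dtypes(funcs=("sum", "std", "var")):
    """Map each aggregation name to its result dtypes or the error raised."""
    report = {}
    for name in funcs:
        df = build_mixed_frame()
        try:
            with warnings.catch_warnings():
                # Silence deprecation warnings on versions that still accept axis=1.
                warnings.simplefilter("ignore")
                grouped = df.groupby(level=1, axis=1)
                result = getattr(grouped, name)()
            report[name] = {col: str(dtype) for col, dtype in result.dtypes.items()}
        except Exception as exc:
            report[name] = f"{type(exc).__name__}: {exc}"
    return report


report = report_dtypes()
print(pd.__version__)
for name, outcome in report.items():
    print(name, "->", outcome)
```

Running the same script under two pandas installations and diffing the output gives the kind of before/after table requested in the review.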

debnathshoham · 2021-09-25T18:39:22Z

@jreback @simonjayhawkins
wondering if you got a chance to look into this?
xref - #43501 (I think this is related)

rhshadrach

Minor request on the tests; otherwise this looks great to me. The code change is small and makes sense, and the dtypes are improved over 1.2.5.

rhshadrach · 2021-09-29T01:57:31Z

pandas/tests/groupby/test_groupby.py

+        # 1.2.5: ValueError: Length mismatch: Expected axis
+        # has 0 elements, new values have 3 elements
+        ("var", [4.5] * 3, int, {"i": float, "j": float, "k": float}),
+        # 1.2.5: DataError: No numeric types to aggregate
+        ("sum", [5, 7, 9], "Int64", {"j": "int64"}),
+        # 1.2.5: j:float64
+        ("std", [4.5 ** 0.5] * 3, "Int64", {"i": float, "j": float, "k": float}),
+        # 1.2.5: ValueError: Length mismatch: Expected axis
+        # has 0 elements, new values have 3 elements
+        ("var", [4.5] * 3, "Int64", {"i": "float64", "j": "float64", "k": "float64"}),
+        # 1.2.5: DataError: No numeric types to aggregate


Can you remove the comments here - appreciate the extra detail, but they shouldn't be part of a test; surfacing this in a comment on github directly would be great (and you already have another comment that does something similar - thanks for that as well).

jreback

request & pls merge master ping on green.

jreback · 2021-09-29T13:02:57Z

pandas/tests/groupby/test_groupby.py

@@ -2509,3 +2509,53 @@ def test_rolling_wrong_param_min_period():
    result_error_msg = r"__init__\(\) got an unexpected keyword argument 'min_period'"
    with pytest.raises(TypeError, match=result_error_msg):
        test_df.groupby("name")["val"].rolling(window=2, min_period=1).sum()
+
+
+@pytest.mark.parametrize(


move these to pandas/tests/groupby/aggregate/test_aggregate.py (or maybe test_cython) see where similar are.

debnathshoham · 2021-09-29T18:14:09Z

@jreback Green

jreback · 2021-09-29T22:47:22Z

thanks @debnathshoham

jreback · 2021-09-29T22:47:29Z

@meeseeksdev backport 1.3.x

lumberbot-app · 2021-09-29T22:47:53Z

Something went wrong ... Please have a look at my logs.

jbrockmendel · 2021-10-19T22:02:01Z

pandas/core/groupby/groupby.py

+                if self.axis:
+                    obj = self._obj_with_exclusions.T
+                else:
+                    obj = self._obj_with_exclusions


I'm a little late here, but this could potentially use _get_data_to_aggregate

debnathshoham added 2 commits August 26, 2021 00:14

BUG: groupby agg fails silently with mixed dtypes

91aa285

updated whatsnew

9b9b7af

simonjayhawkins added the Dtype Conversions (unexpected or buggy dtype conversions), Groupby, and Regression (functionality that used to work in a prior pandas version) labels Aug 25, 2021

simonjayhawkins added this to the 1.3.3 milestone Aug 25, 2021

simonjayhawkins reviewed Aug 25, 2021

added tests for var and std

d968e57

jreback requested changes Aug 30, 2021

pandas/tests/groupby/test_groupby.py (outdated)

debnathshoham added 3 commits August 30, 2021 20:13

reorganized tests

46a29e0

Merge branch 'master' into gh43209

195db31

specified int64 in result

16e28db

jreback requested changes Aug 31, 2021

pandas/core/groupby/generic.py (outdated)

pandas/tests/groupby/test_groupby.py (outdated)

added copy=False to astype

854ecda

debnathshoham mentioned this pull request Aug 31, 2021

BUG: groupby.std() casts to float64 #43330

Closed

3 tasks

jreback requested changes Aug 31, 2021

doc/source/whatsnew/v1.3.3.rst (outdated)

debnathshoham added 2 commits September 1, 2021 01:26

updated whatsnew

7dd27f3

typo corrected

309ee59

jbrockmendel reviewed Aug 31, 2021

pandas/core/groupby/generic.py (outdated)

rhshadrach mentioned this pull request Aug 31, 2021

BUG: transpose unnecessarily returning object #43337

Open

jreback approved these changes Aug 31, 2021

added issue ref

8d1bfb1

simonjayhawkins requested changes Sep 1, 2021

changes wrt 1.2.5

00838be

debnathshoham added 3 commits September 17, 2021 00:57

Merge branch 'master' into gh43209

a04d021

Merge branch 'master' into gh43209

c9d6658

Merge branch 'master' into gh43209

5db7155

Merge branch 'master' into gh43209

5ccf385

rhshadrach requested changes Sep 29, 2021

debnathshoham added 3 commits September 29, 2021 12:18

Merge branch 'master' into gh43209

9afd465

removed comments highlighting diff with 1.2.5 from test

600b71e

removed comments from test2

fe13278

debnathshoham requested a review from rhshadrach September 29, 2021 08:30

jreback requested changes Sep 29, 2021

debnathshoham added 2 commits September 29, 2021 22:34

moved tests to test_aggregate

1aaf326

Merge branch 'master' into gh43209

4ba8511

jreback approved these changes Sep 29, 2021

jreback merged commit 2e5ed86 into pandas-dev:master Sep 29, 2021

meeseeksmachine mentioned this pull request Sep 29, 2021

Backport PR #43213 on branch 1.3.x (BUG: groupby agg fails silently with mixed dtypes) #43808

Merged

meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Sep 29, 2021

Backport PR pandas-dev#43213: BUG: groupby agg fails silently with mixed dtypes

248fb90

debnathshoham deleted the gh43209 branch September 30, 2021 04:30

phofl pushed a commit that referenced this pull request Sep 30, 2021

Backport PR #43213: BUG: groupby agg fails silently with mixed dtypes (#43808) Co-authored-by: Shoham Debnath <[email protected]>

23ed54f

gasparitiago pushed a commit to gasparitiago/pandas that referenced this pull request Oct 9, 2021

BUG: groupby agg fails silently with mixed dtypes (pandas-dev#43213)

6998bc4

jbrockmendel reviewed Oct 19, 2021
