Skip to content

BUG: Fix min_count issue for groupby.sum #32914

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 30 commits into from
May 9, 2020
Merged
Show file tree
Hide file tree
Changes from 3 commits
Commits
Show all changes
30 commits
Select commit Hold shift + click to select a range
2a3f814
Add test
dsaxton Mar 22, 2020
3ce0384
Check for null
dsaxton Mar 22, 2020
e25c823
Release note
dsaxton Mar 22, 2020
2656bc3
Update and comment
dsaxton Mar 22, 2020
300b1e5
Update test
dsaxton Mar 23, 2020
e2e08e2
Hack
dsaxton Mar 23, 2020
ea59d54
Merge remote-tracking branch 'upstream/master' into mincount-sum
dsaxton Mar 24, 2020
ab1050f
Merge remote-tracking branch 'upstream/master' into mincount-sum
dsaxton Mar 25, 2020
98f0922
Merge remote-tracking branch 'upstream/master' into mincount-sum
dsaxton Mar 26, 2020
ea1ef36
Merge remote-tracking branch 'upstream/master' into mincount-sum
dsaxton Mar 26, 2020
cdf7dfb
Try a different casting
dsaxton Mar 27, 2020
b4edea3
No pd
dsaxton Mar 27, 2020
02fc55c
Only for add
dsaxton Mar 27, 2020
3044843
Merge remote-tracking branch 'upstream/master' into mincount-sum
dsaxton Mar 27, 2020
fe0ab93
Undo
dsaxton Mar 27, 2020
e6f5b4d
Release note
dsaxton Mar 22, 2020
c29dfad
Fix
dsaxton Mar 27, 2020
fb6b1d5
Space
dsaxton Mar 27, 2020
46eb601
Merge remote-tracking branch 'upstream/master' into mincount-sum
dsaxton Mar 27, 2020
2e65c14
Merge remote-tracking branch 'upstream/master' into mincount-sum
dsaxton Mar 28, 2020
a36131a
Merge remote-tracking branch 'upstream/master' into mincount-sum
dsaxton Apr 1, 2020
7c815b5
maybe_cast_result
dsaxton Apr 1, 2020
aeec4b0
float -> Int
dsaxton Apr 1, 2020
01c1d56
Merge remote-tracking branch 'upstream/master' into mincount-sum
dsaxton Apr 4, 2020
fc0f406
Less if
dsaxton Apr 4, 2020
47c19e2
Merge remote-tracking branch 'upstream/master' into mincount-sum
dsaxton Apr 5, 2020
8da2977
Merge remote-tracking branch 'upstream/master' into mincount-sum
dsaxton Apr 10, 2020
d74a905
Merge remote-tracking branch 'upstream/master' into mincount-sum
dsaxton Apr 14, 2020
ca3659e
Merge remote-tracking branch 'upstream/master' into mincount-sum
dsaxton May 4, 2020
0443d73
Merge remote-tracking branch 'upstream/master' into mincount-sum
simonjayhawkins May 7, 2020
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.1.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -364,6 +364,7 @@ Groupby/resample/rolling

- Bug in :meth:`GroupBy.apply` raises ``ValueError`` when the ``by`` axis is not sorted and has duplicates and the applied ``func`` does not mutate passed in objects (:issue:`30667`)
- Bug in :meth:`DataFrameGroupby.transform` produces incorrect result with transformation functions (:issue:`30918`)
- Bug in :meth:`DataFrameGroupBy.sum` and :meth:`SeriesGroupBy.sum` where a large negative number would be returned when the number of non-null values was below ``min_count`` for nullable integer dtypes (:issue:`32861`)

Reshaping
^^^^^^^^^
Expand Down
3 changes: 2 additions & 1 deletion pandas/core/groupby/ops.py
Original file line number Diff line number Diff line change
Expand Up @@ -553,7 +553,8 @@ def _cython_operation(
# Two options for avoiding this special case
# 1. mask-aware ops and avoid casting to float with NaN above
# 2. specify the result dtype when calling this method
result = result.astype("int64")
if not isna(result).any():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move this condition to the above elif & add an appropriate comment as 3.

result = result.astype("int64")

if kind == "aggregate" and self._filter_empty_groups and not counts.all():
assert result.ndim != 2
Expand Down
17 changes: 17 additions & 0 deletions pandas/tests/groupby/test_function.py
Original file line number Diff line number Diff line change
Expand Up @@ -1636,3 +1636,20 @@ def test_apply_to_nullable_integer_returns_float(values, function):
result = groups.agg([function])
expected.columns = MultiIndex.from_tuples([("b", function)])
tm.assert_frame_equal(result, expected)


def test_groupby_sum_below_mincount_nullable_integer():
# https://github.com/pandas-dev/pandas/issues/32861
df = pd.DataFrame({"a": [0, 1, 2], "b": [0, 1, 2], "c": [0, 1, 2]}, dtype="Int64")
grouped = df.groupby("a")
idx = pd.Index([0, 1, 2], dtype=object, name="a")

result = grouped["b"].sum(min_count=2)
expected = pd.Series([np.nan] * 3, index=idx, name="b")
tm.assert_series_equal(result, expected)

result = grouped.sum(min_count=2)
expected = pd.DataFrame(
{"b": [pd.NA] * 3, "c": [pd.NA] * 3}, dtype="Int64", index=idx
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not ideal but it seems we get NA in the DataFrame case but NaN for Series. Should I add a FIXME comment with a follow-up issue?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

huh? that seems odd, can you track this down

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you create an issue about this. I am not sure this is correct.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which part seems incorrect? The dtype of the index being object is maybe odd, but otherwise seems okay?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems your comment about is not correct, e.g. we have NA in all cases. is that not what you meant here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that comment was from before adding the conversion using maybe_cast_result (so the original test had NaN for Series but NA for DataFrame): 2a3f814

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok this is fine then

)
tm.assert_frame_equal(result, expected)