Skip to content

BUG: Fix min_count issue for groupby.sum #32914

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 30 commits into from
May 9, 2020
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
30 commits
Select commit Hold shift + click to select a range
2a3f814
Add test
dsaxton Mar 22, 2020
3ce0384
Check for null
dsaxton Mar 22, 2020
e25c823
Release note
dsaxton Mar 22, 2020
2656bc3
Update and comment
dsaxton Mar 22, 2020
300b1e5
Update test
dsaxton Mar 23, 2020
e2e08e2
Hack
dsaxton Mar 23, 2020
ea59d54
Merge remote-tracking branch 'upstream/master' into mincount-sum
dsaxton Mar 24, 2020
ab1050f
Merge remote-tracking branch 'upstream/master' into mincount-sum
dsaxton Mar 25, 2020
98f0922
Merge remote-tracking branch 'upstream/master' into mincount-sum
dsaxton Mar 26, 2020
ea1ef36
Merge remote-tracking branch 'upstream/master' into mincount-sum
dsaxton Mar 26, 2020
cdf7dfb
Try a different casting
dsaxton Mar 27, 2020
b4edea3
No pd
dsaxton Mar 27, 2020
02fc55c
Only for add
dsaxton Mar 27, 2020
3044843
Merge remote-tracking branch 'upstream/master' into mincount-sum
dsaxton Mar 27, 2020
fe0ab93
Undo
dsaxton Mar 27, 2020
e6f5b4d
Release note
dsaxton Mar 22, 2020
c29dfad
Fix
dsaxton Mar 27, 2020
fb6b1d5
Space
dsaxton Mar 27, 2020
46eb601
Merge remote-tracking branch 'upstream/master' into mincount-sum
dsaxton Mar 27, 2020
2e65c14
Merge remote-tracking branch 'upstream/master' into mincount-sum
dsaxton Mar 28, 2020
a36131a
Merge remote-tracking branch 'upstream/master' into mincount-sum
dsaxton Apr 1, 2020
7c815b5
maybe_cast_result
dsaxton Apr 1, 2020
aeec4b0
float -> Int
dsaxton Apr 1, 2020
01c1d56
Merge remote-tracking branch 'upstream/master' into mincount-sum
dsaxton Apr 4, 2020
fc0f406
Less if
dsaxton Apr 4, 2020
47c19e2
Merge remote-tracking branch 'upstream/master' into mincount-sum
dsaxton Apr 5, 2020
8da2977
Merge remote-tracking branch 'upstream/master' into mincount-sum
dsaxton Apr 10, 2020
d74a905
Merge remote-tracking branch 'upstream/master' into mincount-sum
dsaxton Apr 14, 2020
ca3659e
Merge remote-tracking branch 'upstream/master' into mincount-sum
dsaxton May 4, 2020
0443d73
Merge remote-tracking branch 'upstream/master' into mincount-sum
simonjayhawkins May 7, 2020
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.1.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -714,6 +714,7 @@ Groupby/resample/rolling
- Bug in :meth:`DataFrameGroupby.transform` produces incorrect result with transformation functions (:issue:`30918`)
- Bug in :meth:`GroupBy.count` causes segmentation fault when grouped-by column contains NaNs (:issue:`32841`)
- Bug in :meth:`DataFrame.groupby` and :meth:`Series.groupby` produces inconsistent type when aggregating Boolean series (:issue:`32894`)
- Bug in :meth:`DataFrameGroupBy.sum` and :meth:`SeriesGroupBy.sum` where a large negative number would be returned when the number of non-null values was below ``min_count`` for nullable integer dtypes (:issue:`32861`)
- Bug in :meth:`SeriesGroupBy.quantile` raising on nullable integers (:issue:`33136`)
- Bug in :meth:`SeriesGroupBy.first`, :meth:`SeriesGroupBy.last`, :meth:`SeriesGroupBy.min`, and :meth:`SeriesGroupBy.max` returning floats when applied to nullable Booleans (:issue:`33071`)
- Bug in :meth:`DataFrameGroupBy.agg` with dictionary input losing ``ExtensionArray`` dtypes (:issue:`32194`)
Expand Down
15 changes: 4 additions & 11 deletions pandas/core/groupby/ops.py
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,7 @@
from pandas.errors import AbstractMethodError
from pandas.util._decorators import cache_readonly

from pandas.core.dtypes.cast import maybe_cast_result
from pandas.core.dtypes.common import (
ensure_float64,
ensure_int64,
Expand Down Expand Up @@ -548,17 +549,6 @@ def _cython_operation(
if mask.any():
result = result.astype("float64")
result[mask] = np.nan
elif (
how == "add"
and is_integer_dtype(orig_values.dtype)
and is_extension_array_dtype(orig_values.dtype)
):
# We need this to ensure that Series[Int64Dtype].resample().sum()
# remains int64 dtype.
# Two options for avoiding this special case
# 1. mask-aware ops and avoid casting to float with NaN above
# 2. specify the result dtype when calling this method
result = result.astype("int64")

if kind == "aggregate" and self._filter_empty_groups and not counts.all():
assert result.ndim != 2
Expand All @@ -582,6 +572,9 @@ def _cython_operation(
elif is_datetimelike and kind == "aggregate":
result = result.astype(orig_values.dtype)

if is_extension_array_dtype(orig_values.dtype):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you move line 568 (the is_extension_array_dtype) before line 558, can we then completely remove lines 558-556? (these are already extension dtypes).

Copy link
Member Author

@dsaxton dsaxton Apr 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it looks like we still get a lot of failures. I could add back the integer check to avoid trying to cast things twice?

result = maybe_cast_result(result=result, obj=orig_values, how=how)

return result, names

def aggregate(
Expand Down
17 changes: 17 additions & 0 deletions pandas/tests/groupby/test_function.py
Original file line number Diff line number Diff line change
Expand Up @@ -1639,3 +1639,20 @@ def test_apply_to_nullable_integer_returns_float(values, function):
result = groups.agg([function])
expected.columns = MultiIndex.from_tuples([("b", function)])
tm.assert_frame_equal(result, expected)


def test_groupby_sum_below_mincount_nullable_integer():
# https://github.com/pandas-dev/pandas/issues/32861
df = pd.DataFrame({"a": [0, 1, 2], "b": [0, 1, 2], "c": [0, 1, 2]}, dtype="Int64")
grouped = df.groupby("a")
idx = pd.Index([0, 1, 2], dtype=object, name="a")

result = grouped["b"].sum(min_count=2)
expected = pd.Series([pd.NA] * 3, dtype="Int64", index=idx, name="b")
tm.assert_series_equal(result, expected)

result = grouped.sum(min_count=2)
expected = pd.DataFrame(
{"b": [pd.NA] * 3, "c": [pd.NA] * 3}, dtype="Int64", index=idx
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not ideal but it seems we get NA in the DataFrame case but NaN for Series. Should I add a FIXME comment with a follow-up issue?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

huh? that seems odd, can you track this down

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you create an issue about this. I am not sure this is correct.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which part seems incorrect? The dtype of the index being object is maybe odd, but otherwise seems okay?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems your comment about is not correct, e.g. we have NA in all cases. is that not what you meant here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that comment was from before adding the conversion using maybe_cast_result (so the original test had NaN for Series but NA for DataFrame): 2a3f814

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok this is fine then

)
tm.assert_frame_equal(result, expected)