BUG: incorrect rounding in groupby.cummin near int64 implementation bounds #40767
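Background for the title: float64 has only a 53-bit significand, so not every int64 value is representable as a float, and near the int64 bounds the rounding error is large. A minimal standalone illustration (no pandas needed):

```python
# Near the int64 upper bound, casting to float64 rounds to the nearest
# representable double, which can land past the int64 range entirely.
x = (1 << 63) - 2           # 9223372036854775806, two below the int64 max
print(int(float(x)))        # 9223372036854775808 -- rounded up past the bound
print(int(float(x)) == x)   # False: the round trip is lossy
```

This is the rounding the bug title refers to: any code path that silently casts int64 results to float64 can return a value that differs from every value in the input.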
```diff
@@ -20,7 +20,6 @@
 from pandas._libs import (
     NaT,
-    iNaT,
     lib,
 )
 import pandas._libs.groupby as libgroupby
```
```diff
@@ -663,12 +662,7 @@ def _cython_operation(
         elif is_bool_dtype(dtype):
             values = ensure_int_or_float(values)
         elif is_integer_dtype(dtype):
-            # we use iNaT for the missing value on ints
-            # so pre-convert to guard this condition
-            if (values == iNaT).any():
-                values = ensure_float64(values)
-            else:
-                values = ensure_int_or_float(values)
+            values = ensure_int_or_float(values)
         elif is_numeric:
             if not is_complex_dtype(dtype):
                 values = ensure_float64(values)
```
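The branch removed above pre-cast the whole array to float64 whenever any value happened to equal the iNaT sentinel. A hypothetical standalone sketch of that value-dependent behavior (the function name is mine, not pandas'):

```python
import numpy as np

iNaT = np.iinfo(np.int64).min  # pandas' int64 NaT sentinel

def old_pre_convert(values):
    # Sketch of the removed guard: any single value equal to the sentinel
    # forced the entire array to float64, so the output dtype depended on
    # the data values, not just the input dtype.
    if (values == iNaT).any():
        return values.astype("float64")
    return values

print(old_pre_convert(np.array([1, 2, 3], dtype=np.int64)).dtype)     # int64
print(old_pre_convert(np.array([1, iNaT, 3], dtype=np.int64)).dtype)  # float64
```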
```diff
@@ -686,20 +680,36 @@ def _cython_operation(
         result = maybe_fill(np.empty(out_shape, dtype=out_dtype))
         if kind == "aggregate":
             counts = np.zeros(ngroups, dtype=np.int64)
-            func(result, counts, values, comp_ids, min_count)
+            if how in ["min", "max"]:
+                func(
+                    result,
+                    counts,
+                    values,
+                    comp_ids,
+                    min_count,
+                    is_datetimelike=is_datetimelike,
+                )
+            else:
+                func(result, counts, values, comp_ids, min_count)
         elif kind == "transform":
             # TODO: min_count
             func(result, values, comp_ids, ngroups, is_datetimelike, **kwargs)

-        if is_integer_dtype(result.dtype) and not is_datetimelike:
-            mask = result == iNaT
-            if mask.any():
-                result = result.astype("float64")
-                result[mask] = np.nan
-
-        if kind == "aggregate" and self._filter_empty_groups and not counts.all():
-            assert result.ndim != 2
-            result = result[counts > 0]
+        if kind == "aggregate":
+            # i.e. counts is defined.  Locations where count < min_count
+            # need to have the result set to np.nan, which may require
+            # casting, see GH#40767
+            if is_integer_dtype(result.dtype) and not is_datetimelike:
+                cutoff = max(1, min_count)
+                empty_groups = counts < cutoff
+                if empty_groups.any():
+                    # Note: this conversion could be lossy, see GH#40767
+                    result = result.astype("float64")
+                    result[empty_groups] = np.nan
+
+            if self._filter_empty_groups and not counts.all():
+                assert result.ndim != 2
+                result = result[counts > 0]

         result = result.T
```

Review comment (on `if empty_groups.any():`): why is this check needed?

Reply: if `empty_groups.any()` then we need the cast to float64 so that the empty groups can be masked with NaN.

Reply: though this behavior is really surprising (though likely not hit in practice), we should move aggressively to return Int dtypes here. Yes, this is a breaking change, but it fixes these kinds of value-dependent behavior.

Reply: we'd need 2D support for Int dtypes for me to consider getting on board with this, xref #38992.
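The new masking step above can be sketched as a standalone function (the name and signature are mine, for illustration only): groups observed fewer than `max(1, min_count)` times get NaN, which forces an integer result to float64 and is where the potentially lossy cast enters.

```python
import numpy as np

def mask_empty_groups(result, counts, min_count):
    # Hypothetical standalone version of the masking step in the diff:
    # any group with fewer than max(1, min_count) observations becomes NaN,
    # which requires casting an integer result to float64 (potentially
    # lossy near the int64 bounds, see GH#40767).
    cutoff = max(1, min_count)
    empty_groups = counts < cutoff
    if empty_groups.any():
        if np.issubdtype(result.dtype, np.integer):
            result = result.astype("float64")
        result[empty_groups] = np.nan
    return result

out = mask_empty_groups(
    np.array([5, 7, 9], dtype=np.int64),
    counts=np.array([2, 0, 3]),
    min_count=-1,  # pandas' conventional "no minimum" sentinel
)
# only the unobserved group (index 1) becomes NaN; the rest survive as floats
```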
Review comment:

I think these iNaT checks are necessary because algos like `group_min_max` assume it is impossible for an integer array to contain iNaT (they assume `datetimelike=True`, and that setting missing values to `NPY_NAT` is therefore safe). E.g. on this branch, an example with a `min_count` gives the wrong result, because the iNaT is interpreted as NaN, so `min_count` isn't reached (the NaN is then set with iNaT, but no longer interpreted as NaN).

Forcing mask usage (#40651 (comment)) would be another way to handle this more robustly: to specify missing values, the mask itself can just be modified in place, precluding the need for the existing iNaT workarounds to signify missing values in the algo result.
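The sentinel collision this comment describes can be shown in a few lines: a genuine int64 data point that happens to equal the sentinel is indistinguishable from a missing value once an algorithm treats iNaT as "missing".

```python
import numpy as np

iNaT = np.iinfo(np.int64).min  # -9223372036854775808, reused as the NaT sentinel

# values[0] is real data, but any sentinel-based missing-value check
# cannot tell it apart from an actual missing entry:
values = np.array([iNaT, 5], dtype=np.int64)
print(values == iNaT)  # [ True False] -- index 0 looks missing but isn't
```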
Reply: isn't this the underlying problem in this example?
Reply: I think the reason `datetimelike` isn't used in `group_min_max` (and some other groupby algos) is the additional problem that there is no way to encode a missing value in an integer (non-datetimelike) array. So the current logic casts integer to float if iNaT is present, so that iNaT can be used to signify missing values for integer data regardless of datetimelike status.
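The "Int dtypes" alternative mentioned elsewhere in the thread sidesteps this encoding problem: pandas' nullable integer arrays carry an explicit validity mask, so missing values never have to be smuggled through an in-range sentinel like iNaT.

```python
import pandas as pd

# A nullable Int64 extension array: missingness lives in a separate mask,
# so the integer data itself needs no sentinel and no float cast.
arr = pd.array([1, None, 3], dtype="Int64")
print(arr.isna())  # [False  True False]
print(arr.dtype)   # Int64
```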
Reply: I think we don't need to encode it, because we have `counts == 0` (or `counts < min_count`) to convey that information.
Reply: if we use masks we could avoid this awkwardness, but that will need some refactoring.
Reply: it's not too hard to edit `group_min_max` to handle this example (will push in a bit). We can improve matters a bit by using `counts < min_count` as a mask more consistently, but that won't do anything about int64 -> float64 conversions being lossy. A full solution would be one of: (a) explicitly be OK with the lossiness, (b) detect when the conversion would be lossy and cast to either object or float128, or (c) return an IntegerArray.
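Option (b) from the last reply can be sketched directly; the function name is mine, not from the PR. The comparison goes through Python ints so the check itself cannot overflow.

```python
import numpy as np

def float64_cast_is_lossy(values):
    # Hypothetical detector for option (b): would casting these int64
    # values to float64 change any of them?  Python ints are unbounded,
    # so comparing int(f) to int(v) avoids any int64 round-trip overflow.
    return any(int(f) != int(v) for f, v in zip(values.astype("float64"), values))

print(float64_cast_is_lossy(np.array([1, 2, 3], dtype=np.int64)))    # False
print(float64_cast_is_lossy(np.array([2**63 - 2], dtype=np.int64)))  # True
```

A real implementation would likely vectorize this, but the scalar loop makes the criterion explicit.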