-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: incorrect rounding in groupby.cummin near int64 implementation bounds #40767
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
if (values == iNaT).any(): | ||
values = ensure_float64(values) | ||
else: | ||
values = ensure_int_or_float(values) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these iNaT checks are necessary because in algos like group_min_max
, it assumes that it is impossible for an integer array to have iNaT (because it assumes datetimelike=True
and safety in setting missing values as NPY_NAT
).
eg on this branch:
ser = pd.Series([1, iNaT])
print(ser.groupby([1, 1]).max(min_count=2))
gives
1 -9223372036854775808
dtype: int64
because the iNaT is interpreted as NaN, so min_count
isn't reached. (Then the NaN is set with iNaT, but not interpreted as NaN anymore)
Forcing mask usage (#40651 (comment)) would be another way to handle this more robustly (since to specify missing values, the mask itself can just be modified inplace, precluding the need for the existing iNaT workarounds to signify missing values in the algo result)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because it assumes datetimelike=True
isn't this the underlying problem in this example?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the reason datetimelike isn't used in group_min_max
(and some other groupby algos) is because of the additional problem that there is no way to encode a missing value in an integer (not datetimelike) array.
So the current logic casts integer to float if iNaT is present, so that iNaT can be used to signify missing values for integer data (regardless of datetimelike status).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that there is no way to encode a missing value in an integer (not datetimelike) array.
i think we dont need to encode it because we have counts==0
(or counts<min_count) to convey that information
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we use masks we could avoid this awkwardness but that will need some refactoring
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
its not too hard to edit group_min_max to handle this example (will push in a bit). We can improve matters a bit by using counts < min_count
as a mask more consistently, but that won't do anything about int64->float64 conversions being lossy. a full solution to that would be one of a) explicitly be OK with the lossiness, b) detect when the conversion would be lossy and cast to either object or float128, c) return an IntegerArray
why would we need to cast? can we not simply use a mask and leave the values as int? |
that was option c in my comment, return an IntegerArray. For that we'd need IntegerArray to be 1) not opt-in and 2a) support 2D (#38992) or 2b) go full ArrayManager |
maybe u misunderstand in the cython code we use the incoming dtype and a mask no casting required and you can always return the original type (except of course you may have to cast to floats if make nans but that would be rare) |
suggested phrasing: "maybe we're talking past each other" or "maybe there's a miscommunication"
the casting under discussion is on the back end after the cython call. |
if is_integer_dtype(result.dtype) and not is_datetimelike: | ||
cutoff = max(1, min_count) | ||
empty_groups = counts < cutoff | ||
if empty_groups.any(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this check needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if empty_groups.any() then we need to mask in order to cast to float64
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
though this behavior is really surprising, though likely not hit practically. we should move aggressively to return Int dtypes here. Yes this is a breaking change but fixes these types of value dependent behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should move aggressively to return Int dtypes here
we'd need 2D support for Int dtypes for me to consider getting on board with this, xref #38992
pandas/core/groupby/ops.py
Outdated
assert result.ndim != 2 | ||
result = result[counts > 0] | ||
if kind == "aggregate": | ||
# i.e. counts is defined |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you reference either this PR & put in an expl here, this is is really unexpected and non-obvious what is happening. ping on green.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ping
this is a very slight user facing change right? can you add a whatsnew note (ref this PR) |
thanks @jbrockmendel |
xref #40719