Make group_mean compatible with NaT #43467
Conversation
Looks like there is a conflict with 6f4c382, will add.
You'll need to merge master and resolve the conflict.
force-pushed from 3f271b2 to f023f4b
@jbrockmendel That's not a merge conflict but a semantic conflict:
force-pushed from 945f466 to 85f2d92
I have a couple of questions for the review:
Master is currently broken; CI cannot pass:
results in
Waiting on green master before submitting the PR for review.
@AlexeyGy Thanks for the PR. The PR in its current state gives the expected output for #43132 but does not fix #42659, which looks like a very similar issue. Since both regressions were the result of the same commit, I had hoped that a single patch would resolve both issues. Also, because this is resolving a regression, we should target 1.3.x. We don't have a milestone for 1.3.x and 1.3.3 is scheduled for tomorrow, so I'll milestone 1.3.4 for now.
I have not looked in detail, but #43512 just ran through CI fine.
Thanks @AlexeyGy for the updates. No need to add the 1.3.4 whatsnew (it fails anyway until the 1.3.3 tag is available); it will be done just after the 1.3.3 release. The best bet is to leave the release note in 1.3.3 for now, get CI passing, and mark the PR as ready for review.
@simonjayhawkins thanks for getting back to me. Agreed; as suggested, I will adapt to 1.3.3 and try to get CI to pass.
force-pushed from 995a428 to 7ccffcf
pandas/_libs/groupby.pyx
Outdated
compensation[lab, j] = t - sumx[lab, j] - y
sumx[lab, j] = t

if val !=val or (is_datetimelike and val == NPY_NAT):
whitespace after "!="
can use treat_as_na?
Done for the first part. Thanks.
On the second part: good call! I originally reused treat_as_na in this PR; however, I found that it introduced a side effect in the other usage of treat_as_na in cummin_max: although _treat_as_na accepts rank_t as its value argument, its conditions currently specifically target int64_t values. Hence, I needed to add a condition for floating values.
This was made more complicated since:
- cummin_max uses groupby_t as the data type for its value array,
- group_mean uses floating as the data type for its value array, and
- _treat_as_na uses rank_t.
I wasn't sure how these union types behave in practice and had problems getting tests to pass, so I decided to go for this option.
force-pushed from d0f4646 to 1dddf6f
Looks like CI is failing after 60 minutes due to a timeout, or am I missing something? Not sure what to do now.
It's fine for now. These should now pass within the time limit, but they are still failing occasionally. Each time you push, be sure to merge master (if you are not already).
lgtm. cc @mzeitlin11 @jbrockmendel if any comments
@@ -703,7 +739,7 @@ def group_mean(floating[:, ::1] out,
            for j in range(K):
                val = values[i, j]
                # not nan
-               if val == val:
+               if val == val and not (is_datetimelike and val == NPY_NAT):
Similar comment here - would rather be explicit about the cast that's occurring to make this work (e.g. hopefully this is int equality and not float equality, since comparing floats is problematic).
On a brief investigation, one potential flaw here is that with nan_val = NPY_NAT, it holds true not only that val == NPY_NAT, but also that val == NPY_NAT + 1.
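To make that concern concrete, here is a small numpy-level illustration (not code from the PR): once the int64 value NPY_NAT + 1 has been cast to float64, it compares equal to NPY_NAT.

```python
import numpy as np

# NPY_NAT is the smallest int64 value, -(2**63).
NPY_NAT = np.iinfo(np.int64).min

# Cast NPY_NAT + 1 to float64, as the group_mean input values are.
as_float = np.float64(NPY_NAT + 1)

# The +1 is below float64 resolution at this magnitude, so it is lost:
print(as_float == NPY_NAT)       # True
print(int(as_float) == NPY_NAT)  # True
```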
We have an implicit guarantee that datetimelike arrays will be passed as int64 in calls to the function.
We can prevent any problems by asserting, at the top of the function, that is_datetimelike=True arrays come as int64. Going to try to do that and also add a test for NPY_NAT + 1 just to be sure. Sounds good?
I don't think that is possible, because the type floating does not include int64, which is why I assumed that the data is being cast to float before being passed to the algo.
A test for NPY_NAT + 1 would be great; if that's all good, then I'm happy here (also fine with this as is, since NPY_NAT handling is probably more important than the edge case of NPY_NAT + 1).
If there's an issue with NPY_NAT + 1, we could also just see if that gets fixed by changing the fused type to also include int64_t so that we can accept ints.
Yes, I was wrong. There is a float64 cast down the line: get_cython_func_and_vals is called a bit further down, which runs ensure_float64 and actually does a pointless copy of the whole input array to float64 (https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/algos_common_helper.pxi.in#L70).
We could change that code path to not do the cast for is_datetimelike, but that would necessitate changing the input values data type to a union type which includes int64. Re-using add_t seems the best way to go, since it would bring us one step closer to re-using group_add as suggested.
There is some risk with the union data type, though: if I look at the algorithm of group_add, for one, it has extra handling for the object dtype. However, that's nothing that can't be handled with an assert statement and an additional unit test.
What do you think?
I think groupby_t would actually be a good candidate (add_t doesn't appear to even include int64). Hopefully just including those types won't necessitate much change, and we can use _treat_as_na for more unified missing value handling since the fused types will be the same (see group_min_max for example).
What do you think?
- If we find a workable solution for GroupBy.mean, we should also try to use it for nanops.nanmean (used by DTI, Series, and DataFrame mean methods).
- If we stop casting to float64, we'll need to be extra careful about silent overflows.
- I think it would be reasonable to fix the NaT-specific problem separately, since the PR already presents a good solution for that.
Left some concerns about implicit casts and edge cases with data near NPY_NAT.
I don't think the concerns are easy to address, though, since we coerce data to float in group_mean. Since this is fixing a regression, I'd be fine merging as is, though I would feel better if testing was added to make sure that dates which become pd.NaT + 1 or pd.NaT + 2 don't now get treated as missing.
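A rough sketch of the kind of check being described (names and data are illustrative, not the test the PR ultimately adds): take a timestamp whose int64 representation is NPY_NAT + 1, i.e. pd.Timestamp.min, and see whether GroupBy.mean keeps it or silently turns it into NaT.

```python
import numpy as np
import pandas as pd

# pd.Timestamp.min has the int64 value NPY_NAT + 1 (one nanosecond above NaT).
ts = pd.Timestamp(np.iinfo(np.int64).min + 1)

df = pd.DataFrame({"key": ["a", "a"], "when": [ts, ts]})

# The mean of two identical, valid timestamps should ideally be that timestamp,
# not NaT; because group_mean works on float64 values, this edge case is at risk.
print(df.groupby("key")["when"].mean())
```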
As a sidenote, does anyone know why
historical :-> we should try to unify things.
Yeah, don't worry about that for now. This particular problem is hidden in a bunch of places and we don't have a great solution for it. I mentioned it because in that one particular line it happened to be more obvious than it is in other places.
Best guess is it wasn't an active decision; it just turned out that way. With many people working on focused pieces of the code over many years, refactoring opportunities pile up.
True if `values` contains datetime-like entries.
mask : ndarray[bool, ndim=2], optional
    Not used.
result_mask : ndarray[bool, ndim=2], optional
why add these if they aren't used?
oh, so you can use the same call from the python code?
Correct. Originally, I did not want to add useless extra params, but after this comment remarked that an additional path in the code is not wanted, I did not see any other way than to add these parameters here.
I also thought about actually using the params, but that has nothing to do with the original scope of this PR, so I chose not to invest more time here.
On the discussion about A. From what I see, both are union types, but I find it very hard to reason about Cython's floating. When I look into Shadow.py:
AFAIU
I am OK to merge this as it's patching a regression. cc @mzeitlin11 @jbrockmendel. Certainly this area could use some refactoring (e.g. should we have all functions share a common signature, as min, max, and mean do, so we can call them generically) or similar things.
Fine merging this too; I just want to make sure we're not causing a new regression (#43467 (review)), or that we accept this regression as more of an edge case than the other.
@AlexeyGy can you add a test for this?
Added a test and realized that we actually HAVE to stop converting to float, as per this comment thread, since float64 precision is way too low to even add 1 to NaT in the input data:

```python
import numpy as np

np.array(np.iinfo(np.int64).min).astype('float64') == (
    np.array(np.iinfo(np.int64).min).astype('float64') + 1
)
```

returns True. Actually, even adding 100 would have no effect; I've noticed that numbers <~1000 (didn't check the exact number) get lost due to fp-arithmetic.
I think it's around 2^54 where the conversion gets lossy.
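For reference, a quick interpreter check of where float64 stops distinguishing consecutive integers (this is the standard 53-bit mantissa limit of float64, not something specific to pandas):

```python
import numpy as np

# Integers up to 2**53 are exactly representable in float64; beyond that,
# adding 1 can be lost entirely.
print(np.float64(2**53) == np.float64(2**53 + 1))  # True: the +1 is lost
print(np.float64(2**53 - 1) == np.float64(2**53))  # False: still distinct

# Near abs(NPY_NAT) = 2**63 the gap between representable floats is much larger:
print(np.spacing(np.float64(2**63)))               # roughly 2048
```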
OK, I think this is fine to merge. We have this NaN/NaT comparison issue currently everywhere, so no reason to hold this up.
thanks @AlexeyGy very nice!
@meeseeksdev backport 1.3.x
Something went wrong ... Please have a look at my logs.
Co-authored-by: Alexey Györi <[email protected]>
Make group_mean compatible with NaT
Background
NaT is the datetime equivalent of NaN and is set to be the lowest possible 64-bit integer, -(2**63). Previously, we could not support this value in any groupby.mean() calculations, which led to #43132.
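A quick interpreter check of that statement (illustrative only, not part of the PR):

```python
import numpy as np
import pandas as pd

# NaT is stored as the smallest possible 64-bit integer.
print(pd.NaT.value)              # -9223372036854775808
print(pd.NaT.value == -(2**63))  # True

# Datetime64 data viewed as int64 shows NaT as that same sentinel value.
arr = np.array(["2021-01-01", "NaT"], dtype="datetime64[ns]")
print(arr.view("int64"))         # [1609459200000000000 -9223372036854775808]
```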
Implementation
On a high level, we slightly modify group_mean to not count NaT values. To do so, we introduce the is_datetimelike parameter to the function call (already present in other functions, e.g., group_cumsum).
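A minimal usage sketch of the behavior this change targets (illustrative data, not the PR's actual test case): NaT entries are skipped when computing the group mean, mirroring how NaN is skipped for float columns.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "key": ["a", "a", "b"],
        "when": [pd.Timestamp("2021-01-01"), pd.NaT, pd.Timestamp("2021-03-01")],
    }
)

# With this change, the NaT in group "a" is ignored rather than corrupting the mean.
print(df.groupby("key")["when"].mean())
# Expected (roughly):
# key
# a   2021-01-01
# b   2021-03-01
# Name: when, dtype: datetime64[ns]
```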
Tests
This PR adds integration and unit tests for the new functionality. In contrast to other tests in these classes, I've tried to keep each individual test's scope as small as possible.
Notes
Additionally, I've taken the liberty to: