Make group_mean compatible with NaT #43467

AlexeyGy · 2021-09-08T22:28:28Z

Make group_mean compatible with NaT

Background

NaT is the datetime equivalent of NaN and is set to be the lowest possible 64-bit integer -(2**63). Previously, we could not support this value in any groupby.mean() calculations which lead to #43132.

Implementation

On a high level, we slightly modify the group_mean to not count NaT values. To do so, we introduce the is_datetimelike parameter to the function call (already present in other functions, e.g., group_cumsum).

Tests

This PR adds integration and unit tests for the new functionality. In contrast to other tests in classes, I've tried to keep an individual test's scope as small as possible.

Notes

Additionally, I've taken the liberty to:

Add a docstring for the group_mean algorithm.

closes BUG: timeseries.groupby(...).transform('mean') wrong when aggregating over pd.NaT #43132
tests added / passed
Ensure all linting tests pass, see here for how to run them
whatsnew entry

AlexeyGy · 2021-09-09T07:24:24Z

Looks like there is a conflict with 6f4c382, will add mask and result_mask as parameters. @jbrockmendel .

jbrockmendel · 2021-09-09T15:45:27Z

Looks like there is a conflict with

you'll need to merge master and resolve the conflict

AlexeyGy · 2021-09-09T16:23:59Z

@jbrockmendel There not a merge conflict but a semantic conflict: mask and result_mask were added to the aggregate invocation that this commit modifies in ops.py but that shouldn’t be too hard to fix (however, a bit more difficult than just a merge conflict). Just need some more time.

AlexeyGy · 2021-09-10T21:44:22Z

I have a couple of questions for the review:

I am not sure if I filled out whatsnew correctly.
Should I have put the new integration test under pandas/tests/groupby/transform/test_transform.py or under pandas/tests/groupby/test_groupby.py. I've opted for just one integration test, should I create more? If yes, any tests that you would like to see?

AlexeyGy · 2021-09-11T10:38:19Z

Master is currently broken, CI cannot pass:

pytest pandas/tests/groupby/test_quantile.py -x

results in

>   ndarray[int64_t] counts, non_na_counts, sort_arr
E   TypeError: group_quantile() takes exactly 6 positional arguments (4 given)

Waiting on green master before submitting the PR for review.

simonjayhawkins · 2021-09-11T12:24:29Z

@AlexeyGy Thanks for the PR.

The PR in the current state gives the expected output for #43132 but does not fix #42659 which looks like a very similar issue.

The fact that both these regressions were the result of the same commit, I had hoped that a single patch would resolve both issues.

Also because this is resolving a regression we should target 1.3.x. We don't have a milestone for 1.3.x and 1.3.3 is scheduled for tomorrow. I'll milestone 1.3.4 for now.

simonjayhawkins · 2021-09-11T12:33:00Z

Master is currently broken, CI cannot pass:

not looked in detail but #43512 just ran through ci fine.

simonjayhawkins · 2021-09-11T21:29:13Z

Thanks @AlexeyGy for the updates. No need to add the 1.3.4 whatsnew (fails anyway until the 1.3.3 tag is available) it will be done just after the 1.3.3 release.

best bet is to leave the release note in 1.3.3 for now, get ci passing and mark the PR as ready for review.

AlexeyGy · 2021-09-11T22:13:12Z

@simonjayhawkins thanks for getting back. Agreed, as suggested, will adapt to 1.3.3 and try to get CI to pass.

jbrockmendel · 2021-09-12T03:12:15Z

pandas/_libs/groupby.pyx

-                    compensation[lab, j] = t - sumx[lab, j] - y
-                    sumx[lab, j] = t
-
+                if val !=val or (is_datetimelike and val == NPY_NAT):


whitespace after "!="

can use treat_as_na?

Done for the first part. Thanks.

On the second part: Good call! I originally reused treat_as_na in this PR, however, I found that it introduced a side-effect in the other usage of treat_as_na in cummin_max since, although _treat_as_na accepts rank_t as its value argument, it's conditions are currently specifically targeting int64_t values. Hence, I needed to add a condition for floating values.

This was made more complicated since:

cummin_max uses groupby_t as the data_type for its value array

group_mean uses floating as the data_type for its value array

and _treat_as_na usesrank_t

I wasn't sure how these union types behave in practice and had problems getting tests to pass so decided to go for this option.

pandas/_libs/groupby.pyx

pandas/tests/groupby/transform/test_transform.py

doc/source/whatsnew/v1.3.3.rst

pandas/_libs/groupby.pyx

pep8speaks · 2021-09-12T12:36:42Z

Hello @AlexeyGy! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-09-15 19:43:05 UTC

pandas/_libs/groupby.pyx

pandas/core/groupby/ops.py

pandas/tests/groupby/aggregate/test_aggregate.py

doc/source/whatsnew/v1.3.3.rst

AlexeyGy · 2021-09-13T19:09:30Z

Looks like CI is failing after 60 mins due to timeout or am I missing something? Not sure what to do now.

jreback · 2021-09-13T19:15:04Z

Looks like CI is failing after 60 mins due to timeout or am I missing something? Not sure what to do now.

its fine for now. these should now pass in the time limit, but are still failing occasionaly, each time you push be sure to merge master (if you are not already).

jreback

lgtm. cc @mzeitlin11 @jbrockmendel if any comments

pandas/_libs/groupby.pyx

mzeitlin11 · 2021-09-13T20:38:38Z

pandas/_libs/groupby.pyx

@@ -703,7 +739,7 @@ def group_mean(floating[:, ::1] out,
            for j in range(K):
                val = values[i, j]
                # not nan
-                if val == val:
+                if val == val and not (is_datetimelike and val == NPY_NAT):


Similar comment here - would rather be explicit about the cast that's occurring to make this work (eg hopefully this is int equality, and not float equality since comparing floats is problematic).

On a brief investigation, one potential flaw here is that with nan_val = NPY_NAT, it holds true not only that val == NPY_NAT, but also that val == NPY_NAT + 1

We have an implicit guarantee that datetimelike arrays will be returned as int64 in calls to the function.

We can prevent any problems via asserting that is_datetimelike=True arrays come as int64 at the top of the function. Going to try to do that and also add a test for NPY_NAT + 1 just to be sure. Sounds good?

I don't think that is possible because the type floating does not include int64, which is why I am assumed that the data is being cast to float before being passed to the algo.

A test for NPY_NAT +1 would be great, if that's all good, then happy here (also fine with this as is, NPY_NAT handling is probably more an important than the edge case of NPY_NAT + 1

If there's an issue with NPY_NAT + 1, could also just see if that gets fixed by changing the fused type to also include int64_t so that we can accept ints

Yes, I was wrong. There is a float64 cast down the line which occurs because get_cython_func_and_vals is called a bit further down which runs ensure_float64 and actually does a pointless copy to float64 operation of the whole input array https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/algos_common_helper.pxi.in#L70.

We could change that path through the code to not do the cast for is_datetimelike but that would necessitate changing the input values data type to a union type which includes int64. Re-using add_t seems the best way to go since it would bring us one step closer to re-using group_add as suggested.

There is some risk through the union data type though, If I look at the algorithm of group_add, for one, it has extra-handling for the object dtype). However, that's nothing that can't be handled with an assert statement and an additional unit test.

What do you think?

I think groupby_t would actually be a good candidate (add_t doesn't appear to even include int64). Hopefully just including those types won't necessitate much change (and can use _treat_as_na for more unified missing value handling since fused types will be the same (for example looking in group_min_max

What do you think?

If we find a workable solution for GroupBy.mean, we should also try to use it for nanops.nanmean (used by DTI, Series, and DataFrame mean methods)

if we stop casting to float64, we'll need to be extra-careful about silent overflows.

I think it would be reasonable to fix the NaT-specific problem separately since the PR already presents a good solution for that

mzeitlin11

Left some concerns about implicit casts and edge cases with data near NPY_NAT.

I don't think the concerns are easy to address, though, since we coerce data to float in group_mean. Since this is fixing a regression, I'd be fine merging as is, though would feel better if testing was added to make sure dates that become pd.NaT + 1, pd.NaT + 2 don't now become treated as missing.

mzeitlin11 · 2021-09-13T21:00:09Z

As a sidenote, does anyone know why group_mean doesn't use group_add? Seems like the right long-term solution would be to allow group_mean to take the fused type add_t, then just call group_add to do all the additions it does anyway. This would fix the issues described above since datetimelike ints wouldn't need to be cast to float (although it appears group_add doesn't use the datetimelike arg, so that might need to be fixed first).

jreback · 2021-09-13T21:45:26Z

As a sidenote, does anyone know why group_mean doesn't use group_add? Seems like the right long-term solution would be to allow group_mean to take the fused type add_t, then just call group_add to do all the additions it does anyway. This would fix the issues described above since datetimelike ints wouldn't need to be cast to float (although it appears group_add doesn't use the datetimelike arg, so that might need to be fixed first).

historical :-> we should try to unify things.

jbrockmendel · 2021-09-14T23:51:34Z

Left some concerns about implicit casts and edge cases with data near NPY_NAT.

Yah don't worry about that for now. This particular problem pop is hidden in a bunch of places and we don't have a great solution for it. I mentioned it because in that one particular line it happened to be more obvious than it is in other places.

As a sidenote, does anyone know why group_mean doesn't use group_add?

Best guess is it wasn't an active decision, just turned out that way. Many people working on focused pieces of the code over many years, refactoring opportunities pile up.

jbrockmendel · 2021-09-14T23:52:17Z

pandas/_libs/groupby.pyx

+        True if `values` contains datetime-like entries.
+    mask : ndarray[bool, ndim=2], optional
+        Not used.
+    result_mask : ndarray[bool, ndim=2], optional


why adding these if they arent used?

oh, so you can use the same call from the python code?

Correct, originally, I did not want to add useless extra params but after this comment here remarked that an additional path in the code is not wanted, I did not see any other way than to add these parameters here.

I also thought about actually using the params but that has nothing to do with the original scope of this PR so I chose not to invest more time here.

AlexeyGy · 2021-09-15T20:07:35Z

On the discussion about group_add and group_mean, just a quick question: which type does the pandas team prefer?

A. add_t like group_add
B. or the Cython-native floating group_mean`
as an argument for functions?

From what I see both are union types but I find it very hard to argue about Cython's floating. When I look into Shadow.py: integral = floating = numeric = _FusedType() it seems very non-transparent what it can or cannot be.

mzeitlin11 · 2021-09-15T20:29:13Z

On the discussion about group_add and group_mean, just a quick question: which type does the pandas team prefer?

AFAIU floating is just another fused type (https://cython.readthedocs.io/en/latest/src/userguide/fusedtypes.html#built-in-fused-types). So not really a difference except that floating is defined by cython. But I think it's equivalent to just defining our own fused type with float32 and float64

jreback · 2021-09-16T18:53:26Z

i am ok to merge this as its patching a regression, cc @mzeitlin11 @jbrockmendel certainly this area could use some refactoring (e.g. should we have all function shave the common signature as min,max,mean so we can call them generically) or similar things.

mzeitlin11 · 2021-09-16T19:09:07Z

i am ok to merge this as its patching a regression

Fine merging this too, just want to make sure not causing a new regression (#43467 (review)) (or that we accept this regression as more of an edge case than the other)

jreback · 2021-09-16T19:35:46Z

@AlexeyGy can you add a test for this

I don't think the concerns are easy to address, though, since we coerce data to float in group_mean. Since this is fixing a regression, I'd be fine merging as is, though would feel better if testing was added to make sure dates that become pd.NaT + 1, pd.NaT + 2 don't now become treated as missing.

AlexeyGy · 2021-09-16T22:07:08Z

Added a test and realized that we actually HAVE to stop converting to float as per this comment thread since the float64 precision is way too low to even add 1 to NaT in the input data:

import numpy as np
np.array(np.iinfo(np.int64).min).astype('float64') ==  ( np.array(np.iinfo(np.int64).min).astype('float64') + 1)

Returns True.

Actually, even adding 100 would have no effect, I've noticed that numbers <~1000 (didn't check the exact number) get lost due to fp-arithmetic.

jbrockmendel · 2021-09-17T22:52:48Z

I've noticed that numbers <~1000 (didn't check the exact number)

I think its around 2^54 where the conversion gets lossy

jreback · 2021-09-18T02:30:39Z

ok i think this is fine to merge. we have this NaN/NaT comparison issue currently everyone so no reason to hold this up.

jreback · 2021-09-18T02:31:59Z

thanks @AlexeyGy very nice!

jreback · 2021-09-18T02:32:06Z

@meeseeksdev backport 1.3.x

lumberbot-app · 2021-09-18T02:32:24Z

Something went wrong ... Please have a look at my logs.

Co-authored-by: Alexey Györi <[email protected]>

AlexeyGy mentioned this pull request Sep 8, 2021

BUG: timeseries.groupby(...).transform('mean') wrong when aggregating over pd.NaT #43132

Closed

jreback added Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Sep 9, 2021

AlexeyGy force-pushed the groupby-mean-datetimelike branch 2 times, most recently from 3f271b2 to f023f4b Compare September 9, 2021 16:17

AlexeyGy force-pushed the groupby-mean-datetimelike branch 3 times, most recently from 945f466 to 85f2d92 Compare September 10, 2021 12:52

simonjayhawkins added this to the 1.3.4 milestone Sep 11, 2021

AlexeyGy force-pushed the groupby-mean-datetimelike branch from 995a428 to 7ccffcf Compare September 11, 2021 22:38