Skip to content

Make group_mean compatible with NaT #43467

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Sep 18, 2021

Conversation

AlexeyGy
Copy link
Contributor

@AlexeyGy AlexeyGy commented Sep 8, 2021

Make group_mean compatible with NaT

Background

NaT is the datetime equivalent of NaN and is set to be the lowest possible 64-bit integer -(2**63). Previously, we could not support this value in any groupby.mean() calculations which lead to #43132.

Implementation

On a high level, we slightly modify the group_mean to not count NaT values. To do so, we introduce the is_datetimelike parameter to the function call (already present in other functions, e.g., group_cumsum).

Tests

This PR adds integration and unit tests for the new functionality. In contrast to other tests in classes, I've tried to keep an individual test's scope as small as possible.

Notes

Additionally, I've taken the liberty to:

  • Add a docstring for the group_mean algorithm.

@AlexeyGy
Copy link
Contributor Author

AlexeyGy commented Sep 9, 2021

Looks like there is a conflict with 6f4c382, will add mask and result_mask as parameters. @jbrockmendel .

@jbrockmendel
Copy link
Member

Looks like there is a conflict with

you'll need to merge master and resolve the conflict

@AlexeyGy AlexeyGy force-pushed the groupby-mean-datetimelike branch 2 times, most recently from 3f271b2 to f023f4b Compare September 9, 2021 16:17
@AlexeyGy
Copy link
Contributor Author

AlexeyGy commented Sep 9, 2021

@jbrockmendel There not a merge conflict but a semantic conflict: mask and result_mask were added to the aggregate invocation that this commit modifies in ops.py but that shouldn’t be too hard to fix (however, a bit more difficult than just a merge conflict). Just need some more time.

@AlexeyGy AlexeyGy force-pushed the groupby-mean-datetimelike branch 3 times, most recently from 945f466 to 85f2d92 Compare September 10, 2021 12:52
@AlexeyGy
Copy link
Contributor Author

I have a couple of questions for the review:

  • I am not sure if I filled out whatsnew correctly.
  • Should I have put the new integration test under pandas/tests/groupby/transform/test_transform.py or under pandas/tests/groupby/test_groupby.py. I've opted for just one integration test, should I create more? If yes, any tests that you would like to see?

@AlexeyGy
Copy link
Contributor Author

Master is currently broken, CI cannot pass:

pytest pandas/tests/groupby/test_quantile.py -x

results in

>   ndarray[int64_t] counts, non_na_counts, sort_arr
E   TypeError: group_quantile() takes exactly 6 positional arguments (4 given)

Waiting on green master before submitting the PR for review.

@simonjayhawkins
Copy link
Member

@AlexeyGy Thanks for the PR.

The PR in the current state gives the expected output for #43132 but does not fix #42659 which looks like a very similar issue.

The fact that both these regressions were the result of the same commit, I had hoped that a single patch would resolve both issues.

Also because this is resolving a regression we should target 1.3.x. We don't have a milestone for 1.3.x and 1.3.3 is scheduled for tomorrow. I'll milestone 1.3.4 for now.

@simonjayhawkins simonjayhawkins added this to the 1.3.4 milestone Sep 11, 2021
@simonjayhawkins
Copy link
Member

Master is currently broken, CI cannot pass:

not looked in detail but #43512 just ran through ci fine.

@simonjayhawkins
Copy link
Member

Thanks @AlexeyGy for the updates. No need to add the 1.3.4 whatsnew (fails anyway until the 1.3.3 tag is available) it will be done just after the 1.3.3 release.

best bet is to leave the release note in 1.3.3 for now, get ci passing and mark the PR as ready for review.

@AlexeyGy
Copy link
Contributor Author

@simonjayhawkins thanks for getting back. Agreed, as suggested, will adapt to 1.3.3 and try to get CI to pass.

@AlexeyGy AlexeyGy force-pushed the groupby-mean-datetimelike branch from 995a428 to 7ccffcf Compare September 11, 2021 22:38
compensation[lab, j] = t - sumx[lab, j] - y
sumx[lab, j] = t

if val !=val or (is_datetimelike and val == NPY_NAT):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whitespace after "!="

can use treat_as_na?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done for the first part. Thanks.

On the second part: Good call! I originally reused treat_as_na in this PR, however, I found that it introduced a side-effect in the other usage of treat_as_na in cummin_max since, although _treat_as_na accepts rank_t as its value argument, it's conditions are currently specifically targeting int64_t values. Hence, I needed to add a condition for floating values.

This was made more complicated since:

  • cummin_max uses groupby_t as the data_type for its value array
  • group_mean uses floating as the data_type for its value array
  • and _treat_as_na usesrank_t

I wasn't sure how these union types behave in practice and had problems getting tests to pass so decided to go for this option.

@pep8speaks
Copy link

pep8speaks commented Sep 12, 2021

Hello @AlexeyGy! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-09-15 19:43:05 UTC

@AlexeyGy AlexeyGy force-pushed the groupby-mean-datetimelike branch 2 times, most recently from d0f4646 to 1dddf6f Compare September 13, 2021 16:37
@AlexeyGy
Copy link
Contributor Author

Looks like CI is failing after 60 mins due to timeout or am I missing something? Not sure what to do now.

@AlexeyGy AlexeyGy marked this pull request as ready for review September 13, 2021 19:09
@jreback
Copy link
Contributor

jreback commented Sep 13, 2021

Looks like CI is failing after 60 mins due to timeout or am I missing something? Not sure what to do now.

its fine for now. these should now pass in the time limit, but are still failing occasionaly, each time you push be sure to merge master (if you are not already).

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. cc @mzeitlin11 @jbrockmendel if any comments

@@ -703,7 +739,7 @@ def group_mean(floating[:, ::1] out,
for j in range(K):
val = values[i, j]
# not nan
if val == val:
if val == val and not (is_datetimelike and val == NPY_NAT):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar comment here - would rather be explicit about the cast that's occurring to make this work (eg hopefully this is int equality, and not float equality since comparing floats is problematic).

On a brief investigation, one potential flaw here is that with nan_val = NPY_NAT, it holds true not only that val == NPY_NAT, but also that val == NPY_NAT + 1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have an implicit guarantee that datetimelike arrays will be returned as int64 in calls to the function.

We can prevent any problems via asserting that is_datetimelike=True arrays come as int64 at the top of the function. Going to try to do that and also add a test for NPY_NAT + 1 just to be sure. Sounds good?

Copy link
Member

@mzeitlin11 mzeitlin11 Sep 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that is possible because the type floating does not include int64, which is why I am assumed that the data is being cast to float before being passed to the algo.

A test for NPY_NAT +1 would be great, if that's all good, then happy here (also fine with this as is, NPY_NAT handling is probably more an important than the edge case of NPY_NAT + 1

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there's an issue with NPY_NAT + 1, could also just see if that gets fixed by changing the fused type to also include int64_t so that we can accept ints

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I was wrong. There is a float64 cast down the line which occurs because get_cython_func_and_vals is called a bit further down which runs ensure_float64 and actually does a pointless copy to float64 operation of the whole input array https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/algos_common_helper.pxi.in#L70.

We could change that path through the code to not do the cast for is_datetimelike but that would necessitate changing the input values data type to a union type which includes int64. Re-using add_t seems the best way to go since it would bring us one step closer to re-using group_add as suggested.

There is some risk through the union data type though, If I look at the algorithm of group_add, for one, it has extra-handling for the object dtype). However, that's nothing that can't be handled with an assert statement and an additional unit test.

What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think groupby_t would actually be a good candidate (add_t doesn't appear to even include int64). Hopefully just including those types won't necessitate much change (and can use _treat_as_na for more unified missing value handling since fused types will be the same (for example looking in group_min_max

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think?

  1. If we find a workable solution for GroupBy.mean, we should also try to use it for nanops.nanmean (used by DTI, Series, and DataFrame mean methods)
  2. if we stop casting to float64, we'll need to be extra-careful about silent overflows.
  3. I think it would be reasonable to fix the NaT-specific problem separately since the PR already presents a good solution for that

Copy link
Member

@mzeitlin11 mzeitlin11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some concerns about implicit casts and edge cases with data near NPY_NAT.

I don't think the concerns are easy to address, though, since we coerce data to float in group_mean. Since this is fixing a regression, I'd be fine merging as is, though would feel better if testing was added to make sure dates that become pd.NaT + 1, pd.NaT + 2 don't now become treated as missing.

@mzeitlin11
Copy link
Member

As a sidenote, does anyone know why group_mean doesn't use group_add? Seems like the right long-term solution would be to allow group_mean to take the fused type add_t, then just call group_add to do all the additions it does anyway. This would fix the issues described above since datetimelike ints wouldn't need to be cast to float (although it appears group_add doesn't use the datetimelike arg, so that might need to be fixed first).

@jreback
Copy link
Contributor

jreback commented Sep 13, 2021

As a sidenote, does anyone know why group_mean doesn't use group_add? Seems like the right long-term solution would be to allow group_mean to take the fused type add_t, then just call group_add to do all the additions it does anyway. This would fix the issues described above since datetimelike ints wouldn't need to be cast to float (although it appears group_add doesn't use the datetimelike arg, so that might need to be fixed first).

historical :-> we should try to unify things.

@jbrockmendel
Copy link
Member

Left some concerns about implicit casts and edge cases with data near NPY_NAT.

Yah don't worry about that for now. This particular problem pop is hidden in a bunch of places and we don't have a great solution for it. I mentioned it because in that one particular line it happened to be more obvious than it is in other places.

As a sidenote, does anyone know why group_mean doesn't use group_add?

Best guess is it wasn't an active decision, just turned out that way. Many people working on focused pieces of the code over many years, refactoring opportunities pile up.

True if `values` contains datetime-like entries.
mask : ndarray[bool, ndim=2], optional
Not used.
result_mask : ndarray[bool, ndim=2], optional
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why adding these if they arent used?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, so you can use the same call from the python code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, originally, I did not want to add useless extra params but after this comment here remarked that an additional path in the code is not wanted, I did not see any other way than to add these parameters here.

I also thought about actually using the params but that has nothing to do with the original scope of this PR so I chose not to invest more time here.

@AlexeyGy
Copy link
Contributor Author

AlexeyGy commented Sep 15, 2021

On the discussion about group_add and group_mean, just a quick question: which type does the pandas team prefer?

A. add_t like group_add
B. or the Cython-native floating group_mean`
as an argument for functions?

From what I see both are union types but I find it very hard to argue about Cython's floating. When I look into Shadow.py: integral = floating = numeric = _FusedType() it seems very non-transparent what it can or cannot be.

@mzeitlin11
Copy link
Member

On the discussion about group_add and group_mean, just a quick question: which type does the pandas team prefer?

AFAIU floating is just another fused type (https://cython.readthedocs.io/en/latest/src/userguide/fusedtypes.html#built-in-fused-types). So not really a difference except that floating is defined by cython. But I think it's equivalent to just defining our own fused type with float32 and float64

@jreback
Copy link
Contributor

jreback commented Sep 16, 2021

i am ok to merge this as its patching a regression, cc @mzeitlin11 @jbrockmendel certainly this area could use some refactoring (e.g. should we have all function shave the common signature as min,max,mean so we can call them generically) or similar things.

@mzeitlin11
Copy link
Member

i am ok to merge this as its patching a regression

Fine merging this too, just want to make sure not causing a new regression (#43467 (review)) (or that we accept this regression as more of an edge case than the other)

@jreback
Copy link
Contributor

jreback commented Sep 16, 2021

@AlexeyGy can you add a test for this

I don't think the concerns are easy to address, though, since we coerce data to float in group_mean. Since this is fixing a regression, I'd be fine merging as is, though would feel better if testing was added to make sure dates that become pd.NaT + 1, pd.NaT + 2 don't now become treated as missing.

@AlexeyGy
Copy link
Contributor Author

AlexeyGy commented Sep 16, 2021

Added a test and realized that we actually HAVE to stop converting to float as per this comment thread since the float64 precision is way too low to even add 1 to NaT in the input data:

import numpy as np
np.array(np.iinfo(np.int64).min).astype('float64') ==  ( np.array(np.iinfo(np.int64).min).astype('float64') + 1)

Returns True.

Actually, even adding 100 would have no effect, I've noticed that numbers <~1000 (didn't check the exact number) get lost due to fp-arithmetic.

@jbrockmendel
Copy link
Member

I've noticed that numbers <~1000 (didn't check the exact number)

I think its around 2^54 where the conversion gets lossy

@jreback
Copy link
Contributor

jreback commented Sep 18, 2021

ok i think this is fine to merge. we have this NaN/NaT comparison issue currently everyone so no reason to hold this up.

@jreback jreback merged commit 011c016 into pandas-dev:master Sep 18, 2021
@jreback
Copy link
Contributor

jreback commented Sep 18, 2021

thanks @AlexeyGy very nice!

@jreback
Copy link
Contributor

jreback commented Sep 18, 2021

@meeseeksdev backport 1.3.x

@lumberbot-app
Copy link

lumberbot-app bot commented Sep 18, 2021

Something went wrong ... Please have a look at my logs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: timeseries.groupby(...).transform('mean') wrong when aggregating over pd.NaT
6 participants