BUG: timeseries.groupby(...).transform('mean') wrong when aggregating over pd.NaT #43132

behrenhoff · 2021-08-20T13:26:40Z

Code Sample, a copy-pastable example

The problem is best shown in the following example:

import pandas as pd
s = pd.Series(['1 day', '3 days', pd.NaT], dtype='timedelta64[ns]')

print(f'{s.mean()=}\n')
# as expected: result is '2 days'

print(f"{s.apply('mean')=}\n")
# as expected: result is '2 days'

print("transform('mean')")
print(s.groupby([1,1,1]).transform('mean'))
# whoops: (a series of 3x) -35583 days +08:04:14.381741568
# expectation was: (a series of 3x) '2 days'

# Fixing the problem
print("\ntransform(pd.Series.mean)")
print(s.groupby([1,1,1]).transform(pd.Series.mean))
# (a series of 3x) '2 days', sanity restored :-)

Problem description

When passing a function name by string to transform, pd.NaT is not handled correctly. This not only happens for 'mean' but also for other functions like 'sum'.

I don't know how transform is looking up the function if a string is passed (and it's not really documented), but it certainly doesn't select pd.Series.mean but some other non-NaT-aware mean function.

This is surprising since NaT is handled correctly when calling apply('mean').

Tested with pd 1.3.2.

The text was updated successfully, but these errors were encountered:

AlexeyGy · 2021-08-22T08:00:21Z

Can I work on this issue?

simonjayhawkins · 2021-08-26T13:58:01Z

s.groupby([1, 1, 1]).transform("mean") was raising DataError: No numeric types to aggregate with 1.2.5

I suspect this issue maybe related to or a duplicate of #42659. will label as a regression pending further investigation cc @jbrockmendel

simonjayhawkins · 2021-08-27T11:50:28Z

I suspect this issue maybe related to or a duplicate of #42659. will label as a regression pending further investigation cc @jbrockmendel

first bad commit: [fa8f68e] BUG/API: SeriesGroupBy reduction with numeric_only=True (#41342)

AlexeyGy · 2021-08-27T21:56:02Z

This happens because pd.Series.mean and 'mean' will lead to totally different paths through the code and hence different calculations:

pd.Series.mean will use _transform_general and apply the given function which eventually gets us into DatetimeLikeArrayMixin#mean which fully supports nan values.
'mean' gets us into groupby#mean this in turn gets us to this line

values = values.view("int64")

which calls numpy and turns pd.NaT into a very low integer causing this problem.

Questions

@simonjayhawkins

Are there any plans to harmonize both paths? It would surely be easy to just make 'mean' take the same route as pd.Series.mean.
To mitigate, we could modify _resolve_numeric_only to return True if there are any NaN values in an otherwise numeric set. What do you think?

simonjayhawkins · 2021-08-31T12:02:05Z

Thanks @AlexeyGy for exploring this issue further.

Are there any plans to harmonize both paths? It would surely be easy to just make 'mean' take the same route as pd.Series.mean.

refactoring, de-duplication, cleaning and maintaining api consistency is an ongoing process, but that may not be the best option if we wanted to backport any fixes to 1.3.x unless we limited the dispatch from groupby.mean to DatetimeLikeArrayMixin.mean @jbrockmendel?

To mitigate, we could modify _resolve_numeric_only to return True if there are any NaN values in an otherwise numeric set.

There has been much discussion in #42395 and #43154 about changes to _resolve_numeric_only so I think further changes there is likely not desirable.

There is also the case that the 1.2.x behavior was incorrect and that the new behavior is a new bug and not a regression. But personally I would prefer to see 1.3.x patched to fix this.

#42659 is clearly a regression and it may well be that a fix for that case fixes this one too.

jbrockmendel · 2021-08-31T22:22:23Z

The correct long-term solution is for libgroupby.group_mean to take a is_datetimelike keyword like many of the other functions there do.

@AlexeyGy any interest in implementing this?

AlexeyGy · 2021-09-01T06:56:52Z

Yes, happy to help. Should not be a problem since I can peek at the other implementations in the file.

Awesome that there are even tests for both, libgroupby, and groupby already which I can extend.

## Background NaT is the datetime equivalent of NaN and is set to be the lowest possible 64 bit integer -(2**63). Previously, we could not support this value in any groupby.mean() calculations which lead to pandas-dev#43132. ## Implementation On a high level, we slightly modify the `group_mean` to not count NaT values. To do so, we introduce the `is_datetimelike` parameter to the function call (already present in other functions, e.g., `group_cumsum`) and refactor and extend `#_treat_as_na` to work with float64. ## Tests This PR add an additional integration and unit test for the new functionality. In contrast to other tests in classes, I've tried to keep an individual test's scope as small as possible. ## Notes Additionally, I've taken the liberty to: * Add a docstring for the group_mean algorithm. * Change the algorithm to use guard clauses instead of else/if. * Add a comment that we're using the Kahan summation (the compensation part initially confused me, and I only stumbled upon Kahan when browsing the file). - [x] closes pandas-dev#43132 - [x] tests added / passed - [x] Ensure all linting tests pass, see [here](https://pandas.pydata.org/pandas-docs/dev/development/contributing.html#code-standards) for how to run them - [x] whatsnew entry => different format but it's there

AlexeyGy · 2021-09-08T22:30:04Z

I've drafted up #43467. Will want to add a couple more tests to ensure that nothing is breaking but want to share it out early. Open for feedback on it.

NaT is the datetime equivalent of NaN and is set to be the lowest possible 64 bit integer -(2**63). Previously, we could not support this value in any groupby.mean() calculations which lead to pandas-dev#43132. On a high level, we slightly modify the `group_mean` to not count NaT values. To do so, we introduce the `is_datetimelike` parameter to the function call (already present in other functions, e.g., `group_cumsum`) and refactor and extend `#_treat_as_na` to work with float64. This PR add an additional integration and unit test for the new functionality. In contrast to other tests in classes, I've tried to keep an individual test's scope as small as possible. Additionally, I've taken the liberty to: * Add a docstring for the group_mean algorithm. * Change the algorithm to use guard clauses instead of else/if. * Add a comment that we're using the Kahan summation (the compensation part initially confused me, and I only stumbled upon Kahan when browsing the file). - [x] closes pandas-dev#43132 - [x] tests added / passed - [x] Ensure all linting tests pass, see [here](https://pandas.pydata.org/pandas-docs/dev/development/contributing.html#code-standards) for how to run them - [x] whatsnew entry => different format but it's there

NaT is the datetime equivalent of NaN and is set to be the lowest possible 64-bit integer -(2**63). Previously, we could not support this value in any `groupby.mean()` calculations which lead to pandas-dev#43132. On a high level, we slightly modify the `group_mean` to not count NaT values. To do so, we introduce the `is_datetimelike` parameter to the function call (already present in other functions, e.g., `group_cumsum`) and refactor and extend `#_treat_as_na` to work with float64. ## Tests This PR adds an integration and two unit tests for the new functionality. In contrast to other tests in classes, I've tried to keep an individual test's scope as small as possible. Additionally, I've taken the liberty to: * Add a docstring for the group_mean algorithm. * Change the algorithm to use guard clauses instead of if/else. * Add a comment that we're using the Kahan summation (the compensation part initially confused me, and I only stumbled upon Kahan when browsing the file). - [x] closes pandas-dev#43132 - [x] tests added / passed - [x] Ensure all linting tests pass, see [here](https://pandas.pydata.org/pandas-docs/dev/development/contributing.html#code-standards) for how to run them - [x] whatsnew entry

NaT is the datetime equivalent of NaN and is set to be the lowest possible 64-bit integer -(2**63). Previously, we could not support this value in any `groupby.mean()` calculations which lead to pandas-dev#43132.

behrenhoff added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 20, 2021

behrenhoff changed the title ~~BUG: timeseries.groupby(...).transform('mean') wrong if aggregating over pd.NaT~~ BUG: timeseries.groupby(...).transform('mean') wrong when aggregating over pd.NaT Aug 20, 2021

mroeschke added Apply Apply, Aggregate, Transform, Map Groupby and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 22, 2021

simonjayhawkins added this to the 1.3.3 milestone Aug 26, 2021

simonjayhawkins added the Regression Functionality that used to work in a prior pandas version label Aug 26, 2021

simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Aug 27, 2021

code sample for pandas-dev#43132

2b2e7cc

AlexeyGy mentioned this issue Sep 8, 2021

Make group_mean compatible with NaT #43467

Merged

4 tasks

simonjayhawkins modified the milestones: 1.3.3, 1.3.4 Sep 11, 2021

jreback closed this as completed in #43467 Sep 18, 2021

simonjayhawkins mentioned this issue Sep 18, 2021

Backport PR #43608: REGR: spearman corr raising on 32-bit #43614

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: timeseries.groupby(...).transform('mean') wrong when aggregating over pd.NaT #43132

BUG: timeseries.groupby(...).transform('mean') wrong when aggregating over pd.NaT #43132

behrenhoff commented Aug 20, 2021

AlexeyGy commented Aug 22, 2021

simonjayhawkins commented Aug 26, 2021

simonjayhawkins commented Aug 27, 2021

AlexeyGy commented Aug 27, 2021 •

edited

Loading

simonjayhawkins commented Aug 31, 2021

jbrockmendel commented Aug 31, 2021

AlexeyGy commented Sep 1, 2021

AlexeyGy commented Sep 8, 2021

BUG: timeseries.groupby(...).transform('mean') wrong when aggregating over pd.NaT #43132

BUG: timeseries.groupby(...).transform('mean') wrong when aggregating over pd.NaT #43132

Comments

behrenhoff commented Aug 20, 2021

Code Sample, a copy-pastable example

Problem description

AlexeyGy commented Aug 22, 2021

simonjayhawkins commented Aug 26, 2021

simonjayhawkins commented Aug 27, 2021

AlexeyGy commented Aug 27, 2021 • edited Loading

Questions

simonjayhawkins commented Aug 31, 2021

jbrockmendel commented Aug 31, 2021

AlexeyGy commented Sep 1, 2021

AlexeyGy commented Sep 8, 2021

AlexeyGy commented Aug 27, 2021 •

edited

Loading