BUG: Fix ts precision issue with groupby and NaT (#19526) #19530

Closed

1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.23.0.txt
@@ -644,6 +644,7 @@ Groupby/Resample/Rolling
 - Fixed regression in :func:`DataFrame.groupby` which would not emit an error when called with a tuple key not in the index (:issue:`18798`)
 - Bug in :func:`DataFrame.resample` which silently ignored unsupported (or mistyped) options for ``label``, ``closed`` and ``convention`` (:issue:`19303`)
 - Bug in :func:`DataFrame.groupby` where tuples were interpreted as lists of keys rather than as keys (:issue:`17979`, :issue:`18249`)
+- Bug in :func:`DataFrame.groupby` where aggregation by ``first``/``last``/``min``/``max`` was causing timestamps to lose precision (:issue:`19526`)
 - Bug in :func:`DataFrame.transform` where particular aggregation functions were being incorrectly cast to match the dtype(s) of the grouped data (:issue:`19200`)
 - Bug in :func:`DataFrame.groupby` passing the `on=` kwarg, and subsequently using ``.apply()`` (:issue:`17813`)

2 changes: 1 addition & 1 deletion pandas/core/groupby.py
@@ -2336,7 +2336,7 @@ def _cython_operation(self, kind, values, how, axis, min_count=-1):
             result = self._transform(
                 result, values, labels, func, is_numeric, is_datetimelike)

-        if is_integer_dtype(result):
+        if is_integer_dtype(result) and not is_datetimelike:
             mask = result == iNaT
Contributor

I think you can then remove this masking business (inside the `is_integer_dtype` block).

Contributor

This might also fix #16674. Can you add a test to check that (and, if it does, add it to the whatsnew)?

Contributor Author

It doesn't look like I can (easily) do this. If resample introduces missing values to an integer series, those get recorded as iNaT. Since the user is expecting nan for missing values, we have to recast the series to float and explicitly convert the iNaT values.

#16674 is about the case where iNaT was in the input dataframe as a legitimate integer value. At this point in the code, there is no way to disambiguate between that and true missing values.
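
The ambiguity described in this reply can be sketched with plain Python lists. This is only an illustration, not the pandas code path: `INAT` and `mask_sentinel` are hypothetical names, and the sketch assumes the sentinel is the smallest int64 value.

```python
import math

# Assumption: the missing-value sentinel is the smallest int64 value.
INAT = -2**63


def mask_sentinel(values):
    """Cast ints to floats, turning every sentinel value into NaN."""
    return [math.nan if v == INAT else float(v) for v in values]


# Case 1: the sentinel was introduced for an empty resample bin.
resampled = [1, INAT, 3]
# Case 2: the same integer was a legitimate value in the input data.
user_data = [1, INAT, 3]

out_a = mask_sentinel(resampled)
out_b = mask_sentinel(user_data)

# Both inputs are identical at this point, so both end up with NaN
# in the middle slot; the masking step cannot tell the cases apart.
print(out_a, out_b)
```

Telling the two cases apart would need information carried alongside the values (for example a separate mask), which is not available at this point in the function.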

             if mask.any():
                 result = result.astype('float64')
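
As a rough illustration of why the `float64` cast matters for datetimelike results: nanosecond timestamps are stored as 64-bit integers, and a `float64` has only 53 bits of mantissa, so a round trip through float cannot preserve every nanosecond value. The sketch below uses only the standard library and the timestamp from the new test; `to_epoch_ns` is a hypothetical helper, and the round trip is an assumption about where the precision was lost, not a copy of the pandas internals.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_ns(dt):
    """Nanoseconds since the Unix epoch as an exact Python int."""
    return (dt - EPOCH) // timedelta(microseconds=1) * 1000


# Same instant as Timestamp('2016-10-14 21:00:44.557') in the test.
ts = datetime(2016, 10, 14, 21, 0, 44, 557000, tzinfo=timezone.utc)
ns = to_epoch_ns(ts)

# Round trip through a 64-bit float, as an int64 -> float64 cast would do.
round_tripped = int(float(ns))
error_ns = round_tripped - ns

# Near 1.5e18, adjacent float64 values are 256 apart, so the value is
# snapped to a multiple of 256 and no longer matches the original.
print(ns, round_tripped, error_ns)
```

Skipping the cast for datetimelike data, as the changed condition does, keeps the integer representation intact, so `NaT` can stay encoded as the sentinel instead of being converted to `NaN`.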
20 changes: 19 additions & 1 deletion pandas/tests/groupby/aggregate/test_cython.py
@@ -12,8 +12,10 @@
 from numpy import nan
 import pandas as pd

-from pandas import bdate_range, DataFrame, Index, Series
+from pandas import (bdate_range, DataFrame, Index, Series, Timestamp,
+                    Timedelta, NaT)
 from pandas.core.groupby import DataError
+from pandas.core.indexes.numeric import Int64Index
 import pandas.util.testing as tm


@@ -187,3 +189,19 @@ def test_cython_agg_empty_buckets_nanops():
         {"a": [1, 1, 1716, 1]},
         index=pd.CategoricalIndex(intervals, name='a', ordered=True))
     tm.assert_frame_equal(result, expected)
+
+
+@pytest.mark.parametrize('op', ['first', 'last', 'max', 'min'])
+@pytest.mark.parametrize('data', [
+    Timestamp('2016-10-14 21:00:44.557'),
+    Timedelta('17088 days 21:00:44.557'), ])
+def test_cython_with_timestamp_and_nat(op, data):
+    # https://github.com/pandas-dev/pandas/issues/19526
+    df = DataFrame({'a': [0, 1], 'b': [data, NaT]})
+    index = Int64Index([0, 1], dtype='int64', name='a')
+
+    # We will group by a and test the cython aggregations
+    expected = DataFrame({'b': [data, NaT]}, index=index)
+
+    result = df.groupby('a').aggregate(op)
+    tm.assert_frame_equal(expected, result)