-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Fix ts precision issue with groupby and NaT (#19526) #19530
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #19530 +/- ##
==========================================
+ Coverage 91.6% 91.6% +<.01%
==========================================
Files 150 150
Lines 48750 48750
==========================================
+ Hits 44657 44658 +1
+ Misses 4093 4092 -1
Continue to review full report at Codecov.
|
doc/source/whatsnew/v0.23.0.txt
Outdated
@@ -548,7 +548,7 @@ Groupby/Resample/Rolling | |||
- Bug in :func:`DataFrame.resample` which silently ignored unsupported (or mistyped) options for ``label``, ``closed`` and ``convention`` (:issue:`19303`) | |||
- Bug in :func:`DataFrame.groupby` where tuples were interpreted as lists of keys rather than as keys (:issue:`17979`, :issue:`18249`) | |||
- Bug in ``transform`` where particular aggregation functions were being incorrectly cast to match the dtype(s) of the grouped data (:issue:`19200`) | |||
- | |||
- Bug in :func:`DataFrame.groupby` where the use of cython aggregation functions was causing timestamps to lose precision (:issue:`19526`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you list them (first/last/min/max)
pandas/tests/groupby/test_groupby.py
Outdated
@@ -2758,6 +2758,27 @@ def test_tuple_correct_keyerror(self): | |||
with tm.assert_raises_regex(KeyError, "(7, 8)"): | |||
df.groupby((7, 8)).mean() | |||
|
|||
def test_cython_with_timestamp_and_nat(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you parameterize over them instead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also parameterize over a timedelta dtyped Series as well
pandas/tests/groupby/test_groupby.py
Outdated
# https://github.com/pandas-dev/pandas/issues/19526 | ||
ts = pd.Timestamp('2016-10-14 21:00:44.557') | ||
df = pd.DataFrame({'a': [0, 1], 'b': [ts, pd.NaT]}) | ||
index = pd.Int64Index([0, 1], dtype='int64', name='a') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can write the expected like
expected = DataFrame({'b': [ts, NaT]}, index=Index([0, 1, name='a'))
note we don't use pd.
anywhere (the imports are at the top)
pandas/tests/groupby/test_groupby.py
Outdated
# We will group by a and test the cython aggregations | ||
expected = pd.DataFrame({'b': [ts, pd.NaT]}, index=index) | ||
|
||
result = df.groupby('a').max() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this test needs to be move to pandas/groupby/aggregate/test_cython.py
@@ -2324,7 +2324,7 @@ def _cython_operation(self, kind, values, how, axis, min_count=-1): | |||
result = self._transform( | |||
result, values, labels, func, is_numeric, is_datetimelike) | |||
|
|||
if is_integer_dtype(result): | |||
if is_integer_dtype(result) and not is_datetimelike: | |||
mask = result == iNaT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can then remove this masking bizness (inside the is_integer_dtype)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this might also fix #16674, can you add a test and see for that (and if so add to the whatsnew).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't look like I can (easily) do this. If resample
introduces missing values to an integer series, those get recorded as iNaT. Since the user is expecting nan for missing values, we have to recast the series to float and explicitly convert the iNaT values.
#16674 is about the case where iNaT was in the input dataframe as a legitimate integer value. At this point in the code, there is no way to disambiguate between that and true missing values.
8071bbb
to
af37225
Compare
a44916a
to
2fb23d6
Compare
thanks! keep em coming! |
thanks! this was a good experience for a first commit. |
closes pandas-dev#19526 Author: Jason Bandlow <[email protected]> Closes pandas-dev#19530 from jbandlow/timestamp_float_conversion and squashes the following commits: 2fb23d6 [Jason Bandlow] merge af37225 [Jason Bandlow] BUG: Fix ts precision issue with groupby and NaT (pandas-dev#19526)
.last()
#19526git diff upstream/master -u -- "*.py" | flake8 --diff