Skip to content

BUG: Fix ts precision issue with groupby and NaT (#19526) #19530

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

jbandlow
Copy link
Contributor

@jbandlow jbandlow commented Feb 4, 2018

@codecov
Copy link

codecov bot commented Feb 4, 2018

Codecov Report

Merging #19530 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #19530      +/-   ##
==========================================
+ Coverage    91.6%    91.6%   +<.01%     
==========================================
  Files         150      150              
  Lines       48750    48750              
==========================================
+ Hits        44657    44658       +1     
+ Misses       4093     4092       -1
Flag Coverage Δ
#multiple 89.97% <100%> (ø) ⬆️
#single 41.75% <0%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/groupby.py 92.17% <100%> (ø) ⬆️
pandas/io/parsers.py 95.56% <0%> (+0.06%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 93c86aa...2fb23d6. Read the comment docs.

@@ -548,7 +548,7 @@ Groupby/Resample/Rolling
- Bug in :func:`DataFrame.resample` which silently ignored unsupported (or mistyped) options for ``label``, ``closed`` and ``convention`` (:issue:`19303`)
- Bug in :func:`DataFrame.groupby` where tuples were interpreted as lists of keys rather than as keys (:issue:`17979`, :issue:`18249`)
- Bug in ``transform`` where particular aggregation functions were being incorrectly cast to match the dtype(s) of the grouped data (:issue:`19200`)
-
- Bug in :func:`DataFrame.groupby` where the use of cython aggregation functions was causing timestamps to lose precision (:issue:`19526`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you list them (first/last/min/max)

@@ -2758,6 +2758,27 @@ def test_tuple_correct_keyerror(self):
with tm.assert_raises_regex(KeyError, "(7, 8)"):
df.groupby((7, 8)).mean()

def test_cython_with_timestamp_and_nat(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you parameterize over them instead

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also parameterize over a timedelta dtyped Series as well

# https://github.com/pandas-dev/pandas/issues/19526
ts = pd.Timestamp('2016-10-14 21:00:44.557')
df = pd.DataFrame({'a': [0, 1], 'b': [ts, pd.NaT]})
index = pd.Int64Index([0, 1], dtype='int64', name='a')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can write the expected like
expected = DataFrame({'b': [ts, NaT]}, index=Index([0, 1, name='a'))

note we don't use pd. anywhere (the imports are at the top)

# We will group by a and test the cython aggregations
expected = pd.DataFrame({'b': [ts, pd.NaT]}, index=index)

result = df.groupby('a').max()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this test needs to be move to pandas/groupby/aggregate/test_cython.py

@@ -2324,7 +2324,7 @@ def _cython_operation(self, kind, values, how, axis, min_count=-1):
result = self._transform(
result, values, labels, func, is_numeric, is_datetimelike)

if is_integer_dtype(result):
if is_integer_dtype(result) and not is_datetimelike:
mask = result == iNaT
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can then remove this masking bizness (inside the is_integer_dtype)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this might also fix #16674, can you add a test and see for that (and if so add to the whatsnew).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't look like I can (easily) do this. If resample introduces missing values to an integer series, those get recorded as iNaT. Since the user is expecting nan for missing values, we have to recast the series to float and explicitly convert the iNaT values.

#16674 is about the case where iNaT was in the input dataframe as a legitimate integer value. At this point in the code, there is no way to disambiguate between that and true missing values.

@jreback jreback added Datetime Datetime data dtype Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Feb 4, 2018
@jbandlow jbandlow force-pushed the timestamp_float_conversion branch from 8071bbb to af37225 Compare February 6, 2018 15:22
@jbandlow jbandlow force-pushed the timestamp_float_conversion branch from a44916a to 2fb23d6 Compare February 6, 2018 17:58
@jreback jreback added this to the 0.23.0 milestone Feb 6, 2018
@jreback jreback closed this in 983d71f Feb 6, 2018
@jreback
Copy link
Contributor

jreback commented Feb 6, 2018

thanks!

keep em coming!

@jbandlow
Copy link
Contributor Author

jbandlow commented Feb 6, 2018

thanks! this was a good experience for a first commit.

harisbal pushed a commit to harisbal/pandas that referenced this pull request Feb 28, 2018
closes pandas-dev#19526

Author: Jason Bandlow <[email protected]>

Closes pandas-dev#19530 from jbandlow/timestamp_float_conversion and squashes the following commits:

2fb23d6 [Jason Bandlow] merge
af37225 [Jason Bandlow] BUG: Fix ts precision issue with groupby and NaT (pandas-dev#19526)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Datetime Datetime data dtype Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Rounding errors with Timestamps and .last()
2 participants