Inconsistent processing of NaT values by GroupBy #29036

sergiykhan · 2019-10-16T17:15:04Z

import pandas as pd

df = pd.DataFrame({'number': [1, 2, 3], 'time': [pd.NaT, pd.NaT, pd.NaT]})
interval_index = pd.interval_range(start=pd.to_timedelta('10 days'), end=pd.to_timedelta('15 days'), freq='1h')
df.groupby(pd.cut(df['time'], bins=interval_index))['number'].count()

The example above results in the NaT values being counted in the (12 days 11:00:00, 12 days 12:00:00] interval. The expected count is 0 for every interval, since NaT should not fall into any bin.

The behavior is also inconsistent. Tweaks to the interval_index or the groupby operation result in the expected output. For example:

end=pd.to_timedelta('14 days')

or

df.groupby(pd.cut(df['time'], bins=interval_index))['time'].count()

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit : None
python : 3.8.2.final.0
python-bits : 64
OS : Linux
OS-release : 4.4.0-18362-Microsoft
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.4
numpy : 1.18.4
pytz : 2020.1
dateutil : 2.8.1
pip : 20.0.2
setuptools : 44.0.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.15.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.2.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None


sergiykhan · 2020-06-01T01:51:25Z

Still an issue in Pandas 1.0.4 and Python 3.8.2. Updated the output of pd.show_versions().

jreback · 2020-06-01T02:06:40Z

@sergiykhan best way to have something fixed is a PR

mroeschke · 2021-07-21T06:10:22Z

I think this looks correct on master. Could use a test

In [8]: df = pd.DataFrame({'number': [1, 2, 3], 'time': [pd.NaT, pd.NaT, pd.NaT]})
   ...: interval_index = pd.interval_range(start=pd.to_timedelta('10 days'), end=pd.to_timedelta('15 days'), freq='1h')
   ...: df.groupby(pd.cut(df['time'], bins=interval_index))['number'].count().value_counts()
Out[8]:
0    120
Name: number, dtype: int64

sergiykhan · 2021-07-21T14:41:00Z

I still see the issue in pandas 1.3.0, but not in the master branch.

It has been a while, so I had to take a closer look at my own bug report. It appears that the problem is coming from pd.cut(), not from groupby() as I indicated.

I am curious as to what has been fixed in the master branch.

sergiykhan · 2022-02-03T20:22:06Z

This appears to have been fixed in 1.4.0.

YousraMashkoor · 2022-02-11T22:10:34Z

Have the new test cases been added, or is there still something left to do for this issue?

arch1baald · 2022-06-10T20:50:53Z

As @sergiykhan mentioned above, the problem was caused not by .groupby() but by .cut().

The bug was fixed in #23980.

So, for input

df = pd.DataFrame({'number': [1, 2, 3], 'time': [pd.NaT, pd.NaT, pd.NaT]})
interval_index = pd.interval_range(start=pd.to_timedelta('10 days'), end=pd.to_timedelta('15 days'), freq='1h')
df.groupby(pd.cut(df['time'], bins=interval_index))['number'].count()

You will receive ValueError: Overlapping IntervalIndex is not accepted.

Also, there is a test for this case.

mroeschke · 2022-06-10T20:52:32Z

Great. Since the root issue is fixed with a test, it looks like we can close this.

sergiykhan · 2022-06-11T01:16:22Z

Just to clarify: my example did not contain an overlapping IntervalIndex. The referenced fix #23980 appears to be unrelated.

In both Pandas 1.3.0 (bug present) and 1.4.2 (bug fixed since 1.4.0), pd.interval_range produces intervals closed on the right (which is default) and thus not overlapping.

pd.interval_range(
    start=pd.to_timedelta('10 days'),
    end=pd.to_timedelta('15 days'),
    freq='1h',
).is_overlapping

False

It is only in the master branch that the behavior has changed to produce intervals closed on both sides.
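
The difference between the two closure modes can be illustrated with a short sketch. It assumes a released pandas version in which pd.interval_range defaults to right-closed intervals, and it builds the both-closed variant explicitly with pd.IntervalIndex.from_arrays instead of relying on any changed default; the names `right_closed` and `both_closed` are illustrative.

```python
import pandas as pd

# Default construction, as in the original example.
right_closed = pd.interval_range(
    start=pd.to_timedelta("10 days"),
    end=pd.to_timedelta("15 days"),
    freq="1h",
)

# Same endpoints, but with each interval closed on both sides.
both_closed = pd.IntervalIndex.from_arrays(
    right_closed.left, right_closed.right, closed="both"
)

print(right_closed.closed, right_closed.is_overlapping)
print(both_closed.closed, both_closed.is_overlapping)
```

Intervals closed on both sides share their endpoints with their neighbours, so they count as overlapping, which is the condition pd.cut() rejects with the ValueError quoted earlier in the thread.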

jbrockmendel added the Groupby and Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate) labels on Oct 16, 2019

mroeschke added the good first issue and Needs Tests (Unit test(s) needed to prevent regressions) labels on Jul 21, 2021

mroeschke closed this as completed Jun 10, 2022
