Inconsistent processing of NaT values by GroupBy #29036

Closed
sergiykhan opened this issue Oct 16, 2019 · 9 comments
Labels
good first issue Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Needs Tests Unit test(s) needed to prevent regressions

Comments

@sergiykhan

sergiykhan commented Oct 16, 2019

import pandas as pd

df = pd.DataFrame({'number': [1, 2, 3], 'time': [pd.NaT, pd.NaT, pd.NaT]})
interval_index = pd.interval_range(start=pd.to_timedelta('10 days'), end=pd.to_timedelta('15 days'), freq='1h')
df.groupby(pd.cut(df['time'], bins=interval_index))['number'].count()

The example above results in NaT values being counted in the (12 days 11:00:00, 12 days 12:00:00] range. The expected result is 0.

The behavior is also inconsistent. Tweaks to the interval_index or the groupby operation result in the expected output. For example:

end=pd.to_timedelta('14 days')

or

df.groupby( pd.cut(df['time'], bins=interval_index) )['time'].count()

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit : None
python : 3.8.2.final.0
python-bits : 64
OS : Linux
OS-release : 4.4.0-18362-Microsoft
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.4
numpy : 1.18.4
pytz : 2020.1
dateutil : 2.8.1
pip : 20.0.2
setuptools : 44.0.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.15.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.2.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None

@jbrockmendel jbrockmendel added Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Oct 16, 2019
@sergiykhan
Author

Still an issue in Pandas 1.0.4 and Python 3.8.2. Updated the output of pd.show_versions().

@jreback
Contributor

jreback commented Jun 1, 2020

@sergiykhan the best way to get something fixed is to open a PR.

@mroeschke
Member

I think this looks correct on master. Could use a test

In [8]: df = pd.DataFrame({'number': [1, 2, 3], 'time': [pd.NaT, pd.NaT, pd.NaT]})
   ...: interval_index = pd.interval_range(start=pd.to_timedelta('10 days'), end=pd.to_timedelta('15 days'), freq='1h')
   ...: df.groupby(pd.cut(df['time'], bins=interval_index))['number'].count().value_counts()
Out[8]:
0    120
Name: number, dtype: int64

@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions labels Jul 21, 2021
@sergiykhan
Author

I still see the issue in pandas 1.3.0, but not in the master branch.

It has been a while, so I had to take a closer look at my own bug report. It appears that the problem is coming from pd.cut(), not from groupby() as I indicated.

I am curious as to what has been fixed in the master branch.

@sergiykhan
Author

This appears to have been fixed in 1.4.0.

@YousraMashkoor

Have the new test cases been added, or is there still something left to do for this issue?

@arch1baald
Contributor

As @sergiykhan mentioned above, the problem was caused not by .groupby() but by .cut().

The bug was fixed in #23980.

So, for input

df = pd.DataFrame({'number': [1, 2, 3], 'time': [pd.NaT, pd.NaT, pd.NaT]})
interval_index = pd.interval_range( start=pd.to_timedelta('10 days'), end=pd.to_timedelta('15 days'), freq='1h' )
df.groupby( pd.cut(df['time'], bins=interval_index) )['number'].count()

You will receive ValueError: Overlapping IntervalIndex is not accepted.

Also, there is a test for this case.

@mroeschke
Member

mroeschke commented Jun 10, 2022

Great, since the root issue is fixed with a test, it looks like we can close this.

@sergiykhan
Author

Just to clarify: my example did not contain an overlapping IntervalIndex. The referenced fix #23980 appears to be unrelated.

In both Pandas 1.3.0 (bug present) and 1.4.2 (bug fixed since 1.4.0), pd.interval_range produces intervals closed on the right (which is default) and thus not overlapping.

pd.interval_range(
    start=pd.to_timedelta('10 days'),
    end=pd.to_timedelta('15 days'),
    freq='1h',
).is_overlapping

False

It is only in the master branch that the behavior has changed to produce intervals closed on both sides.
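
A minimal sketch of the difference, assuming a released pandas where interval_range accepts closed: passing closed='both' explicitly imitates the master behavior described above (that choice is an assumption made for illustration). Adjacent intervals then share their closed endpoints, so the index counts as overlapping and pd.cut() rejects it.

```python
import pandas as pd

start = pd.to_timedelta('10 days')
end = pd.to_timedelta('15 days')

# Default: intervals closed on the right, as in the original report.
right_closed = pd.interval_range(start=start, end=end, freq='1h')
# Closed on both sides, passed explicitly to imitate the behavior on master.
both_closed = pd.interval_range(start=start, end=end, freq='1h', closed='both')

print(right_closed.closed, right_closed.is_overlapping)
print(both_closed.closed, both_closed.is_overlapping)

times = pd.Series([pd.NaT, pd.NaT, pd.NaT])
try:
    pd.cut(times, bins=both_closed)
    cut_error = None
except ValueError as exc:
    # pd.cut refuses IntervalIndex bins that overlap.
    cut_error = str(exc)
print(cut_error)
```

This would explain why the same input can raise ValueError on master while running without error on 1.3.0 and 1.4.2.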


6 participants