
pd.groupby(pd.TimeGrouper()) mishandles null values in dates #17575


Closed
fujiaxiang opened this issue Sep 18, 2017 · 7 comments · Fixed by #17624
Labels
Bug Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Milestone

Comments

@fujiaxiang
Member

fujiaxiang commented Sep 18, 2017

Code Sample, a copy-pastable example if possible

The code below has been updated following the comments.

import pandas as pd
import random
from random import randint

random.seed(2)
data = [['2010-01-06', randint(1, 9)],
        ['2010-08-26', randint(1, 9)],
        ['2010-09-06', randint(1, 9)],
        ['2010-09-16', 10],
        ['2010-09-20', 10],
        ['2010-09-23', 10],
        ['2010-09-24', randint(1, 9)],
        ['2010-09-20', randint(1, 9)]]

# Day numbers run up to 32, so some generated dates are invalid
# and become NaT after to_datetime(..., errors='coerce') below.
for m in range(1270):
    data.append(['2010-' + str(randint(10, 12)).zfill(2) + '-' + str(randint(1, 32)).zfill(2),
                 randint(1, 121111)])

df = pd.DataFrame(data)
df.columns = ['date', 'n']
df['date'] = pd.to_datetime(df['date'], errors='coerce')
df_r = df[df['date'].notnull()]

g1 = df.groupby(pd.TimeGrouper(key='date', freq='M'))['n'].nunique()
g2 = df_r.groupby(pd.TimeGrouper(key='date', freq='M'))['n'].nunique()
# This should print 'True' but it prints 'False'
print((g1==g2).mean() == 1)
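For context on where the nulls come from (this note is an addition, not part of the original report): the generated day numbers can go up to 32, and `errors='coerce'` silently turns impossible dates into `NaT` rather than raising. A minimal illustration:

```python
import pandas as pd

# '2010-10-32' is not a real date; errors='coerce' turns it into NaT
# instead of raising, which is how the random data above acquires nulls.
s = pd.to_datetime(pd.Series(['2010-10-31', '2010-10-32']), errors='coerce')
print(s.isna().tolist())  # [False, True]
```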

Problem description

When a column is used as the key for a TimeGrouper, null values should be ignored. This is indeed the case when the dataset is small. However, the code above demonstrates that with a larger dataset, the grouper sometimes distributes null values into legitimate date bins. Worst of all, in one case it inserted a value into a row and shifted the entire time series downwards. When I compared the two grouped series, it made one appear to lead the other by one month, causing a significant waste of resources as I was developing a financial model on top of large datasets.

Updated comments after further investigation:
This same piece of code behaves differently across versions, although none of them, including the latest 0.20.3, produces correct results.

Expected Output

True

Output of pd.show_versions()

INSTALLED VERSIONS
(this output has also been updated)

commit: None
python: 2.7.13.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 63 Stepping 0, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.20.1
pytest: 3.0.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 5.3.0
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.2.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.3
bs4: 4.6.0
html5lib: 0.999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

@gfyoung
Member

gfyoung commented Sep 18, 2017

@fujiaxiang : Thanks for reporting this! Unfortunately, we can't run your example because df_r is not defined in your code. Could you please fix that?

@udaykumarbhanu

udaykumarbhanu commented Sep 19, 2017

```python
# Your code here
import pandas as pd
import random

data = []
for m in range(12 * 10000):
    M = m % 12 + 1
    data += [['2017-' + str(M) + '-' + str(m % 28 + 1) if m % 28 != 0 else 'xx',
              random.randint(1, 121111)]]

df = pd.DataFrame(data)
df.columns = ['date', 'n']
df['date'] = pd.to_datetime(df['date'], errors='coerce')

g1 = df.groupby(pd.TimeGrouper(key='date', freq='M'))['n'].nunique()
g2 = df.groupby(pd.TimeGrouper(key='date', freq='M'))['n'].nunique()

print((g1 == g2).mean() == 1)
```

Output:
True

I get True by changing your g2 to the line below (grouping df itself rather than df_r):

g2 = df.groupby(pd.TimeGrouper(key='date', freq='M'))['n'].nunique()

@jreback
Contributor

jreback commented Sep 19, 2017

This is not reproducible. Please set a fixed random seed, or construct the frame in a non-random manner.

@jreback
Contributor

jreback commented Sep 19, 2017

You are also showing an older version of pandas (there was an earlier bug related to this that was fixed after 0.19.2).

@fujiaxiang
Member Author

Hi,
Very sorry for the late reply, and for the error in my previous code. It took me quite a while to reproduce the issue as well, since I can't use the original dataset in this comment.

The code should be:

import pandas as pd
import random
from random import randint

random.seed(2)
data = [['2010-01-06', randint(1, 9)],
        ['2010-08-26', randint(1, 9)],
        ['2010-09-06', randint(1, 9)],
        ['2010-09-16', 10],
        ['2010-09-20', 10],
        ['2010-09-23', 10],
        ['2010-09-24', randint(1, 9)],
        ['2010-09-20', randint(1, 9)]]

# Day numbers run up to 32, so some generated dates are invalid
# and become NaT after to_datetime(..., errors='coerce') below.
for m in range(1270):
    data.append(['2010-' + str(randint(10, 12)).zfill(2) + '-' + str(randint(1, 32)).zfill(2),
                 randint(1, 121111)])

df = pd.DataFrame(data)
df.columns = ['date', 'n']
df['date'] = pd.to_datetime(df['date'], errors='coerce')
df_r = df[df['date'].notnull()]

g1 = df.groupby(pd.TimeGrouper(key='date', freq='M'))['n'].nunique()
g2 = df_r.groupby(pd.TimeGrouper(key='date', freq='M'))['n'].nunique()
# This should print 'True' but it prints 'False'
print((g1==g2).mean() == 1)

Output of pd.show_versions()

Note: newer pandas versions were also tested on a cloud PC and the errors persist.
Below is what I use on my local machine.

INSTALLED VERSIONS

(identical to the pd.show_versions() output in the original post above)

I do note that this same piece of code behaves differently on different versions. On some versions it produces a ValueError: shape mismatch, which does not make sense either.

Hope the above clarifies!

@fujiaxiang
Member Author

Attached the screenshots of my outputs.
You can see how the numbers in g1 are shifted up.
[screenshots: code, g1, g2]

@jreback
Contributor

jreback commented Sep 19, 2017

Yeah, it looks like we have an invalid comparison somewhere.

These are the 'same' operations (though via slightly different impl paths).

In [9]: df.resample('M', on='date').nunique()
[16:28:10.847 WARNING] /home/jreback/pandas-dev/pandas/core/groupby.py:3167: FutureWarning: In the future, NAT != NAT will be True rather than False.
  inc = np.r_[1, val[1:] != val[:-1]]

Out[9]: 
            date    n
date                 
2010-01-31     1    1
2010-02-28     0    0
2010-03-31     0    0
2010-04-30     0    0
2010-05-31     0    0
...          ...  ...
2010-08-31     5    4
2010-09-30    31  402
2010-10-31    30  397
2010-11-30    31  417
2010-12-31     1    1

[12 rows x 2 columns]

In [10]: df_r.resample('M', on='date').nunique()
Out[10]: 
            date    n
date                 
2010-01-31     1    1
2010-02-28     0    0
2010-03-31     0    0
2010-04-30     0    0
2010-05-31     0    0
...          ...  ...
2010-08-31     1    1
2010-09-30     5    4
2010-10-31    31  402
2010-11-30    30  397
2010-12-31    31  417

[12 rows x 2 columns]

If you can have a look, that would be appreciated.
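To make the warning above concrete (this sketch is an addition, not part of the original thread): the quoted line from groupby.py flags a new unique value wherever adjacent sorted elements differ. On numpy versions where `NaT != NaT` evaluates to True, every NaT row gets flagged as a "new" value instead of being dropped, which is consistent with the inflated counts shown above:

```python
import numpy as np

# Sorted values ending in NaT rows, as produced by errors='coerce'.
val = np.array(['2010-09-16', 'NaT', 'NaT'], dtype='datetime64[ns]')

# The idiom quoted from groupby.py: flag a new unique value wherever
# adjacent sorted elements differ. Since NaT != NaT is True here, both
# NaT rows are flagged, inflating the count to 3 for one real date.
inc = np.r_[1, val[1:] != val[:-1]]
print(inc.sum())  # 3, though there is only one real date
```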

@jreback jreback added Bug Difficulty Intermediate Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate and removed Can't Repro labels Sep 19, 2017
@jreback jreback added this to the Next Major Release milestone Sep 19, 2017
@jreback jreback modified the milestones: Next Major Release, 0.21.0 Sep 22, 2017