
pd.groupby(pd.TimeGrouper()) mishandles null values in dates #17575


Closed
fujiaxiang opened this issue Sep 18, 2017 · 7 comments · Fixed by #17624
Labels
Bug Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Milestone

Comments

@fujiaxiang
Member

fujiaxiang commented Sep 18, 2017

Code Sample, a copy-pastable example if possible

The code below has been updated following the comments.

import pandas as pd
import random
from random import randint

random.seed(2)
data = [['2010-01-06', randint(1, 9)],
        ['2010-08-26', randint(1, 9)],
        ['2010-09-06', randint(1, 9)],
        ['2010-09-16', 10],
        ['2010-09-20', 10],
        ['2010-09-23', 10],
        ['2010-09-24', randint(1, 9)],
        ['2010-09-20', randint(1, 9)]]

# Day numbers run up to 32, so some generated dates are invalid
# and become NaT after to_datetime(..., errors='coerce') below.
for m in range(1270):
    data.append(['2010-' + str(randint(10, 12)).zfill(2) + '-' + str(randint(1, 32)).zfill(2),
                 randint(1, 121111)])

df = pd.DataFrame(data)
df.columns = ['date', 'n']
df['date'] = pd.to_datetime(df['date'], errors='coerce')
df_r = df[df['date'].notnull()]

g1 = df.groupby(pd.TimeGrouper(key='date', freq='M'))['n'].nunique()
g2 = df_r.groupby(pd.TimeGrouper(key='date', freq='M'))['n'].nunique()
# This should print 'True' but it prints 'False'
print((g1==g2).mean() == 1)
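For context on where the nulls come from (this note is an addition, not part of the original report): the generated day numbers can go up to 32, and `errors='coerce'` silently turns impossible dates into `NaT` rather than raising. A minimal illustration:

```python
import pandas as pd

# '2010-10-32' is not a real date; errors='coerce' turns it into NaT
# instead of raising, which is how the random data above acquires nulls.
s = pd.to_datetime(pd.Series(['2010-10-31', '2010-10-32']), errors='coerce')
print(s.isna().tolist())  # [False, True]
```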

Problem description

When a column is used as the key for a TimeGrouper, null values should be ignored. This is indeed the case when the dataset is small. However, the code above demonstrates that with a larger dataset, the grouper sometimes distributes null values into legitimate date bins. Worst of all, in one case it inserted a value into a row and shifted the entire time series downwards. When I compared the two grouped series, it made one appear to lead the other by one month, causing a significant waste of resources as I was developing a financial model on top of large datasets.

Updated comments after further investigation:
This same piece of code behaves differently across versions, although none of them, including the latest 0.20.3, produces correct results.

Expected Output

True

Output of pd.show_versions()

INSTALLED VERSIONS
(this output has also been updated)

commit: None
python: 2.7.13.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 63 Stepping 0, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.20.1
pytest: 3.0.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 5.3.0
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.2.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.3
bs4: 4.6.0
html5lib: 0.999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

@gfyoung
Member

gfyoung commented Sep 18, 2017

@fujiaxiang : Thanks for reporting this! Unfortunately, we can't run your example because df_r is not defined in your code. Could you please fix that?

@udaykumarbhanu

udaykumarbhanu commented Sep 19, 2017

```python
# Your code here
import pandas as pd
import random

data = []
for m in range(12 * 10000):
    M = m % 12 + 1
    data += [['2017-' + str(M) + '-' + str(m % 28 + 1) if m % 28 != 0 else 'xx',
              random.randint(1, 121111)]]

df = pd.DataFrame(data)
df.columns = ['date', 'n']
df['date'] = pd.to_datetime(df['date'], errors='coerce')

g1 = df.groupby(pd.TimeGrouper(key='date', freq='M'))['n'].nunique()
g2 = df.groupby(pd.TimeGrouper(key='date', freq='M'))['n'].nunique()

print((g1 == g2).mean() == 1)
```

Output:
True

I get True by changing your g2 to the line below (grouping df itself rather than df_r):

g2 = df.groupby(pd.TimeGrouper(key='date', freq='M'))['n'].nunique()

@jreback
Contributor

jreback commented Sep 19, 2017

This is not reproducible. Please set a fixed random seed, or construct the frame in a non-random manner.

@jreback
Contributor

jreback commented Sep 19, 2017

You are also showing an older version of pandas (there was an earlier bug related to this that was fixed after 0.19.2).

@fujiaxiang
Member Author

Hi,
Very sorry for the late reply, and for the error in my previous code. It took me quite a while to reproduce the issue as well, since I can't use the original dataset in this comment.

The code should be:

import pandas as pd
import random
from random import randint

random.seed(2)
data = [['2010-01-06', randint(1, 9)],
        ['2010-08-26', randint(1, 9)],
        ['2010-09-06', randint(1, 9)],
        ['2010-09-16', 10],
        ['2010-09-20', 10],
        ['2010-09-23', 10],
        ['2010-09-24', randint(1, 9)],
        ['2010-09-20', randint(1, 9)]]

# Day numbers run up to 32, so some generated dates are invalid
# and become NaT after to_datetime(..., errors='coerce') below.
for m in range(1270):
    data.append(['2010-' + str(randint(10, 12)).zfill(2) + '-' + str(randint(1, 32)).zfill(2),
                 randint(1, 121111)])

df = pd.DataFrame(data)
df.columns = ['date', 'n']
df['date'] = pd.to_datetime(df['date'], errors='coerce')
df_r = df[df['date'].notnull()]

g1 = df.groupby(pd.TimeGrouper(key='date', freq='M'))['n'].nunique()
g2 = df_r.groupby(pd.TimeGrouper(key='date', freq='M'))['n'].nunique()
# This should print 'True' but it prints 'False'
print((g1==g2).mean() == 1)

Output of pd.show_versions()

Note: newer pandas versions were also tested on a cloud PC and the errors persist.
Below is what I use on my local machine.

INSTALLED VERSIONS

(identical to the pd.show_versions() output in the original post above)

I do note that this same piece of code behaves differently on different versions. On some versions it produces a ValueError: shape mismatch, which does not make sense either.

Hope the above clarifies!

@fujiaxiang
Member Author

Attached the screenshots of my outputs.
You can see how the numbers in g1 are shifted up.
[screenshots: code, g1, g2]

@jreback
Contributor

jreback commented Sep 19, 2017

Yeah, it looks like we have an invalid comparison somewhere.

These are the 'same' operations (though via slightly different impl paths).

In [9]: df.resample('M', on='date').nunique()
[16:28:10.847 WARNING] /home/jreback/pandas-dev/pandas/core/groupby.py:3167: FutureWarning: In the future, NAT != NAT will be True rather than False.
  inc = np.r_[1, val[1:] != val[:-1]]

Out[9]: 
            date    n
date                 
2010-01-31     1    1
2010-02-28     0    0
2010-03-31     0    0
2010-04-30     0    0
2010-05-31     0    0
...          ...  ...
2010-08-31     5    4
2010-09-30    31  402
2010-10-31    30  397
2010-11-30    31  417
2010-12-31     1    1

[12 rows x 2 columns]

In [10]: df_r.resample('M', on='date').nunique()
Out[10]: 
            date    n
date                 
2010-01-31     1    1
2010-02-28     0    0
2010-03-31     0    0
2010-04-30     0    0
2010-05-31     0    0
...          ...  ...
2010-08-31     1    1
2010-09-30     5    4
2010-10-31    31  402
2010-11-30    30  397
2010-12-31    31  417

[12 rows x 2 columns]

If you can have a look, that would be appreciated.
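To make the warning above concrete (this sketch is an addition, not part of the original thread): the quoted line from groupby.py flags a new unique value wherever adjacent sorted elements differ. On numpy versions where `NaT != NaT` evaluates to True, every NaT row gets flagged as a "new" value instead of being dropped, which is consistent with the inflated counts shown above:

```python
import numpy as np

# Sorted values ending in NaT rows, as produced by errors='coerce'.
val = np.array(['2010-09-16', 'NaT', 'NaT'], dtype='datetime64[ns]')

# The idiom quoted from groupby.py: flag a new unique value wherever
# adjacent sorted elements differ. Since NaT != NaT is True here, both
# NaT rows are flagged, inflating the count to 3 for one real date.
inc = np.r_[1, val[1:] != val[:-1]]
print(inc.sum())  # 3, though there is only one real date
```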

@jreback jreback added Bug Difficulty Intermediate Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate and removed Can't Repro labels Sep 19, 2017
@jreback jreback added this to the Next Major Release milestone Sep 19, 2017
@jreback jreback modified the milestones: Next Major Release, 0.21.0 Sep 22, 2017