pd.groupby(pd.TimeGrouper()) mishandles null values in dates #17575
Comments
@fujiaxiang: Thanks for reporting this! Unfortunately, we can't replicate your code as posted (`data` is empty and `g2` is never defined):

```python
# Your code here
data = []
df = pd.DataFrame(data)
g1 = df.groupby(pd.TimeGrouper(key='date', freq='M'))['n'].nunique()
print((g1==g2).mean() == 1)
```
This is not reproducible. Please set a fixed random seed, or construct the frame in a non-random manner.
You are also showing an older version of pandas (there was an older bug related to this that was fixed after 0.19.2).
Hi, the code should be:

```python
import pandas as pd
import random
from random import randint

random.seed(2)
data = [['2010-01-06', randint(1, 9)],
        ['2010-08-26', randint(1, 9)],
        ['2010-09-06', randint(1, 9)],
        ['2010-09-16', 10],
        ['2010-09-20', 10],
        ['2010-09-23', 10],
        ['2010-09-24', randint(1, 9)],
        ['2010-09-20', randint(1, 9)]]

# Days up to 32 are generated on purpose: invalid dates such as '2010-11-32'
# become NaT after pd.to_datetime(..., errors='coerce') below.
for m in range(1270):
    data.append(['2010' + '-' + str(randint(10, 12)).zfill(2) + '-' + str(randint(1, 32)).zfill(2),
                 randint(1, 121111)])

df = pd.DataFrame(data)
df.columns = ['date', 'n']
df['date'] = pd.to_datetime(df['date'], errors='coerce')

# df_r drops the rows whose dates failed to parse (NaT)
df_r = df[df['date'].notnull()]

g1 = df.groupby(pd.TimeGrouper(key='date', freq='M'))['n'].nunique()
g2 = df_r.groupby(pd.TimeGrouper(key='date', freq='M'))['n'].nunique()

# This should print 'True' but it prints 'False'
print((g1 == g2).mean() == 1)
```
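For anyone debugging this, a rough sketch (assuming the reproduction above has been run, so `pd`, `g1`, and `g2` are in scope) of how to see where the two results diverge rather than only that they differ:

```python
# Put the two grouped results side by side; with correct NaT handling
# this frame of mismatches should be empty.
cmp = pd.concat({'with_nat': g1, 'without_nat': g2}, axis=1)
print(cmp[cmp['with_nat'] != cmp['without_nat']])

# Also check whether the two results even cover the same month-end bins.
print(g1.index.equals(g2.index))
```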
Yeah, it looks like we have an invalid comparison somewhere. These are the 'same' operations (though they take slightly different implementation paths).
If you can have a look, it would be appreciated.
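For reference, a minimal deterministic sketch (not taken from the thread) of the two paths that are expected to agree; per the report, a frame this small should still behave correctly. `pd.TimeGrouper` is used as in the report; in later pandas versions `pd.Grouper(key='date', freq='M')` is the equivalent spelling:

```python
import pandas as pd

# A tiny frame with one unparseable date, which becomes NaT.
df = pd.DataFrame({'date': pd.to_datetime(['2010-09-01', '2010-09-15', 'not a date'],
                                           errors='coerce'),
                   'n': [1, 2, 3]})

# Path 1: group directly; the grouper is supposed to drop the NaT row.
a = df.groupby(pd.TimeGrouper(key='date', freq='M'))['n'].nunique()

# Path 2: drop the NaT row first, then group.
b = df[df['date'].notnull()].groupby(pd.TimeGrouper(key='date', freq='M'))['n'].nunique()

# The two paths should produce identical results.
print(a.equals(b))
```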
Code Sample, a copy-pastable example if possible
The code has been updated following the comments above.
Problem description
When a column is used in TimeGrouper to group, null values are supposed to be ignored. This is indeed the case when the dataset is small. However, the code above demonstrates that when the dataset is larger, the grouper sometimes distributes rows with null dates into legitimate date bins. Worst of all, on one occasion it inserted a value into a row and shifted the entire time series downwards, so that when I compared two grouped series one appeared to lead the other by one month. This caused a significant waste of resources, as I was developing a financial model based on large datasets.
Update after further investigation:
This same piece of code behaves differently across pandas versions, although none of them, including the latest 0.20.3, produces correct results.
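As a rough diagnostic for the "one series appears to lead the other by a month" symptom described above (a sketch, assuming `g1` and `g2` from the code sample are in scope):

```python
# Compare g1 against g2 shifted by one monthly bin in each direction;
# a fraction close to 1 would suggest the whole series has been offset
# by one bin rather than differing in a few isolated months.
print((g1 == g2.shift(1)).mean())
print((g1 == g2.shift(-1)).mean())
```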
Expected Output
True
Output of pd.show_versions() (this is also updated)

INSTALLED VERSIONS
commit: None
python: 2.7.13.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 63 Stepping 0, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.20.1
pytest: 3.0.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 5.3.0
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.2.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.3
bs4: 4.6.0
html5lib: 0.999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None