Resampler.nunique counting data more than once #13453
May be related to #10914.
Interestingly everything seems to work fine if
CC: @behzadnouri
I think the root cause of the problem is in groupby.nunique(), which I believe is eventually called by resample.nunique(). Note that groupby.nunique() has the same bug:

```python
import pandas as pd
from pandas import Timestamp

data = ['1', '2', '3']
time = [Timestamp('2016-06-28 09:35:35'),
        Timestamp('2016-06-28 16:09:30'),
        Timestamp('2016-06-28 16:46:28')]
test = pd.DataFrame({'time': time, 'data': data})

# wrong counts
print(test.set_index('time').groupby(pd.TimeGrouper(freq='h'))['data'].nunique(), "\n")

# correct counts
print(test.set_index('time').groupby(pd.TimeGrouper(freq='h'))['data'].apply(pd.Series.nunique))
```

This gives the wrong counts from nunique() and the correct counts from apply(pd.Series.nunique).
I believe the problem is in the second-to-last line of groupby.nunique(), `res[ids] = out`.
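A toy NumPy illustration of the suspected mismatch (made-up arrays mirroring the three-row example above, not the actual pandas internals): `ids` carries one group code per sorted row, while `out` carries one unique-count per observed group, so indexing `res` with `ids` pairs arrays of different lengths.

```python
import numpy as np

# Hypothetical arrays for the three-row hourly example above (not pandas source).
ids = np.array([0, 7, 7])         # hourly-bin code for each row (09:00 -> 0, 16:00 -> 7)
out = np.array([1, 2])            # nunique for each *observed* bin (09:00 and 16:00)
res = np.zeros(8, dtype="int64")  # one slot per resample bin, 09:00 through 16:00

print(res[ids].shape, out.shape)  # (3,) vs (2,): a per-row index against per-group values
# res[ids] = out                  # cannot line up one value per bin; see the fix below
```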
@mgalbright why don't you submit a pull-request with your test examples (and those from the issue), and the proposed fix. See if that breaks anything else. Would be greatly appreciated!
Hey, is there any advancement on this? I just realized that a report that I've been building is giving the wrong results and I believe it's due to this. I can't share all the code, but here's a comparison of unique(), nunique(), and count():

```python
In [216]: ents.groupby(pd.Grouper(freq='1W-SAT', key='startdate'))['ent_id'].unique().tail(1)
Out[216]:
startdate
2016-11-12    [550A00000033DHUIA2]
Freq: W-SAT, Name: ent_id, dtype: object

In [217]: ents.groupby(pd.Grouper(freq='1W-SAT', key='startdate'))['ent_id'].nunique().tail(1)
Out[217]:
startdate
2016-11-12    7
Freq: W-SAT, Name: ent_id, dtype: int64

In [218]: ents.groupby(pd.Grouper(freq='1W-SAT', key='startdate'))['ent_id'].count().tail(1)
Out[221]:
startdate
2016-11-12    1
Freq: W-SAT, Name: ent_id, dtype: int64
```
@aiguofer pull-requests are welcome to fix.
Not really adding anything, but I just ran into this issue for a work report as well (pandas version 0.19.2). Passing to `.agg(pd.Series.nunique)` works great - thanks for the tip
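For reference, a self-contained sketch of that workaround on the small frame from the earlier comment, using `pd.Grouper` in place of the older `pd.TimeGrouper` spelling (a sketch, not the commenter's actual report code):

```python
import pandas as pd
from pandas import Timestamp

# The small frame from the earlier comment in this thread.
data = ['1', '2', '3']
time = [Timestamp('2016-06-28 09:35:35'),
        Timestamp('2016-06-28 16:09:30'),
        Timestamp('2016-06-28 16:46:28')]
test = pd.DataFrame({'time': time, 'data': data})

# Workaround: aggregate with pd.Series.nunique instead of the groupby
# nunique() fast path.
counts = (test.set_index('time')
              .groupby(pd.Grouper(freq='h'))['data']
              .agg(pd.Series.nunique))
print(counts)
```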
We only need to use the group boundaries as the index for `res` so that the dimensions match those of `out`. Fixes pandas-dev#13453
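A toy sketch of that indexing idea, reusing the made-up arrays from the earlier illustration rather than the actual pandas source: `ids[idx]` picks one group code per observed group, so it has the same length as `out` and each count lands in its own bin.

```python
import numpy as np

# Same hypothetical arrays as the earlier toy example (not pandas source).
ids = np.array([0, 7, 7])         # group code for each sorted row
idx = np.array([0, 1])            # row position where each observed group starts
out = np.array([1, 2])            # nunique per observed group
res = np.zeros(8, dtype="int64")  # one slot per resample bin

res[ids[idx]] = out               # index res with the group boundaries
print(res)                        # [1 0 0 0 0 0 0 2]
```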
Took a look at @mgalbright's comment and suggestion and, if I'm understanding the code correctly, the above PR should fix it.
closes pandas-dev#13453
Author: Diego Fernandez <[email protected]>
Closes pandas-dev#15418 from aiguofer/gh_13453 and squashes the following commits:
c53bd70 [Diego Fernandez] Add test for pandas-dev#13453 in test_resample and add note to whatsnew
0daab80 [Diego Fernandez] Ensure the right values are set in SeriesGroupBy.nunique
xref additional example in #13795
Pandas `Resampler.nunique` appears to be putting the same data in multiple bins. In pandas 0.18.1 and 0.18.0 these don't give the same results, when they should. In pandas 0.17.0 and 0.17.1 (adjusting to the old-style resample syntax), the `nunique` one fails due to a "ValueError: Wrong number of items passed 4, placement implies 5" somewhere in the depths of `internals.py`. If I go back to 0.16.2, I do get the same result for each. I'm not sure what's going on here. Since the `nunique` results sum to larger than the length, it appears data is being counted more than once.