Resampler.nunique counting data more than once #13453


Closed
jcrist opened this issue Jun 15, 2016 · 9 comments
Labels: Bug, Resample (resample method)
Milestone: 0.20.0

jcrist (Contributor) commented Jun 15, 2016

xref: additional example in #13795

Pandas Resampler.nunique appears to be putting the same data in multiple bins:

import pandas as pd

# Create a series with a datetime index
index = pd.date_range('1-1-2000', '2-15-2000', freq='h')
index2 = pd.date_range('4-15-2000', '5-15-2000', freq='h')
index3 = index.append(index2)
s = pd.Series(range(len(index3)), index=index3)

# Since all elements are unique, `count` and `nunique` should give the same result
count = s.resample('M').count()
nunique = s.resample('M').nunique()

In pandas 0.18.1 and 0.18.0 these don't give the same results when they should:

In [3]: count
Out[3]:
2000-01-31    744
2000-02-29    337
2000-03-31      0
2000-04-30    384
2000-05-31    337
Freq: M, dtype: int64

In [4]: nunique
Out[4]:
2000-01-31    337
2000-02-29    744
2000-03-31      0
2000-04-30    744
2000-05-31    337
Freq: M, dtype: int64

In pandas 0.17.0 and 0.17.1 (adjusting to old style resample syntax), the nunique one fails due to a "ValueError: Wrong number of items passed 4, placement implies 5" somewhere in the depths of internals.py. If I go back to 0.16.2, I do get the same result for each.

I'm not sure what's going on here. Since the nunique results sum to more than the length of the series, it appears data is being counted more than once.
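The invariant behind the report can be checked directly: since every value in `s` is distinct, each bin's `nunique` must equal its `count`, and both must sum to `len(s)`. A minimal sketch, reusing the index construction from the snippet above (the `'M'` alias matches the thread's era; newer pandas spells it `'ME'`):

```python
import pandas as pd

# Rebuild the series from the snippet above: two hourly date ranges with a
# gap in March, so every value is distinct.
index = pd.date_range('1-1-2000', '2-15-2000', freq='h')
index2 = pd.date_range('4-15-2000', '5-15-2000', freq='h')
index3 = index.append(index2)
s = pd.Series(range(len(index3)), index=index3)

count = s.resample('M').count()
nunique = s.resample('M').nunique()

# Both checks fail on 0.18.x; they hold once the bug is fixed.
assert count.sum() == len(s)
assert nunique.sum() == len(s)
assert count.equals(nunique)
```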


In [19]: pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 15.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1
nose: None
pip: 8.1.2
setuptools: 23.0.0
Cython: None
numpy: 1.10.4
scipy: None
statsmodels: None
xarray: None
IPython: 4.2.0
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: None
pandas_datareader: None
jcrist (Contributor) commented Jun 15, 2016

May be related to #10914.

jcrist (Contributor) commented Jun 15, 2016

Interestingly, everything works fine if agg is used with pd.Series.nunique instead:

In [11]: r = s.resample('M')

In [12]: r.agg(pd.Series.nunique)
Out[12]:
2000-01-31    744
2000-02-29    337
2000-03-31      0
2000-04-30    384
2000-05-31    337
Freq: M, dtype: int64

In [13]: r.nunique()    # same result as r.agg('nunique')
Out[13]:
2000-01-31    337
2000-02-29    744
2000-03-31      0
2000-04-30    744
2000-05-31    337
Freq: M, dtype: int64

sinhrks added the Bug and Resample (resample method) labels on Jun 15, 2016
sinhrks (Member) commented Jun 15, 2016

CC: @behzadnouri

@jreback jreback added this to the Next Major Release milestone Jul 26, 2016
mgalbright commented:
I think the root cause of the problem is in groupby.nunique(), which I believe is eventually called by resample.nunique(). Note that groupby.nunique() shows the same bug:

import pandas as pd
from pandas import Timestamp

data = ['1', '2', '3']
time = [Timestamp('2016-06-28 09:35:35'), Timestamp('2016-06-28 16:09:30'), Timestamp('2016-06-28 16:46:28')]
test = pd.DataFrame({'time': time, 'data': data})

# wrong counts
print test.set_index('time').groupby(pd.TimeGrouper(freq='h'))['data'].nunique(), "\n"
# correct counts
print test.set_index('time').groupby(pd.TimeGrouper(freq='h'))['data'].apply(pd.Series.nunique)

This gives

time  
2016-06-28 09:00:00    1  
2016-06-28 10:00:00    0  
2016-06-28 11:00:00    0  
2016-06-28 12:00:00    0  
2016-06-28 13:00:00    0  
2016-06-28 14:00:00    0  
2016-06-28 15:00:00    0  
2016-06-28 16:00:00    1  
Freq: H, Name: data, dtype: int64   

time  
2016-06-28 09:00:00    1  
2016-06-28 10:00:00    0  
2016-06-28 11:00:00    0  
2016-06-28 12:00:00    0  
2016-06-28 13:00:00    0  
2016-06-28 14:00:00    0  
2016-06-28 15:00:00    0  
2016-06-28 16:00:00    2  
Freq: H, Name: data, dtype: int64  

I believe the problem is in the second-to-last line of groupby.nunique(), i.e. line 2955 in groupby.py:

res[ids] = out

I suspect ids should not be used for the indexing; it has different dimensions than out.
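For intuition, here is a toy NumPy sketch of a reduceat-style grouped-nunique kernel (all names here are hypothetical stand-ins, not the actual pandas internals): `out` holds one unique count per non-empty group, so the scatter back into the dense result must be indexed by each group's id taken at the group boundaries, not by the full per-row `ids` array:

```python
import numpy as np

# Per-row group ids (sorted) and values; groups 1 and 2 are empty.
ids = np.array([0, 0, 0, 3, 3])
vals = np.array([10, 10, 20, 30, 30])
ngroups = 4

# Sort by (group, value) so duplicates within a group are adjacent.
order = np.lexsort((vals, ids))
ids_s, vals_s = ids[order], vals[order]

# A row starts a new unique value if the value or the group changes.
inc = np.ones(len(vals_s), dtype=np.int64)
inc[1:] = (vals_s[1:] != vals_s[:-1]) | (ids_s[1:] != ids_s[:-1])

# Group boundaries, then one unique count per *non-empty* group.
idx = np.concatenate(([0], np.flatnonzero(ids_s[1:] != ids_s[:-1]) + 1))
out = np.add.reduceat(inc, idx)

res = np.zeros(ngroups, dtype=np.int64)
res[ids_s[idx]] = out   # index by the boundary ids, not by all of `ids`
print(res)              # [2 0 0 1]
```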

pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Darwin
OS-release: 14.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.19.0
nose: 1.3.7
pip: 8.1.2
setuptools: 27.2.0
Cython: 0.24.1
numpy: 1.11.2
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.4.6
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 1.1.0
tables: 3.2.3.1
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.6.4
bs4: 4.5.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.42.0
pandas_datareader: 0.2.1

jreback (Contributor) commented Oct 9, 2016

@mgalbright why don't you submit a pull request with your test examples (and those from the issue) and the proposed fix, and see if that breaks anything else? Would be greatly appreciated!

aiguofer (Contributor) commented Nov 14, 2016

Hey, is there any progress on this? I just realized that a report I've been building is giving wrong results, and I believe it's due to this bug. I can't share all the code, but here's a comparison of groupby.unique, groupby.nunique, and groupby.count:

In [216]: ents.groupby(pd.Grouper(freq='1W-SAT', key='startdate'))['ent_id'].unique().tail(1)
Out[216]: 

startdate
2016-11-12    [550A00000033DHUIA2]
Freq: W-SAT, Name: ent_id, dtype: object

In [217]: ents.groupby(pd.Grouper(freq='1W-SAT', key='startdate'))['ent_id'].nunique().tail(1)
Out[217]: 

startdate
2016-11-12    7
Freq: W-SAT, Name: ent_id, dtype: int64

In [218]: ents.groupby(pd.Grouper(freq='1W-SAT', key='startdate'))['ent_id'].count().tail(1)
Out[218]: 

startdate
2016-11-12    1
Freq: W-SAT, Name: ent_id, dtype: int64
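As a cross-check independent of nunique, the length of each group's unique() array should always agree with nunique. A minimal sketch with hypothetical stand-in data (`ents` and `ent_id` mirror the report above, but the values are invented):

```python
import pandas as pd

# Hypothetical stand-in for `ents`: three rows, two distinct ids, all
# falling in the same W-SAT bucket (the week ending 2016-11-12).
ents = pd.DataFrame({
    'startdate': pd.to_datetime(['2016-11-07', '2016-11-08', '2016-11-12']),
    'ent_id': ['a', 'b', 'a'],
})

g = ents.groupby(pd.Grouper(freq='1W-SAT', key='startdate'))['ent_id']
lengths = g.unique().apply(len)   # ground truth per bucket
nunique = g.nunique()

# On a fixed pandas these agree; on the buggy versions nunique drifts.
assert (lengths == nunique).all()
```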

jreback (Contributor) commented Nov 14, 2016

@aiguofer pull requests to fix this are welcome.

hantusk commented Feb 6, 2017

Not really adding anything, but I just ran into this issue in a work report as well (pandas 0.19.2). Passing .agg(pd.Series.nunique) works great; thanks for the tip.

aiguofer pushed a commit to aiguofer/pandas that referenced this issue Feb 15, 2017
We only need to use the group boundaries as the index for `res` so that
the dimensions match those of `out`. Fixes pandas-dev#13453
aiguofer (Contributor) commented:
Took a look at @mgalbright's comment and suggestion, and if I'm understanding the code correctly, the above PR should fix it. I ran nosetests pandas/tests/groupby and had only one unrelated failure (test_series_groupby_value_counts() takes exactly 2 arguments (0 given)).

aiguofer pushed a commit to aiguofer/pandas that referenced this issue Feb 16, 2017
@jreback jreback modified the milestones: 0.20.0, Next Major Release Feb 16, 2017
AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this issue Mar 21, 2017
closes pandas-dev#13453

Author: Diego Fernandez <[email protected]>

Closes pandas-dev#15418 from aiguofer/gh_13453 and squashes the following commits:

c53bd70 [Diego Fernandez] Add test for pandas-dev#13453 in test_resample and add note to whatsnew
0daab80 [Diego Fernandez] Ensure the right values are set in SeriesGroupBy.nunique
6 participants