Resampler.nunique counting data more than once #13453


Closed
jcrist opened this issue Jun 15, 2016 · 9 comments
Labels: Bug, Resample (resample method)
Milestone: 0.20.0

jcrist (Contributor) commented Jun 15, 2016

xref: additional example in #13795

Pandas Resampler.nunique appears to be putting the same data in multiple bins:

import pandas as pd

# Create a series with a datetime index
index = pd.date_range('1-1-2000', '2-15-2000', freq='h')
index2 = pd.date_range('4-15-2000', '5-15-2000', freq='h')
index3 = index.append(index2)
s = pd.Series(range(len(index3)), index=index3)

# Since all elements are unique, `count` and `nunique` should give the same result
count = s.resample('M').count()
nunique = s.resample('M').nunique()

In pandas 0.18.1 and 0.18.0 these don't give the same results when they should:

In [3]: count
Out[3]:
2000-01-31    744
2000-02-29    337
2000-03-31      0
2000-04-30    384
2000-05-31    337
Freq: M, dtype: int64

In [4]: nunique
Out[4]:
2000-01-31    337
2000-02-29    744
2000-03-31      0
2000-04-30    744
2000-05-31    337
Freq: M, dtype: int64

In pandas 0.17.0 and 0.17.1 (adjusting to old style resample syntax), the nunique one fails due to a "ValueError: Wrong number of items passed 4, placement implies 5" somewhere in the depths of internals.py. If I go back to 0.16.2, I do get the same result for each.

I'm not sure what's going on here. Since the nunique results sum to more than the length of the series, it appears data is being counted more than once.
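The invariant behind the report can be checked directly: since every value in `s` is distinct, each bin's `nunique` must equal its `count`, and both must sum to `len(s)`. A minimal sketch, reusing the index construction from the snippet above (the `'M'` alias matches the thread's era; newer pandas spells it `'ME'`):

```python
import pandas as pd

# Rebuild the series from the snippet above: two hourly date ranges with a
# gap in March, so every value is distinct.
index = pd.date_range('1-1-2000', '2-15-2000', freq='h')
index2 = pd.date_range('4-15-2000', '5-15-2000', freq='h')
index3 = index.append(index2)
s = pd.Series(range(len(index3)), index=index3)

count = s.resample('M').count()
nunique = s.resample('M').nunique()

# Both checks fail on 0.18.x; they hold once the bug is fixed.
assert count.sum() == len(s)
assert nunique.sum() == len(s)
assert count.equals(nunique)
```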


In [19]: pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 15.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1
nose: None
pip: 8.1.2
setuptools: 23.0.0
Cython: None
numpy: 1.10.4
scipy: None
statsmodels: None
xarray: None
IPython: 4.2.0
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: None
pandas_datareader: None
jcrist (Contributor) commented Jun 15, 2016

May be related to #10914.

jcrist (Contributor) commented Jun 15, 2016

Interestingly, everything works fine if agg is used with pd.Series.nunique instead:

In [11]: r = s.resample('M')

In [12]: r.agg(pd.Series.nunique)
Out[12]:
2000-01-31    744
2000-02-29    337
2000-03-31      0
2000-04-30    384
2000-05-31    337
Freq: M, dtype: int64

In [13]: r.nunique()    # same result as r.agg('nunique')
Out[13]:
2000-01-31    337
2000-02-29    744
2000-03-31      0
2000-04-30    744
2000-05-31    337
Freq: M, dtype: int64

sinhrks added the Bug and Resample (resample method) labels on Jun 15, 2016
sinhrks (Member) commented Jun 15, 2016

CC: @behzadnouri

@jreback jreback added this to the Next Major Release milestone Jul 26, 2016
mgalbright commented:
I think the root cause of the problem is in groupby.nunique(), which I believe is eventually called by resample.nunique(). Note that groupby.nunique() shows the same bug:

import pandas as pd
from pandas import Timestamp

data = ['1', '2', '3']
time = [Timestamp('2016-06-28 09:35:35'), Timestamp('2016-06-28 16:09:30'), Timestamp('2016-06-28 16:46:28')]
test = pd.DataFrame({'time': time, 'data': data})

# wrong counts
print test.set_index('time').groupby(pd.TimeGrouper(freq='h'))['data'].nunique(), "\n"
# correct counts
print test.set_index('time').groupby(pd.TimeGrouper(freq='h'))['data'].apply(pd.Series.nunique)

This gives

time  
2016-06-28 09:00:00    1  
2016-06-28 10:00:00    0  
2016-06-28 11:00:00    0  
2016-06-28 12:00:00    0  
2016-06-28 13:00:00    0  
2016-06-28 14:00:00    0  
2016-06-28 15:00:00    0  
2016-06-28 16:00:00    1  
Freq: H, Name: data, dtype: int64   

time  
2016-06-28 09:00:00    1  
2016-06-28 10:00:00    0  
2016-06-28 11:00:00    0  
2016-06-28 12:00:00    0  
2016-06-28 13:00:00    0  
2016-06-28 14:00:00    0  
2016-06-28 15:00:00    0  
2016-06-28 16:00:00    2  
Freq: H, Name: data, dtype: int64  

I believe the problem is in the second-to-last line of groupby.nunique(), i.e. line 2955 in groupby.py:

res[ids] = out

I suspect ids should not be used for the indexing; it has different dimensions than out.
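For intuition, here is a toy NumPy sketch of a reduceat-style grouped-nunique kernel (all names here are hypothetical stand-ins, not the actual pandas internals): `out` holds one unique count per non-empty group, so the scatter back into the dense result must be indexed by each group's id taken at the group boundaries, not by the full per-row `ids` array:

```python
import numpy as np

# Per-row group ids (sorted) and values; groups 1 and 2 are empty.
ids = np.array([0, 0, 0, 3, 3])
vals = np.array([10, 10, 20, 30, 30])
ngroups = 4

# Sort by (group, value) so duplicates within a group are adjacent.
order = np.lexsort((vals, ids))
ids_s, vals_s = ids[order], vals[order]

# A row starts a new unique value if the value or the group changes.
inc = np.ones(len(vals_s), dtype=np.int64)
inc[1:] = (vals_s[1:] != vals_s[:-1]) | (ids_s[1:] != ids_s[:-1])

# Group boundaries, then one unique count per *non-empty* group.
idx = np.concatenate(([0], np.flatnonzero(ids_s[1:] != ids_s[:-1]) + 1))
out = np.add.reduceat(inc, idx)

res = np.zeros(ngroups, dtype=np.int64)
res[ids_s[idx]] = out   # index by the boundary ids, not by all of `ids`
print(res)              # [2 0 0 1]
```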

pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Darwin
OS-release: 14.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.19.0
nose: 1.3.7
pip: 8.1.2
setuptools: 27.2.0
Cython: 0.24.1
numpy: 1.11.2
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.4.6
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 1.1.0
tables: 3.2.3.1
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.6.4
bs4: 4.5.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.42.0
pandas_datareader: 0.2.1

jreback (Contributor) commented Oct 9, 2016

@mgalbright why don't you submit a pull request with your test examples (and those from the issue) and the proposed fix, and see if that breaks anything else? Would be greatly appreciated!

aiguofer (Contributor) commented Nov 14, 2016

Hey, is there any progress on this? I just realized that a report I've been building is giving wrong results, and I believe it's due to this bug. I can't share all the code, but here's a comparison of groupby.unique, groupby.nunique, and groupby.count:

In [216]: ents.groupby(pd.Grouper(freq='1W-SAT', key='startdate'))['ent_id'].unique().tail(1)
Out[216]: 

startdate
2016-11-12    [550A00000033DHUIA2]
Freq: W-SAT, Name: ent_id, dtype: object

In [217]: ents.groupby(pd.Grouper(freq='1W-SAT', key='startdate'))['ent_id'].nunique().tail(1)
Out[217]: 

startdate
2016-11-12    7
Freq: W-SAT, Name: ent_id, dtype: int64

In [218]: ents.groupby(pd.Grouper(freq='1W-SAT', key='startdate'))['ent_id'].count().tail(1)
Out[218]: 

startdate
2016-11-12    1
Freq: W-SAT, Name: ent_id, dtype: int64
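As a cross-check independent of nunique, the length of each group's unique() array should always agree with nunique. A minimal sketch with hypothetical stand-in data (`ents` and `ent_id` mirror the report above, but the values are invented):

```python
import pandas as pd

# Hypothetical stand-in for `ents`: three rows, two distinct ids, all
# falling in the same W-SAT bucket (the week ending 2016-11-12).
ents = pd.DataFrame({
    'startdate': pd.to_datetime(['2016-11-07', '2016-11-08', '2016-11-12']),
    'ent_id': ['a', 'b', 'a'],
})

g = ents.groupby(pd.Grouper(freq='1W-SAT', key='startdate'))['ent_id']
lengths = g.unique().apply(len)   # ground truth per bucket
nunique = g.nunique()

# On a fixed pandas these agree; on the buggy versions nunique drifts.
assert (lengths == nunique).all()
```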

jreback (Contributor) commented Nov 14, 2016

@aiguofer pull requests to fix this are welcome.

hantusk commented Feb 6, 2017

Not really adding anything, but I just ran into this issue in a work report as well (pandas 0.19.2). Passing .agg(pd.Series.nunique) works great; thanks for the tip.

aiguofer pushed a commit to aiguofer/pandas that referenced this issue Feb 15, 2017
We only need to use the group boundaries as the index for `res` so that
the dimensions match those of `out`. Fixes pandas-dev#13453
aiguofer (Contributor) commented:
Took a look at @mgalbright's comment and suggestion, and if I'm understanding the code correctly, the above PR should fix it. I ran nosetests pandas/tests/groupby and had only one unrelated failure (test_series_groupby_value_counts() takes exactly 2 arguments (0 given)).

aiguofer pushed a commit to aiguofer/pandas that referenced this issue Feb 16, 2017
@jreback jreback modified the milestones: 0.20.0, Next Major Release Feb 16, 2017
AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this issue Mar 21, 2017
closes pandas-dev#13453

Author: Diego Fernandez <[email protected]>

Closes pandas-dev#15418 from aiguofer/gh_13453 and squashes the following commits:

c53bd70 [Diego Fernandez] Add test for pandas-dev#13453 in test_resample and add note to whatsnew
0daab80 [Diego Fernandez] Ensure the right values are set in SeriesGroupBy.nunique
6 participants