Skip to content

SeriesGroupBy.nunique raises IndexError for empty Series with categorical variables #21334

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
parasurama opened this issue Jun 5, 2018 · 2 comments · Fixed by #50081
Closed
Assignees
Labels
Bug Categorical Categorical Data Type Groupby

Comments

@parasurama
Copy link

parasurama commented Jun 5, 2018

Code Sample

df = pd.DataFrame({'id': ['1001', '1001', '1002', '1002', '1003', '1004', '1005', '1006'],
                   'cat': ['A', 'B', 'C', 'C', 'C', 'B', 'B', 'D'],
                   'cat2': ['a', 'a', 'b', 'b', 'c', 'd', 'b', 'd']})
df['cat'] = df['cat'].astype('category')
df['cat2'] = df['cat2'].astype('category')

Problem description

If I subset this dataframe so groupby will return some empty Series and apply the following lambda function, it throws an index error IndexError: index 0 is out of bounds for axis 1 with size 0

df2 = df[df['cat2'].isin(['a', 'b'])]
df2.groupby('cat').apply(lambda x: x.groupby('cat2')['id'].agg('nunique'))  # throws IndexError
  • Note that df2.groupby(['cat', 'cat2'])['id'].agg('nunique') doesn't throw an error.
    However, I have .groupby('cat2')['id'].agg('nunique') as part of a separate function. Hence using lambda for illustrative purposes.

  • Also note that the count aggregator doesn't throw an error:
    df2.groupby('cat').apply(lambda x: x.groupby('cat2')['id'].agg('count'))

  • If i don't categorize the variables, it doesn't throw an error

Expected Output

cat cat2
A a 1
B a 1
B b 1
C b 1

Output of pd.show_versions()

[paste the output of pd.show_versions() here below this line]
INSTALLED VERSIONS

commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Darwin
OS-release: 17.2.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: en_US.UTF-8
pandas: 0.22.0
pytest: 3.2.3
pip: 10.0.1
setuptools: 39.1.0
Cython: 0.25.2
numpy: 1.14.2
scipy: 1.0.1
pyarrow: None
xarray: None
IPython: 5.3.0
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.7.2
pytz: 2018.3
blosc: None
bottleneck: 1.2.1
tables: 3.3.0
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.3
bs4: 4.6.0
html5lib: 1.0b8
sqlalchemy: 1.1.9
pymysql: None
psycopg2: 2.7.1 (dt dec pq3 ext lo64)
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@gfyoung gfyoung added Groupby Categorical Categorical Data Type labels Jun 6, 2018
@gfyoung
Copy link
Member

gfyoung commented Jun 6, 2018

Admittedly, the code sample is a little unusual, but nevertheless, IMO it should work whether the variables are categorical or not.

@jreback : thoughts?

@rhshadrach
Copy link
Member

Here is a simplified example; it happens on any empty categoricals.

ser = pd.Series([1]).astype('category')[:0]
gb = ser.groupby(ser)
result = gb.nunique()
print(result)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Categorical Categorical Data Type Groupby
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants