.groupby() .value_counts() incompatible with .reset_index() in 0.18.1 #14014

Closed
dcroote opened this issue Aug 16, 2016 · 2 comments

@dcroote

dcroote commented Aug 16, 2016

Code Sample

import pandas as pd

df = pd.DataFrame([[0,1],[0,1],[0,2],[1,1]], columns=['a','b'])
df
   a  b
0  0  1
1  0  1
2  0  2
3  1  1

df.groupby('a').b.value_counts().reset_index()

ValueError: cannot insert b, already exists

Expected Output

In version 0.18.0, the output was:

   a  b  0
0  0  1  2
1  0  2  1
2  1  1  1

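The difference is that groupby() followed by value_counts() now returns a Series named after the column on which value_counts() was computed. Because the index already has a level called b, reset_index() then tries to insert two columns named b and fails.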

df.groupby('a').b.value_counts()

0.18.0

a  b
0  1    2
   2    1
1  1    1
dtype: int64

0.18.1 (including 0.18.1+367.g6b7857b)

a  b
0  1    2
   2    1
1  1    1
Name: b, dtype: int64
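
The collision can be reproduced without relying on what any particular pandas release returns from value_counts(). The sketch below builds the two Series by hand as stand-ins for the 0.18.0 and 0.18.1 results shown above; the variable names are made up for illustration.

```python
import pandas as pd

# Index levels a and b, as in the groupby().value_counts() result above.
idx = pd.MultiIndex.from_tuples([(0, 1), (0, 2), (1, 1)], names=["a", "b"])

# Stand-in for the 0.18.0 result: the Series has no name.
unnamed = pd.Series([2, 1, 1], index=idx)
# Stand-in for the 0.18.1 result: the Series is named b, like index level b.
named = pd.Series([2, 1, 1], index=idx, name="b")

# With no name, the values column falls back to the label 0.
unnamed_frame = unnamed.reset_index()
print(list(unnamed_frame.columns))

# With the name b, reset_index() would need two columns called b.
try:
    named.reset_index()
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```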

This change in behavior is not completely unexpected: outside of groupby(), value_counts() has historically returned a Series named after the column the operation was performed on:

df.a.value_counts()
0    3
1    1
Name: a, dtype: int64

A manual workaround would be to rename the Series before reset_index() as follows:

g = df.groupby('a').b.value_counts()
g.name = 0
g.reset_index()
   a  b  0
0  0  1  2
1  0  2  1
2  1  1  1

However, the one-line form was much appreciated. Would being able to pass a new name to value_counts() solve this?
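
For what it's worth, the rename can also be folded into the chain, which keeps this a one-liner. This is only a sketch: it assumes Series.rename() accepts a scalar (it does in current pandas releases), and the name counts is an arbitrary choice.

```python
import pandas as pd

df = pd.DataFrame([[0, 1], [0, 1], [0, 2], [1, 1]], columns=["a", "b"])

# Rename the counts Series so it no longer shares a name with index level b,
# then move the index levels a and b into ordinary columns.
result = df.groupby("a").b.value_counts().rename("counts").reset_index()
print(result)
```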

output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-21-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1  (also verified with 0.18.1+367.g6b7857b)
nose: 1.3.7
pip: 8.1.2
setuptools: 23.0.0
Cython: 0.24.1
numpy: 1.11.1
scipy: 0.18.0
statsmodels: None
xarray: None
IPython: 5.0.0
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: None
tables: 3.2.3.1
numexpr: 2.6.1
matplotlib: 1.5.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None

@TomAugspurger
Contributor

This is probably a result of #12363, which fixed groupby sometimes losing the name.

In this case I'd say that

In [37]: df.groupby('a').b.value_counts().reset_index(name='counts')
Out[37]:
   a  b  counts
0  0  1       2
1  0  2       1
2  1  1       1

is even clearer than your original. Thoughts?
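
A self-contained version of the suggestion, as a sketch. It also cross-checks the counts against groupby(['a', 'b']).size(), which is an alternative not mentioned in this thread; row order can differ between the two in general (value_counts() sorts by count within each group, size() sorts by key), so the comparison sorts first.

```python
import pandas as pd

df = pd.DataFrame([[0, 1], [0, 1], [0, 2], [1, 1]], columns=["a", "b"])

# Name the values column explicitly while resetting the index,
# so the name of the counts Series never matters.
via_value_counts = df.groupby("a").b.value_counts().reset_index(name="counts")

# Cross-check: count rows per (a, b) pair directly.
via_size = df.groupby(["a", "b"]).size().reset_index(name="counts")

# Sort by the keys before comparing, since row order may differ in general.
left = via_value_counts.sort_values(["a", "b"]).reset_index(drop=True)
right = via_size.sort_values(["a", "b"]).reset_index(drop=True)
print(left.values.tolist() == right.values.tolist())
```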

@dcroote
Author

dcroote commented Aug 16, 2016

Even better, thanks!

dcroote closed this as completed Aug 16, 2016