Skip to content

BUG: When agg or reset_index to grouped DF, insert function causes ValueError. #20566

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
TakaakiFuruse opened this issue Mar 31, 2018 · 4 comments

Comments

@TakaakiFuruse
Copy link

Code Sample, a copy-pastable example if possible

Case 1

import pandas as pd
df = pd.DataFrame({
               'a' : [1,1,1,2,2,2,3,3,3],
               'b' : [1,2,3,4,5,6,7,8,9],
})
df.groupby('a', as_index=False).agg({'a': 'count'})

Case 2

import pandas as pd
df = pd.DataFrame({
               'a' : [1,1,1,2,2,2,3,3,3],
               'b' : [1,2,3,4,5,6,7,8,9],
})
df.groupby(['a']).agg({'a': 'count'}).reset_index(level=0)

It returns "ValueError: cannot insert a, already exists" for both caes.

Problem description

So far, I could found 2 functions which calls "insert" function without "allow_duplicates=True" option.
One is in "def _insert_inaxis_grouper_inplace" from "pandas/core/groupby.py"

result.insert(0, name, lev)

and another is in "def reset_index" from "pandas/core/frame.py"

 new_obj.insert(0, name, level_values)

For both cases, it result in ValueError as I wrote in Code section.

Expected Output

I think for both cases it should at least return some kind of dataframe without returning ValueError.

  1. It should add a new column even it was duplicated.
  2. It should add a new column with renaming it (say a_1 or a_x as new column).

My concern is

  1. Is this a bug?
  • Should we supposed to aggrecate for "as_index=False" dataframe or do "reset_index" to aggregated dataframe?
  • (i.e. Are there any other way to rename column before doing reset_index or agg?)
  1. If this were a bug, how could I fix it?
  • I could add "allow_duplicate=True" option for both lines.
  • Or, I could rename column by adding "_x" or "_dup" or any kind of suffix.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Linux
OS-release: 4.8.0-53-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: ja_JP.UTF-8
LOCALE: ja_JP.UTF-8

pandas: 0.22.0
pytest: 3.5.0
pip: 9.0.3
setuptools: 39.0.1
Cython: 0.28.1
numpy: 1.14.2
scipy: 1.0.1
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.7.2
patsy: 0.5.0
dateutil: 2.7.2
pytz: 2018.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.1
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.2.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.5
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@WillAyd
Copy link
Member

WillAyd commented Apr 1, 2018

What is the expected result of your code samples?

@TakaakiFuruse
Copy link
Author

TakaakiFuruse commented Apr 1, 2018

@WillAyd

For Case 1,

   a_x  a
0    1  3
1    2  3
2    3  3

or

   a  a
0    1  3
1    2  3
2    3  3

For Case2,

   a  a
0  1  3
1  2  3
2  3  3

or

   a_x  a
0  1  3
1  2  3
2  3  3

@WillAyd
Copy link
Member

WillAyd commented Apr 2, 2018

Your problem is that you are trying to group and aggregate the same column, which doesn't make sense. What you are looking for is below:

In [13]: df.groupby('a', as_index=False).agg({'b': 'count'})
Out[13]: 
   a  b
0  1  3
1  2  3
2  3  3

@TakaakiFuruse
Copy link
Author

Thanks. So this isn't a bug. My mistake.
Let me close the issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants