BUG: DataFrame.groupby() on tuple column works only when column name is "key" #14848

Closed
dragonator4 opened this issue Dec 10, 2016 · 3 comments
dragonator4 commented Dec 10, 2016

This is the weirdest bug I have seen in Pandas. But I am guessing (hoping) the fix will not be too difficult.

Code Sample

Consider the following two code blocks:

Block 1: key column is called "k"

>>> import pandas as pd
>>> df1 = pd.DataFrame({'x': [1, 2, 3, 4, 5] * 3,
...                     'y': [10, 20, 30, 40, 50] * 3,
...                     'z': [100, 200, 300, 400, 500] * 3})
>>> df1['k'] = [(0, 0, 1), (0, 1, 0), (1, 0, 0)] * 5

Block 2: key column is called "key"

>>> df2 = pd.DataFrame({'x': [1, 2, 3, 4, 5] * 3,
...                     'y': [10, 20, 30, 40, 50] * 3,
...                     'z': [100, 200, 300, 400, 500] * 3})
>>> df2['key'] = [(0, 0, 1), (0, 1, 0), (1, 0, 0)] * 5

Note that both frames use the same static data, so the only difference between them, and hence the only possible culprit, is the name of the key column.

Problem description

Running a simple .groupby().describe() operation produces the following results:

>>> df1.groupby('k').describe()
# No Result

>>> df2.groupby('key').describe()
                        x          y           z
key
(0, 0, 1) count  5.000000   5.000000    5.000000
          mean   3.000000  30.000000  300.000000
          std    1.581139  15.811388  158.113883
          min    1.000000  10.000000  100.000000
          25%    2.000000  20.000000  200.000000
          50%    3.000000  30.000000  300.000000
          75%    4.000000  40.000000  400.000000
          max    5.000000  50.000000  500.000000
(0, 1, 0) count  5.000000   5.000000    5.000000
          mean   3.000000  30.000000  300.000000
          std    1.581139  15.811388  158.113883
          min    1.000000  10.000000  100.000000
          25%    2.000000  20.000000  200.000000
          50%    3.000000  30.000000  300.000000
          75%    4.000000  40.000000  400.000000
          max    5.000000  50.000000  500.000000
(1, 0, 0) count  5.000000   5.000000    5.000000
          mean   3.000000  30.000000  300.000000
          std    1.581139  15.811388  158.113883
          min    1.000000  10.000000  100.000000
          25%    2.000000  20.000000  200.000000
          50%    3.000000  30.000000  300.000000
          75%    4.000000  40.000000  400.000000
          max    5.000000  50.000000  500.000000

Note that groupby().mean(), .sum(), and a few other aggregations work fine; .describe() is the only one that appears to be affected.
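For example, the per-group means come out fine on df1 (the values follow from the data above, since each group contains each of the five x/y/z values exactly once):

>>> df1.groupby('k').mean()   # aggregations like mean/sum still work on df1
             x     y      z
k
(0, 0, 1)  3.0  30.0  300.0
(0, 1, 0)  3.0  30.0  300.0
(1, 0, 0)  3.0  30.0  300.0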

Expected Output

Obviously, the expected output for df1.groupby('k').describe() should be the same as that of df2.groupby('key').describe().
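Put differently, something like the following check should pass once the bug is fixed (a sketch using pandas' testing helper from its 0.19-era location; check_names=False ignores the differing index names 'k' vs 'key'):

>>> from pandas.util.testing import assert_frame_equal
>>> assert_frame_equal(df1.groupby('k').describe(),
...                    df2.groupby('key').describe(),
...                    check_names=False)  # currently fails on 0.19.1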

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-53-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.1
nose: None
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.1
numpy: 1.11.2
scipy: 0.18.1
statsmodels: 0.6.1
xarray: 0.8.2
IPython: 5.1.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.7
blosc: None
bottleneck: 1.1.0
tables: 3.3.0
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: 0.9.4
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None

@jorisvandenbossche (Member) commented:

Tuples in columns are not that well supported/tested, but this does indeed look like a bug. You're welcome to look into it!

@dragonator4 (Author) commented:

It was working perfectly in 0.19.0. Updating to 0.19.1 broke my code, and this turned out to be the cause. I usually don't work with tuples in columns, but my current project called for it. I had the choice of using three columns to store the three values, but since tuples are immutable, I chose to use them instead.

I have workarounds, so it is not high priority for me.
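One possible workaround, for example, is to group on a string rendering of the key instead of the tuple itself (a sketch; k_str is just a throwaway helper column, and string keys avoid the tuple code path entirely):

>>> df1['k_str'] = df1['k'].map(str)   # e.g. '(0, 0, 1)'
>>> df1.groupby('k_str').describe()    # works, since the keys are plain strings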

@Dr-Irv (Contributor) commented Jan 6, 2017

There's a correlation between the length of the column name and the number of items in the tuples. So 'key' worked in the example, and I think any other 3-letter name would as well, because the tuples were of that length. I will try to fix it.
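If that hypothesis holds, renaming the key column to any 3-character name should make describe() succeed on df1 (a quick check against the example above; 'abc' is an arbitrary 3-letter name):

>>> df1.rename(columns={'k': 'abc'}).groupby('abc').describe()   # expected to work
>>> df1.groupby('k').describe()                                  # 1-letter name: fails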

Dr-Irv added commits to Dr-Irv/pandas that referenced this issue (Jan 10–12, 2017)
jreback added this to the 0.20.0 milestone Jan 13, 2017
AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this issue Mar 21, 2017
… as the Index

closes pandas-dev#14848

Author: Dr-Irv <[email protected]>

Closes pandas-dev#15110 from Dr-Irv/Issue14848 and squashes the following commits:

c18c6cb [Dr-Irv] Undo change to merge.py and make whatsnew a 2 line comment.
db13c3b [Dr-Irv] Use not is_list_like
fbd20f5 [Dr-Irv] Raise error when creating index of tuples with name parameter a string
f3a7a21 [Dr-Irv] Changes per jreback requests
9489cb2 [Dr-Irv] BUG: Fix issue pandas-dev#14848 groupby().describe() on indices containing all tuples