
DataFrame.groupby(grp, axis=1) with categorical grp breaks #13420


Closed
thorbjornwolf opened this issue Jun 10, 2016 · 4 comments · Fixed by #27788
Labels
Categorical Categorical Data Type Groupby

@thorbjornwolf

While attempting to use pd.qcut (which returned a Categorical) to bin some data in groups for plotting, I encountered the following error. The idea is to group a DataFrame by columns (axis=1) using a Categorical.

Minimal breaking example

>>> import pandas
>>> df = pandas.DataFrame({'a':[1,2,3,4], 'b':[-1,-2,-3,-4], 'c':[5,6,7,8]})
>>> df
   a  b  c
0  1 -1  5
1  2 -2  6
2  3 -3  7
3  4 -4  8
>>> grp = pandas.Categorical([1,0,1])
>>> df.groupby(grp, axis=1).mean()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ntawolf/anaconda3/lib/python3.5/site-packages/pandas/core/generic.py", line 3778, in groupby
    **kwargs)
  File "/home/ntawolf/anaconda3/lib/python3.5/site-packages/pandas/core/groupby.py", line 1427, in groupby
    return klass(obj, by, **kwds)
  File "/home/ntawolf/anaconda3/lib/python3.5/site-packages/pandas/core/groupby.py", line 354, in __init__
    mutated=self.mutated)
  File "/home/ntawolf/anaconda3/lib/python3.5/site-packages/pandas/core/groupby.py", line 2390, in _get_grouper
    raise ValueError("Categorical dtype grouper must "
ValueError: Categorical dtype grouper must have len(grouper) == len(data)

Expected behaviour

Same as

>>> df.T.groupby(grp, axis=0).mean().T
   0  1
0 -1  3
1 -2  4
2 -3  5
3 -4  6

So, it works as expected when doubly transposed. This makes it appear as a bug to me.
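The double transpose above can be wrapped in a small helper as a workaround while the axis=1 path rejects a Categorical. This is only a sketch: the name groupby_columns_mean is made up for illustration and is not pandas API, and passing observed=False assumes a pandas version that has the observed keyword (the 0.18.1 install in this report predates it).

```python
import pandas as pd


def groupby_columns_mean(frame, grouper):
    """Hypothetical helper: average the columns of `frame` within each group.

    `grouper` must have one entry per column. The frame is transposed so the
    grouping runs along axis=0, then transposed back.
    """
    # observed=False keeps every category, even one with no columns in it.
    return frame.T.groupby(grouper, observed=False).mean().T


df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [-1, -2, -3, -4], 'c': [5, 6, 7, 8]})
grp = pd.Categorical([1, 0, 1])  # positional: a -> 1, b -> 0, c -> 1

result = groupby_columns_mean(df, grp)
print(result)
```

Group 0 contains only column b, and group 1 is the row-wise mean of columns a and c, which matches the expected output shown above.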

Proposed solution

In the check if is_categorical_dtype(gpr) and len(gpr) != len(obj): (in _get_grouper), change len(obj) to obj.shape[axis]. For axis=0 this leaves the behaviour unchanged, assuming that len(obj) == obj.shape[0] for all obj.

So, supposing you agree that this is a bug, should a test be put in test_groupby_categorical?
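To make the proposal concrete, here is a standalone sketch of the axis-aware length check. It is not the pandas source: check_categorical_grouper_length is a hypothetical stand-in for the condition inside _get_grouper, and it uses isinstance with pd.CategoricalDtype in place of is_categorical_dtype.

```python
import pandas as pd


def check_categorical_grouper_length(gpr, obj, axis=0):
    """Hypothetical stand-in for the length check discussed above.

    Compares the grouper length against the axis being grouped
    (obj.shape[axis]) rather than against len(obj).
    """
    is_categorical = isinstance(getattr(gpr, "dtype", None), pd.CategoricalDtype)
    if is_categorical and len(gpr) != obj.shape[axis]:
        raise ValueError("Categorical dtype grouper must "
                         "have len(grouper) == len(data)")


df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [-1, -2, -3, -4], 'c': [5, 6, 7, 8]})
grp = pd.Categorical([1, 0, 1])

# Three entries for three columns: accepted when grouping along axis=1.
check_categorical_grouper_length(grp, df, axis=1)

# Three entries for four rows: still rejected along axis=0.
axis0_error = None
try:
    check_categorical_grouper_length(grp, df, axis=0)
except ValueError as exc:
    axis0_error = str(exc)
print(axis0_error)
```

A test along these lines could assert both outcomes: no error for axis=1 with a column-length Categorical, and the existing ValueError for axis=0.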

output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Linux
OS-release: 3.19.0-59-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 22.0.5
Cython: 0.24
numpy: 1.10.4
scipy: 0.17.1
statsmodels: 0.6.1
xarray: None
IPython: 4.2.0
sphinx: 1.4.1
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.5.2
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.1
xlsxwriter: 0.8.9
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.40.0
pandas_datareader: None
@jreback
Contributor

jreback commented Jun 10, 2016

Your grouper is not a valid categorical as it doesn't map anything. Though this still fails.

In [30]: grp = pd.Categorical.from_codes([1,0,1],categories=list('abc'))

In [31]: grp
Out[31]: 
[b, a, b]
Categories (3, object): [a, b, c]

In [32]: grp.codes
Out[32]: array([1, 0, 1], dtype=int8)

So I'd say this is a bug, but the tests need a bit of work.

@jreback jreback added this to the Next Major Release milestone Jun 10, 2016
@thorbjornwolf
Author

That's great!

A question for your answer, though: You say that grp = pd.Categorical([1,0,1]) is

not a valid categorical as it doesn't map anything.

What do you mean by this? The counter-example shown above has the categories given explicitly, but the first example (giving only values) should work fine, since the categories, if not given, are assumed to be the unique values of the values argument. What am I missing?

Small demo of codes and categories from first example

In [4]: grp = pd.Categorical([1,0,1])

In [5]: grp
Out[5]: 
[1, 0, 1]
Categories (2, int64): [0, 1]

In [6]: grp.codes
Out[6]: array([1, 0, 1], dtype=int8)

In [7]: grp.categories
Out[7]: Int64Index([0, 1], dtype='int64')

Thank you for your work!

@jreback
Contributor

jreback commented Jun 10, 2016

The problem in your example is that nothing maps.
In other words, the grouper needs to map the column names to groups.

But you are mapping integers. I don't think we raise an error on this, but everything ends up in the NaN group, so I think it should return an empty frame.
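One reading of this comment is that the grouper should be keyed by column label rather than by position. A minimal sketch of that reading, using a plain dict from column name to group together with the transposed form that already works in the report; the name column_to_group is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [-1, -2, -3, -4], 'c': [5, 6, 7, 8]})

# Hypothetical label-based mapping: each column name is assigned a group.
column_to_group = {'a': 1, 'b': 0, 'c': 1}

# After transposing, the column names are the index, so the dict is looked up
# per index label instead of being applied positionally.
by_label = df.T.groupby(column_to_group).mean().T
print(by_label)
```

With this mapping the groups are the same as in the positional Categorical([1, 0, 1]) example, so the result has one column per group.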

@naure

naure commented Sep 25, 2017

This is in fact related to grouping by categories. Here is an example:

In [1]: import pandas
   ...: df = pandas.DataFrame({'A': ["pos", "neg", "pos"], 'B': [1, -1, 2]})
   ...: df.A = df.A.astype("category")
   ...: df
Out[1]:
     A  B
0  pos  1
1  neg -1
2  pos  2

In [2]: grp = df.A[1:]                      # Same indexing, different lengths

In [4]: df.groupby(grp).mean()              # Categorical + different length = bug

~/Library/Python/3.6/lib/python/site-packages/pandas/core/groupby.py in _get_grouper(obj, key, axis, level, sort, mutated)
   2624
   2625         if is_categorical_dtype(gpr) and len(gpr) != len(obj):
-> 2626             raise ValueError("Categorical dtype grouper must "
   2627                              "have len(grouper) == len(data)")
   2628

ValueError: Categorical dtype grouper must have len(grouper) == len(data)

In [5]: df.groupby(grp.astype(str)).mean()  # Convert to string to avoid the buggy check
Out[5]:
     B
A
neg -1
pos  2
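An alternative to the string conversion, sketched here under the assumption that only the length check stands in the way: reindex the grouper to the frame's index so the lengths match while the categorical dtype is kept. The unmatched row gets a missing group and is dropped by groupby. The name aligned is made up for illustration, and column B is selected explicitly so the categorical column A is not aggregated.

```python
import pandas as pd

df = pd.DataFrame({'A': ["pos", "neg", "pos"], 'B': [1, -1, 2]})
df['A'] = df['A'].astype("category")

grp = df['A'].iloc[1:]  # same index labels, but one row shorter than df

# Reindexing pads the missing row with NaN, so len(aligned) == len(df)
# and the dtype stays categorical.
aligned = grp.reindex(df.index)

# Rows whose group is NaN are left out; observed=True keeps only the
# categories that actually occur.
result = df.groupby(aligned, observed=True)['B'].mean()
print(result)
```

As in the string-based workaround, row 0 does not contribute to either group.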


3 participants