DataFrame.groupby(grp, axis=1) with categorical grp breaks #13420

thorbjornwolf · 2016-06-10T10:17:09Z

While attempting to use pd.qcut (which returned a Categorical) to bin some data in groups for plotting, I encountered the following error. The idea is to group a DataFrame by columns (axis=1) using a Categorical.

Minimal breaking example

>>> import pandas
>>> df = pandas.DataFrame({'a':[1,2,3,4], 'b':[-1,-2,-3,-4], 'c':[5,6,7,8]})
>>> df
   a  b  c
0  1 -1  5
1  2 -2  6
2  3 -3  7
3  4 -4  8
>>> grp = pandas.Categorical([1,0,1])
>>> df.groupby(grp, axis=1).mean()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ntawolf/anaconda3/lib/python3.5/site-packages/pandas/core/generic.py", line 3778, in groupby
    **kwargs)
  File "/home/ntawolf/anaconda3/lib/python3.5/site-packages/pandas/core/groupby.py", line 1427, in groupby
    return klass(obj, by, **kwds)
  File "/home/ntawolf/anaconda3/lib/python3.5/site-packages/pandas/core/groupby.py", line 354, in __init__
    mutated=self.mutated)
  File "/home/ntawolf/anaconda3/lib/python3.5/site-packages/pandas/core/groupby.py", line 2390, in _get_grouper
    raise ValueError("Categorical dtype grouper must "
ValueError: Categorical dtype grouper must have len(grouper) == len(data)

Expected behaviour

Same as

>>> df.T.groupby(grp, axis=0).mean().T
   0  1
0 -1  3
1 -2  4
2 -3  5
3 -4  6

So, it works as expected when doubly transposed. This makes it appear as a bug to me.

Proposed solution

In the check `if is_categorical_dtype(gpr) and len(gpr) != len(obj):`, change `len(obj)` to `obj.shape[axis]`. This assumes that `len(obj) == obj.shape[0]` for all obj.
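To illustrate the proposed change, here is a minimal pure-Python sketch of the two length comparisons. The helper name `grouper_length_matches` is hypothetical and is not part of pandas; it only mimics the comparison inside the check, using the shape of the example frame (4 rows, 3 columns) and a grouper of length 3.

```python
def grouper_length_matches(grouper_len, shape, axis=0):
    """Hypothetical helper: compare the grouper length to the grouped axis."""
    return grouper_len == shape[axis]


shape = (4, 3)   # the example frame: 4 rows, 3 columns
grouper_len = 3  # len(pandas.Categorical([1, 0, 1]))

# Current check: len(obj) is shape[0], so 3 is compared with 4 and fails.
current_check_ok = grouper_length_matches(grouper_len, shape, axis=0)

# Proposed check: obj.shape[axis] with axis=1, so 3 is compared with 3.
proposed_check_ok = grouper_length_matches(grouper_len, shape, axis=1)

print(current_check_ok, proposed_check_ok)
```

With axis=0 the two forms are identical, so the change should only affect the axis=1 case.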

So, supposing you agree that this is a bug, should a test be put in test_groupby_categorical?

output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Linux
OS-release: 3.19.0-59-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 22.0.5
Cython: 0.24
numpy: 1.10.4
scipy: 0.17.1
statsmodels: 0.6.1
xarray: None
IPython: 4.2.0
sphinx: 1.4.1
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.5.2
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.1
xlsxwriter: 0.8.9
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.40.0
pandas_datareader: None


jreback · 2016-06-10T12:50:48Z

Your grouper is not a valid categorical as it doesn't map anything. Though this still fails.

In [30]: grp = pd.Categorical.from_codes([1,0,1],categories=list('abc'))

In [31]: grp
Out[31]: 
[b, a, b]
Categories (3, object): [a, b, c]

In [32]: grp.codes
Out[32]: array([1, 0, 1], dtype=int8)

So I'd say this is a bug, but the tests need a bit of a workout.

thorbjornwolf · 2016-06-10T13:19:26Z

That's great!

A question about your answer, though: you say that grp = pd.Categorical([1,0,1]) is "not a valid categorical as it doesn't map anything."

What do you mean by this? The counter-example shown above has the categories given explicitly, but the first example (giving only values) should work fine, since the categories, if not given, are assumed to be the unique values of `values`. What am I missing?

Small demo of codes and categories from first example

In [4]: grp = pd.Categorical([1,0,1])

In [5]: grp
Out[5]: 
[1, 0, 1]
Categories (2, int64): [0, 1]

In [6]: grp.codes
Out[6]: array([1, 0, 1], dtype=int8)

In [7]: grp.categories
Out[7]: Int64Index([0, 1], dtype='int64')

Thank you for your work!

jreback · 2016-06-10T18:45:20Z

The problem in your example is that nothing maps; in other words, you need to map the column names to groups.

But you are mapping integers. I don't think we raise an error on this, but everything goes into the NaN group, so it should return an empty frame, I think.
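As a rough sketch of a grouper that does map column names to groups, the example below builds a categorical Series indexed by the column names. The group labels 'x' and 'y' and the name `col_groups` are assumptions made for illustration. It uses the transpose form from the original report, so it does not depend on axis=1 handling.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4],
                   'b': [-1, -2, -3, -4],
                   'c': [5, 6, 7, 8]})

# Hypothetical labels: columns 'a' and 'c' go to group 'x', column 'b' to 'y'.
# The Series index holds the column names, so each name maps to a group.
col_groups = pd.Series(pd.Categorical(['x', 'y', 'x']), index=df.columns)

# Group the rows of the transposed frame, then transpose back.
result = df.T.groupby(col_groups, observed=True).mean().T
print(result)
```

Here the grouper has the same length as the transposed frame and its index matches the column labels, so each column name is assigned to a category.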

naure · 2017-09-25T15:07:51Z

This is in fact related to grouping by categories. Here is an example:

In [1]: import pandas
   ...: df = pandas.DataFrame({'A': ["pos", "neg", "pos"], 'B': [1, -1, 2]})
   ...: df.A = df.A.astype("category")
   ...: df
Out[1]:
     A  B
0  pos  1
1  neg -1
2  pos  2

In [2]: grp = df.A[1:]                      # Same indexing, different lengths

In [4]: df.groupby(grp).mean()              # Categorical + different length = bug

~/Library/Python/3.6/lib/python/site-packages/pandas/core/groupby.py in _get_grouper(obj, key, axis, level, sort, mutated)
   2624
   2625         if is_categorical_dtype(gpr) and len(gpr) != len(obj):
-> 2626             raise ValueError("Categorical dtype grouper must "
   2627                              "have len(grouper) == len(data)")
   2628

ValueError: Categorical dtype grouper must have len(grouper) == len(data)

In [5]: df.groupby(grp.astype(str)).mean()  # Convert to string to avoid the buggy check
Out[5]:
     B
A
neg -1
pos  2

jreback added Groupby Categorical Categorical Data Type Difficulty Intermediate labels Jun 10, 2016

jreback added this to the Next Major Release milestone Jun 10, 2016

jreback mentioned this issue Aug 2, 2019

BUG: grouby(axis=1) cannot select column names #27700

Merged

5 tasks

charlesdong1991 mentioned this issue Aug 6, 2019

TST: Add tests for groupby categorical values with axis=1 #27788

Merged

5 tasks

jreback modified the milestones: Contributions Welcome, 1.0 Aug 7, 2019

WillAyd closed this as completed in #27788 Aug 7, 2019
