Aggregate fails with mixed types in grouping series #16916

nreeve17 · 2017-07-13T23:50:20Z

Code Sample, a copy-pastable example if possible

X = pd.DataFrame(data=np.random.rand(7, 3), columns=list('XYZ'), index=list('zxcvbnm'))
X['grouping'] = ['group 1', 'group 1', 'group 1', 2, 2 , 2, 'group 1']
X.groupby('grouping').aggregate(lambda x: x.tolist())

This is the exception and traceback that the code above returns:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/Users/nicolaireeve/miniconda2/envs/skbiodev/lib/python3.4/site-packages/pandas/core/groupby.py in aggregate(self, arg, *args, **kwargs)
   3482                     result = self._aggregate_multiple_funcs(
-> 3483                         [arg], _level=_level, _axis=self.axis)
   3484                     result.columns = Index(

/Users/nicolaireeve/miniconda2/envs/skbiodev/lib/python3.4/site-packages/pandas/core/base.py in _aggregate_multiple_funcs(self, arg, _level, _axis)
    690         if not len(results):
--> 691             raise ValueError("no results")
    692 

ValueError: no results

During handling of the above exception, another exception occurred:

AttributeError                            Traceback (most recent call last)
/Users/nicolaireeve/miniconda2/envs/skbiodev/lib/python3.4/site-packages/pandas/core/groupby.py in _aggregate_generic(self, func, *args, **kwargs)
   3508                 for name, data in self:
-> 3509                     result[name] = self._try_cast(func(data, *args, **kwargs),
   3510                                                   data)

<ipython-input-25-18b24604e98f> in <lambda>(x)
      2 X['grouping'] = ['group 1', 'group 1', 'group 1', 2, 2 , 2, 'group 1']
----> 3 X.groupby('grouping').aggregate(lambda x: x.tolist())

/Users/nicolaireeve/miniconda2/envs/skbiodev/lib/python3.4/site-packages/pandas/core/generic.py in __getattr__(self, name)
   3080                 return self[name]
-> 3081             return object.__getattribute__(self, name)
   3082 

AttributeError: 'DataFrame' object has no attribute 'tolist'

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
<ipython-input-25-18b24604e98f> in <module>()
      1 X = pd.DataFrame(data=np.random.rand(7, 3), columns=list('XYZ'), index=list('zxcvbnm'))
      2 X['grouping'] = ['group 1', 'group 1', 'group 1', 2, 2 , 2, 'group 1']
----> 3 X.groupby('grouping').aggregate(lambda x: x.tolist())

/Users/nicolaireeve/miniconda2/envs/skbiodev/lib/python3.4/site-packages/pandas/core/groupby.py in aggregate(self, arg, *args, **kwargs)
   4034         versionadded=''))
   4035     def aggregate(self, arg, *args, **kwargs):
-> 4036         return super(DataFrameGroupBy, self).aggregate(arg, *args, **kwargs)
   4037 
   4038     agg = aggregate

/Users/nicolaireeve/miniconda2/envs/skbiodev/lib/python3.4/site-packages/pandas/core/groupby.py in aggregate(self, arg, *args, **kwargs)
   3486                         name=self._selected_obj.columns.name)
   3487                 except:
-> 3488                     result = self._aggregate_generic(arg, *args, **kwargs)
   3489 
   3490         if not self.as_index:

/Users/nicolaireeve/miniconda2/envs/skbiodev/lib/python3.4/site-packages/pandas/core/groupby.py in _aggregate_generic(self, func, *args, **kwargs)
   3510                                                   data)
   3511             except Exception:
-> 3512                 return self._aggregate_item_by_item(func, *args, **kwargs)
   3513         else:
   3514             for name in self.indices:

/Users/nicolaireeve/miniconda2/envs/skbiodev/lib/python3.4/site-packages/pandas/core/groupby.py in _aggregate_item_by_item(self, func, *args, **kwargs)
   3554             # GH6337
   3555             if not len(result_columns) and errors is not None:
-> 3556                 raise errors
   3557 
   3558         return DataFrame(result, columns=result_columns)

/Users/nicolaireeve/miniconda2/envs/skbiodev/lib/python3.4/site-packages/pandas/core/groupby.py in _aggregate_item_by_item(self, func, *args, **kwargs)
   3539                                      grouper=self.grouper)
   3540                 result[item] = self._try_cast(
-> 3541                     colg.aggregate(func, *args, **kwargs), data)
   3542             except ValueError:
   3543                 cannot_agg.append(item)

/Users/nicolaireeve/miniconda2/envs/skbiodev/lib/python3.4/site-packages/pandas/core/groupby.py in aggregate(self, func_or_funcs, *args, **kwargs)
   2885                 result = self._aggregate_named(func_or_funcs, *args, **kwargs)
   2886 
-> 2887             index = Index(sorted(result), name=self.grouper.names[0])
   2888             ret = Series(result, index=index)
   2889 

TypeError: unorderable types: str() < int()

Problem description

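If the grouping column contains mixed types and aggregate is called after groupby(...), an exception is raised. Execution reaches the line `index = Index(sorted(result), name=self.grouper.names[0])` (groupby.py line 2887 in the traceback above) and fails there, because sorted() cannot order str against int in Python 3.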
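The root cause can be reproduced without pandas. The snippet below is a minimal sketch that uses the same two key types as the example above; the variable names are only illustrative.

```python
# Group keys of mixed type, as in the 'grouping' column above.
keys = ['group 1', 2]

try:
    sorted(keys)
    error_message = None
except TypeError as exc:
    # Python 3 refuses to compare str with int, so sorted() raises.
    error_message = str(exc)

print(error_message)
```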

Expected Output

This is what we would expect to see if the exception were not raised. This output was produced by grouping on a column of a single type: in this instance, the integer 2 was changed to the string '2'.

X = pd.DataFrame(data=np.random.rand(7, 3), columns=list('XYZ'), index=list('zxcvbnm'))
X['grouping'] = ['group 1', 'group 1', 'group 1', '2', '2' , '2', 'group 1']
X.groupby('grouping').aggregate(lambda x: x.tolist())

                                                          X  \
grouping                                                      
2         [0.9219120799240533, 0.6439069401684864, 0.035...   
group 1   [0.6884732212797477, 0.326906484996646, 0.6718...   

                                                          Y  \
grouping                                                      
2         [0.7796923828539405, 0.7668459596180287, 0.868...   
group 1   [0.20259205506065203, 0.9138593138141587, 0.95...   

                                                          Z  
grouping                                                     
2         [0.9863526134877422, 0.6342347501171951, 0.873...  
group 1   [0.054465751087565906, 0.9026560581041934, 0.9...
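As a workaround until the sort is fixed, the grouping column can be cast to a single type before grouping. The sketch below uses small fixed data instead of random values, and astype(str) is one assumed way of doing the cast; note that it turns the key 2 into the string '2'.

```python
import pandas as pd

X = pd.DataFrame(
    {'X': [0.1, 0.2, 0.3, 0.4], 'Y': [1.0, 2.0, 3.0, 4.0]},
    index=list('zxcv'),
)
X['grouping'] = ['group 1', 2, 2, 'group 1']

# Cast every key to str so the group keys can be ordered.
X['grouping'] = X['grouping'].astype(str)

result = X.groupby('grouping').aggregate(lambda x: x.tolist())
print(result)
```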

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 3.4.5.final.0
python-bits: 64
OS: Darwin
OS-release: 16.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 35.0.2
Cython: None
numpy: 1.13.1
scipy: 0.19.0
xarray: None
IPython: 6.0.0
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
pandas_gbq: None
pandas_datareader: None

cc @ElDeveloper

gfyoung · 2017-07-14T00:58:41Z

@nreeve17 : I presume that if you change the offending line to:

index = Index(result, name=self.grouper.names[0])

your code example would work?

@jreback : Could we just wrap the sorted(...) call in a try-except and then instantiate the Series separately?

jreback · 2017-07-14T15:08:46Z

A simpler example:

In [20]: Index([0, '1']).sort_values()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-20-c2949e5a9d7f> in <module>()
----> 1 Index([0, '1']).sort_values()

/Users/jreback/pandas/pandas/core/indexes/base.py in sort_values(self, return_indexer, ascending)
   2026         Return sorted copy of Index
   2027         """
-> 2028         _as = self.argsort()
   2029         if not ascending:
   2030             _as = _as[::-1]

/Users/jreback/pandas/pandas/core/indexes/base.py in argsort(self, *args, **kwargs)
   2089         if result is None:
   2090             result = np.array(self)
-> 2091         return result.argsort(*args, **kwargs)
   2092 
   2093     def __add__(self, other):

TypeError: '>' not supported between instances of 'str' and 'int'

To clean things up, I would move pandas/core/algos/safe_sort to pandas/core/sorting. It can then be used selectively where needed (in a try/except).
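For illustration, here is a rough pure-Python sketch of a sort that tolerates mixed keys. It is not pandas' safe_sort; sort_mixed and safe_sorted are hypothetical names, and placing non-strings before strings is an assumed ordering.

```python
def sort_mixed(values):
    """Order non-strings first, then strings, each group in its natural order."""
    # The tuple key compares the bool first, so a str is never compared
    # directly against an int.
    return sorted(values, key=lambda v: (isinstance(v, str), v))


def safe_sorted(values):
    """Try a plain sort first; fall back only when the types are unorderable."""
    try:
        return sorted(values)
    except TypeError:
        return sort_mixed(values)


print(safe_sorted(['group 1', 2, 'a', 1]))
```

Keeping the plain sorted() call as the first attempt leaves the homogeneous case unchanged; only mixed keys take the fallback path.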

gfyoung · 2017-07-14T15:15:06Z

@nreeve17 : How do these solutions sound to you? Did the quick-fix work for you BTW?

nreeve17 · 2017-07-14T20:18:23Z

@gfyoung yes I tried that quick fix and it worked. I ran all the tests on pandas with the modification to groupby.py and nothing seems to be broken. Should I add a test case and submit a pull request?

gfyoung · 2017-07-14T21:30:03Z

@nreeve17 : Awesome that that worked! However, a more robust solution (and one that will be more likely to be merged) is to do a try-except on the sorting call and then pass in result to Series. @jreback also suggested some refactoring behind the scenes above, but at the very least, implement the try-except and then submit that as a PR (with your test).
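A minimal sketch of the try-except approach, written outside pandas internals. build_result_series is a hypothetical helper, not a pandas function, and falling back to the keys' existing order is an assumption about what the fix would do.

```python
import pandas as pd


def build_result_series(result, name):
    """Build a Series from a dict of per-group results, sorting keys if possible."""
    try:
        keys = sorted(result)
    except TypeError:
        # Mixed key types (e.g. str and int) cannot be ordered in Python 3,
        # so keep the order in which the groups were collected.
        keys = list(result)
    index = pd.Index(keys, name=name)
    return pd.Series([result[k] for k in keys], index=index)


mixed = build_result_series({'group 1': 1.0, 2: 2.0}, 'grouping')
print(mixed)
```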

Fixes issue pandas-dev#16916, where using aggregate on a mixed type grouping vector fails. Added test in test_aggregate.py to ensure that the bug is fixed.

mroeschke · 2019-10-26T04:54:28Z

This looks to be fixed on master. Could use a test.

In [128]: X = pd.DataFrame(data=np.random.rand(7, 3), columns=list('XYZ'), index=list('zxcvbnm'))
     ...: X['grouping'] = ['group 1', 'group 1', 'group 1', 2, 2 , 2, 'group 1']
     ...: X.groupby('grouping').aggregate(lambda x: x.tolist())
Out[128]:
                                                          X  ...                                                  Z
grouping                                                     ...
2         [0.8198860820544791, 0.9156085166840109, 0.075...  ...  [0.928978831584153, 0.8276988600820108, 0.1694...
group 1   [0.2072740165365099, 0.5195836363398144, 0.038...  ...  [0.9497574283642745, 0.7137629888625677, 0.478...

[2 rows x 3 columns]

In [130]: pd.__version__
Out[130]: '0.26.0.dev0+682.g08ab156eb'
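A regression test along these lines might look like the sketch below. It is not necessarily the test that was eventually merged; the function name and data are illustrative, and the expected key order (2 before 'group 1') is taken from the output above.

```python
import numpy as np
import pandas as pd


def test_aggregate_mixed_type_grouping():
    # GH 16916: grouping keys mix str and int
    df = pd.DataFrame(np.zeros((3, 3)), columns=list('XYZ'), index=list('abc'))
    df['grouping'] = ['group 1', 'group 1', 2]

    result = df.groupby('grouping').aggregate(lambda x: x.tolist())

    assert list(result.index) == [2, 'group 1']
    assert result.loc[2, 'X'] == [0.0]
    assert result.loc['group 1', 'X'] == [0.0, 0.0]
```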

gfyoung added the Groupby label Jul 14, 2017

gfyoung added the Bug label Jul 14, 2017

jreback added this to the Next Major Release milestone Jul 14, 2017

jreback added the Difficulty Intermediate and Dtype Conversions labels Jul 14, 2017

nreeve17 mentioned this issue Jul 18, 2017

BUG: fixed issue with mixed type groupby aggregate #17003

Closed

4 tasks

gfyoung modified the milestones: 0.21.0, Next Major Release Jul 18, 2017

jreback modified the milestones: 0.21.0, Next Major Release Sep 23, 2017

jbrockmendel added the Apply label and removed the Effort Medium label Oct 16, 2019

mroeschke added the good first issue and Needs Tests labels and removed the Apply, Bug, Dtype Conversions, and Groupby labels Oct 26, 2019

mroeschke mentioned this issue Jan 21, 2020

TST: Add more regression tests for fixed issues #31171

Merged

10 tasks

jreback removed this from the Contributions Welcome milestone Jan 21, 2020

jreback added this to the 1.1 milestone Jan 21, 2020

mroeschke closed this as completed in #31171 Jan 21, 2020
