Aggregate fails with mixed types in grouping series #16916

Closed
nreeve17 opened this issue Jul 13, 2017 · 6 comments · Fixed by #31171
Labels: good first issue, Needs Tests (unit test(s) needed to prevent regressions)


@nreeve17

nreeve17 commented Jul 13, 2017

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

X = pd.DataFrame(data=np.random.rand(7, 3), columns=list('XYZ'), index=list('zxcvbnm'))
X['grouping'] = ['group 1', 'group 1', 'group 1', 2, 2 , 2, 'group 1']
X.groupby('grouping').aggregate(lambda x: x.tolist())

This is the exception and traceback that the code above returns:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/Users/nicolaireeve/miniconda2/envs/skbiodev/lib/python3.4/site-packages/pandas/core/groupby.py in aggregate(self, arg, *args, **kwargs)
   3482                     result = self._aggregate_multiple_funcs(
-> 3483                         [arg], _level=_level, _axis=self.axis)
   3484                     result.columns = Index(

/Users/nicolaireeve/miniconda2/envs/skbiodev/lib/python3.4/site-packages/pandas/core/base.py in _aggregate_multiple_funcs(self, arg, _level, _axis)
    690         if not len(results):
--> 691             raise ValueError("no results")
    692 

ValueError: no results

During handling of the above exception, another exception occurred:

AttributeError                            Traceback (most recent call last)
/Users/nicolaireeve/miniconda2/envs/skbiodev/lib/python3.4/site-packages/pandas/core/groupby.py in _aggregate_generic(self, func, *args, **kwargs)
   3508                 for name, data in self:
-> 3509                     result[name] = self._try_cast(func(data, *args, **kwargs),
   3510                                                   data)

<ipython-input-25-18b24604e98f> in <lambda>(x)
      2 X['grouping'] = ['group 1', 'group 1', 'group 1', 2, 2 , 2, 'group 1']
----> 3 X.groupby('grouping').aggregate(lambda x: x.tolist())

/Users/nicolaireeve/miniconda2/envs/skbiodev/lib/python3.4/site-packages/pandas/core/generic.py in __getattr__(self, name)
   3080                 return self[name]
-> 3081             return object.__getattribute__(self, name)
   3082 

AttributeError: 'DataFrame' object has no attribute 'tolist'

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
<ipython-input-25-18b24604e98f> in <module>()
      1 X = pd.DataFrame(data=np.random.rand(7, 3), columns=list('XYZ'), index=list('zxcvbnm'))
      2 X['grouping'] = ['group 1', 'group 1', 'group 1', 2, 2 , 2, 'group 1']
----> 3 X.groupby('grouping').aggregate(lambda x: x.tolist())

/Users/nicolaireeve/miniconda2/envs/skbiodev/lib/python3.4/site-packages/pandas/core/groupby.py in aggregate(self, arg, *args, **kwargs)
   4034         versionadded=''))
   4035     def aggregate(self, arg, *args, **kwargs):
-> 4036         return super(DataFrameGroupBy, self).aggregate(arg, *args, **kwargs)
   4037 
   4038     agg = aggregate

/Users/nicolaireeve/miniconda2/envs/skbiodev/lib/python3.4/site-packages/pandas/core/groupby.py in aggregate(self, arg, *args, **kwargs)
   3486                         name=self._selected_obj.columns.name)
   3487                 except:
-> 3488                     result = self._aggregate_generic(arg, *args, **kwargs)
   3489 
   3490         if not self.as_index:

/Users/nicolaireeve/miniconda2/envs/skbiodev/lib/python3.4/site-packages/pandas/core/groupby.py in _aggregate_generic(self, func, *args, **kwargs)
   3510                                                   data)
   3511             except Exception:
-> 3512                 return self._aggregate_item_by_item(func, *args, **kwargs)
   3513         else:
   3514             for name in self.indices:

/Users/nicolaireeve/miniconda2/envs/skbiodev/lib/python3.4/site-packages/pandas/core/groupby.py in _aggregate_item_by_item(self, func, *args, **kwargs)
   3554             # GH6337
   3555             if not len(result_columns) and errors is not None:
-> 3556                 raise errors
   3557 
   3558         return DataFrame(result, columns=result_columns)

/Users/nicolaireeve/miniconda2/envs/skbiodev/lib/python3.4/site-packages/pandas/core/groupby.py in _aggregate_item_by_item(self, func, *args, **kwargs)
   3539                                      grouper=self.grouper)
   3540                 result[item] = self._try_cast(
-> 3541                     colg.aggregate(func, *args, **kwargs), data)
   3542             except ValueError:
   3543                 cannot_agg.append(item)

/Users/nicolaireeve/miniconda2/envs/skbiodev/lib/python3.4/site-packages/pandas/core/groupby.py in aggregate(self, func_or_funcs, *args, **kwargs)
   2885                 result = self._aggregate_named(func_or_funcs, *args, **kwargs)
   2886 
-> 2887             index = Index(sorted(result), name=self.grouper.names[0])
   2888             ret = Series(result, index=index)
   2889 

TypeError: unorderable types: str() < int()

Problem description

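If a grouping vector is of mixed type and aggregate is used after groupby(...), an exception is raised. Execution reaches the index = Index(sorted(result), name=self.grouper.names[0]) line in groupby.py (line 2887 in the traceback above) and fails there because sorted() cannot order values of mixed types such as str and int.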
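
The root cause can be reproduced without pandas. The snippet below is only a minimal sketch of the behaviour described above; try_sorted is a hypothetical helper written for this illustration and is not part of pandas.

```python
def try_sorted(values):
    """Return (sorted_list, None) on success or (None, error_message) on failure.

    Hypothetical helper for illustration only; not a pandas function.
    """
    try:
        return sorted(values), None
    except TypeError as exc:
        # Python 3 refuses to order str against int.
        return None, str(exc)


# Keys of a single type sort without trouble.
same_type, same_type_error = try_sorted(["group 1", "2"])

# Mixed str/int keys, as in the grouping column above, cannot be ordered.
mixed, mixed_error = try_sorted(["group 1", 2])

print(same_type, same_type_error)
print(mixed, mixed_error)
```

The exact wording of the error message depends on the Python version, which is why the traceback above and the one in the simpler example further down the thread read differently.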

Expected Output

This is what we would expect to see if the exception were not raised. This output was obtained by grouping on a column that holds a single type. In this instance, 2 was changed to the string '2'.

X = pd.DataFrame(data=np.random.rand(7, 3), columns=list('XYZ'), index=list('zxcvbnm'))
X['grouping'] = ['group 1', 'group 1', 'group 1', '2', '2' , '2', 'group 1']
X.groupby('grouping').aggregate(lambda x: x.tolist())

                                                          X  \
grouping                                                      
2         [0.9219120799240533, 0.6439069401684864, 0.035...   
group 1   [0.6884732212797477, 0.326906484996646, 0.6718...   

                                                          Y  \
grouping                                                      
2         [0.7796923828539405, 0.7668459596180287, 0.868...   
group 1   [0.20259205506065203, 0.9138593138141587, 0.95...   

                                                          Z  
grouping                                                     
2         [0.9863526134877422, 0.6342347501171951, 0.873...  
group 1   [0.054465751087565906, 0.9026560581041934, 0.9...  

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.4.5.final.0
python-bits: 64
OS: Darwin
OS-release: 16.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 35.0.2
Cython: None
numpy: 1.13.1
scipy: 0.19.0
xarray: None
IPython: 6.0.0
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
pandas_gbq: None
pandas_datareader: None

cc @ElDeveloper

@gfyoung
Member

gfyoung commented Jul 14, 2017

@nreeve17 : I presume that if you change that offending line to:

index = Index(result, name=self.grouper.names[0])

your code example would work?

@jreback : Could we just do a try-except around sorted(...) and then instantiate the Series separately?

@gfyoung gfyoung added the Bug label Jul 14, 2017
@jreback
Contributor

jreback commented Jul 14, 2017

simpler example

In [20]: Index([0, '1']).sort_values()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-20-c2949e5a9d7f> in <module>()
----> 1 Index([0, '1']).sort_values()

/Users/jreback/pandas/pandas/core/indexes/base.py in sort_values(self, return_indexer, ascending)
   2026         Return sorted copy of Index
   2027         """
-> 2028         _as = self.argsort()
   2029         if not ascending:
   2030             _as = _as[::-1]

/Users/jreback/pandas/pandas/core/indexes/base.py in argsort(self, *args, **kwargs)
   2089         if result is None:
   2090             result = np.array(self)
-> 2091         return result.argsort(*args, **kwargs)
   2092 
   2093     def __add__(self, other):

TypeError: '>' not supported between instances of 'str' and 'int'

So, to clean things up: I would move pandas/core/algos/safe_sort to pandas/core/sorting (just to tidy up a bit). It can then be used selectively where needed (in a try/except).
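
As a rough illustration of what a sort that tolerates mixed types could look like, here is a minimal pure-Python sketch. It is not the safe_sort helper mentioned above and makes no claim about how that helper is implemented; mixed_sort is a hypothetical name, and placing non-strings before strings is an assumption made for this example.

```python
def mixed_sort(values):
    """Sort values, falling back to a per-type ordering when they are not comparable.

    Hypothetical helper for illustration; not pandas' safe_sort.
    Assumption: on failure, non-string values are placed before strings.
    """
    values = list(values)
    try:
        # Fast path: everything is mutually comparable.
        return sorted(values)
    except TypeError:
        # Fallback: sort strings and non-strings separately, then concatenate.
        strings = sorted(v for v in values if isinstance(v, str))
        others = sorted(v for v in values if not isinstance(v, str))
        return others + strings


print(mixed_sort(["group 1", 2, "a", 1]))
print(mixed_sort([3, 1, 2]))
```

This sketch still raises if the non-string values are themselves unorderable (for example an int next to None), so a caller would need its own try/except around it, as suggested above.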

@jreback jreback added this to the Next Major Release milestone Jul 14, 2017
@jreback jreback added Difficulty Intermediate Dtype Conversions Unexpected or buggy dtype conversions labels Jul 14, 2017
@gfyoung
Member

gfyoung commented Jul 14, 2017

@nreeve17 : How do these solutions sound to you? Did the quick-fix work for you BTW?

@nreeve17
Author

@gfyoung yes I tried that quick fix and it worked. I ran all the tests on pandas with the modification to groupby.py and nothing seems to be broken. Should I add a test case and submit a pull request?

@gfyoung
Member

gfyoung commented Jul 14, 2017

@nreeve17 : Awesome that that worked! However, a more robust solution (and one that will be more likely to be merged) is to do a try-except on the sorting call and then pass in result to Series. @jreback also suggested some refactoring behind the scenes above, but at the very least, implement the try-except and then submit that as a PR (with your test).
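
The try-except idea can be sketched in plain Python, with a dict standing in for the result mapping built inside aggregate. This is only an illustration under that assumption, not the patch that was eventually merged; ordered_group_keys is a hypothetical name.

```python
def ordered_group_keys(result):
    """Return the group keys sorted when possible, else in insertion order.

    Hypothetical helper for illustration; `result` stands in for the
    {group_key: aggregated_value} mapping built during aggregation.
    """
    try:
        return sorted(result)
    except TypeError:
        # Mixed-type keys cannot be ordered; keep the order they were seen in.
        return list(result)


# Keys of a single type are sorted as before.
print(ordered_group_keys({"b": [1.0], "a": [2.0]}))

# Mixed str/int keys fall back to the original order instead of raising.
print(ordered_group_keys({"group 1": [0.1, 0.2], 2: [0.3]}))
```

The trade-off is that the output order becomes data dependent when the keys are mixed, which is one reason a dedicated mixed-type sort was suggested as the cleaner long-term fix.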

nreeve17 pushed a commit to nreeve17/pandas that referenced this issue Jul 18, 2017
Fixes issue pandas-dev#16916, where using aggregate on a mixed type grouping
vector fails. Added test in test_aggregate.py to ensure that the bug
is fixed.
@gfyoung gfyoung modified the milestones: 0.21.0, Next Major Release Jul 18, 2017
@jreback jreback modified the milestones: 0.21.0, Next Major Release Sep 23, 2017
@jbrockmendel jbrockmendel added Apply Apply, Aggregate, Transform, Map and removed Effort Medium labels Oct 16, 2019
@mroeschke
Member

This looks to be fixed on master. Could use a test.

In [128]: X = pd.DataFrame(data=np.random.rand(7, 3), columns=list('XYZ'), index=list('zxcvbnm'))
     ...: X['grouping'] = ['group 1', 'group 1', 'group 1', 2, 2 , 2, 'group 1']
     ...: X.groupby('grouping').aggregate(lambda x: x.tolist())
Out[128]:
                                                          X  ...                                                  Z
grouping                                                     ...
2         [0.8198860820544791, 0.9156085166840109, 0.075...  ...  [0.928978831584153, 0.8276988600820108, 0.1694...
group 1   [0.2072740165365099, 0.5195836363398144, 0.038...  ...  [0.9497574283642745, 0.7137629888625677, 0.478...

[2 rows x 3 columns]

In [130]: pd.__version__
Out[130]: '0.26.0.dev0+682.g08ab156eb'

@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Apply Apply, Aggregate, Transform, Map Bug Dtype Conversions Unexpected or buggy dtype conversions Groupby labels Oct 26, 2019
@jreback jreback removed this from the Contributions Welcome milestone Jan 21, 2020
@jreback jreback added this to the 1.1 milestone Jan 21, 2020