Skip to content

REGR: groupby().agg fails on categorical column in pandas 1.0.0 #31450

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
MarekOzana opened this issue Jan 30, 2020 · 3 comments · Fixed by #31668
Closed

REGR: groupby().agg fails on categorical column in pandas 1.0.0 #31450

MarekOzana opened this issue Jan 30, 2020 · 3 comments · Fixed by #31668
Labels
Categorical Categorical Data Type Groupby Regression Functionality that used to work in a prior pandas version
Milestone

Comments

@MarekOzana
Copy link

Code Sample, a copy-pastable example if possible

import pandas as pd
tbl = pd.DataFrame({"col_num": [1, 1, 2, 3]})
tbl["col_cat"] = tbl["col_num"].astype("category")

# The following line works in 0.25.3 but throws exception in 1.0.0
df = tbl.groupby("col_num").agg({"col_cat": "first"})

Problem description

agg() on categorical column works without any warnings in 0.25.3 but throws exception in pandas 1.0.0:

NotImplementedError Traceback (most recent call last) in ----> 1 df = tbl.groupby("col_num").agg({"col_cat": "first"})

~\AppData\Local\Continuum\miniconda3\envs\dev37\lib\site-packages\pandas\core\groupby\generic.py in aggregate(self, func, *args, **kwargs)
938 func = _maybe_mangle_lambdas(func)
939
--> 940 result, how = self._aggregate(func, *args, **kwargs)
941 if how is None:
942 return result

~\AppData\Local\Continuum\miniconda3\envs\dev37\lib\site-packages\pandas\core\base.py in _aggregate(self, arg, *args, **kwargs)
426
427 try:
--> 428 result = _agg(arg, _agg_1dim)
429 except SpecificationError:
430

~\AppData\Local\Continuum\miniconda3\envs\dev37\lib\site-packages\pandas\core\base.py in _agg(arg, func)
393 result = {}
394 for fname, agg_how in arg.items():
--> 395 result[fname] = func(fname, agg_how)
396 return result
397

~\AppData\Local\Continuum\miniconda3\envs\dev37\lib\site-packages\pandas\core\base.py in _agg_1dim(name, how, subset)
377 "nested dictionary is ambiguous in aggregation"
378 )
--> 379 return colg.aggregate(how)
380
381 def _agg_2dim(name, how):

~\AppData\Local\Continuum\miniconda3\envs\dev37\lib\site-packages\pandas\core\groupby\generic.py in aggregate(self, func, *args, **kwargs)
245
246 if isinstance(func, str):
--> 247 return getattr(self, func)(*args, **kwargs)
248
249 elif isinstance(func, abc.Iterable):

~\AppData\Local\Continuum\miniconda3\envs\dev37\lib\site-packages\pandas\core\groupby\groupby.py in f(self, **kwargs)
1376 # try a cython aggregation if we can
1377 try:
-> 1378 return self._cython_agg_general(alias, alt=npfunc, **kwargs)
1379 except DataError:
1380 pass

~\AppData\Local\Continuum\miniconda3\envs\dev37\lib\site-packages\pandas\core\groupby\groupby.py in _cython_agg_general(self, how, alt, numeric_only, min_count)
888
889 result, agg_names = self.grouper.aggregate(
--> 890 obj._values, how, min_count=min_count
891 )
892

~\AppData\Local\Continuum\miniconda3\envs\dev37\lib\site-packages\pandas\core\groupby\ops.py in aggregate(self, values, how, axis, min_count)
579 ) -> Tuple[np.ndarray, Optional[List[str]]]:
580 return self._cython_operation(
--> 581 "aggregate", values, how, axis, min_count=min_count
582 )
583

~\AppData\Local\Continuum\miniconda3\envs\dev37\lib\site-packages\pandas\core\groupby\ops.py in _cython_operation(self, kind, values, how, axis, min_count, **kwargs)
453 # are not setup for dim transforming
454 if is_categorical_dtype(values) or is_sparse(values):
--> 455 raise NotImplementedError(f"{values.dtype} dtype not supported")
456 elif is_datetime64_any_dtype(values):
457 if how in ["add", "prod", "cumsum", "cumprod"]:

NotImplementedError: category dtype not supported

#### Expected Output no error, just groupped dataframe.

Output of pd.show_versions()

[paste the output of pd.show_versions() here below this line]

INSTALLED VERSIONS

commit : None
python : 3.7.6.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 1.0.0
numpy : 1.17.5
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.1.0.post20200119
Cython : None
pytest : 5.3.5
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 1.2.7
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.0
IPython : 7.11.1
pandas_datareader: None
bs4 : 4.8.2
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.2
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 5.3.5
pyxlsb : None
s3fs : None
scipy : 1.3.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : 1.2.7
numba : None

@jorisvandenbossche jorisvandenbossche added Categorical Categorical Data Type Groupby Regression Functionality that used to work in a prior pandas version labels Jan 30, 2020
@jorisvandenbossche jorisvandenbossche added this to the 1.0.1 milestone Jan 30, 2020
@jorisvandenbossche
Copy link
Member

@MarekOzana Thanks for the report!

I can confirm this is a regression. It only seems to happen when performing the "first" reduction on a specific column, when doing it in general on the dataframe without selecting the column, it still works:

In [4]: tbl.groupby("col_num").agg({"col_cat": "first"})  
...
~/scipy/pandas/pandas/core/groupby/ops.py in aggregate(self, values, how, axis, min_count)
    579     ) -> Tuple[np.ndarray, Optional[List[str]]]:
    580         return self._cython_operation(
--> 581             "aggregate", values, how, axis, min_count=min_count
    582         )
    583 

~/scipy/pandas/pandas/core/groupby/ops.py in _cython_operation(self, kind, values, how, axis, min_count, **kwargs)
    453         # are not setup for dim transforming
    454         if is_categorical_dtype(values) or is_sparse(values):
--> 455             raise NotImplementedError(f"{values.dtype} dtype not supported")
    456         elif is_datetime64_any_dtype(values):
    457             if how in ["add", "prod", "cumsum", "cumprod"]:

NotImplementedError: category dtype not supported

In [5]: tbl.groupby("col_num").agg("first") 
Out[5]: 
        col_cat
col_num        
1             1
2             2
3             3

@jorisvandenbossche jorisvandenbossche changed the title groupby().agg fails on categorical column in pandas 1.0.0 REGR: groupby().agg fails on categorical column in pandas 1.0.0 Jan 30, 2020
@MarekOzana
Copy link
Author

Thanks @jorisvandenbossche for your fast response!
I really need the fix for the general case, I am trying to migrate code with large agg dictionary, where each column uses different aggregation function. I can modify test case as follows:
df = tbl.groupby("col_num").agg({"col_cat": "first", "col_num": "count"})
Thank you for your time.

@jorisvandenbossche
Copy link
Member

Yeah, I certainly didn't want to imply that since the other way is working, that the regression is no problem ;) It might just help with pinning down the problem for diagnosing the issue.

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Feb 4, 2020
Fixed by falling back to the wrapped Python version when the cython
version raises NotImplementedError.

Closes pandas-dev#31450
jreback pushed a commit that referenced this issue Feb 5, 2020
Fixed by falling back to the wrapped Python version when the cython
version raises NotImplementedError.

Closes #31450
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type Groupby Regression Functionality that used to work in a prior pandas version
Projects
None yet
3 participants