BUG: crosstab fails with categoricals on master #37465

Closed
3 tasks done
bashtage opened this issue Oct 28, 2020 · 6 comments · Fixed by #37468
Labels
Bug; Categorical (Categorical Data Type); Regression (Functionality that used to work in a prior pandas version); Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)
Milestone

Comments

@bashtage
Contributor

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


pd.crosstab no longer works with categorical inputs when margins=True: on master it raises a ValueError instead of returning a table. This is a regression from 1.1.x.

Code Sample, a copy-pastable example

import numpy as np
import pandas as pd

g = np.random.default_rng(840812492384587325982704)
a = pd.Series(g.integers(0, 3, size=100)).astype("category")
b = pd.Series(g.integers(0, 2, size=100)).astype("category")
pd.crosstab(a, b, margins=True, dropna=False)

Produces

Traceback (most recent call last):
  File "C:\git\pandas\pandas\core\indexes\base.py", line 2965, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas\_libs\index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
    cpdef get_loc(self, object val):
  File "pandas\_libs\index.pyx", line 98, in pandas._libs.index.IndexEngine.get_loc
    self._check_type(val)
  File "pandas\_libs\index_class_helper.pxi", line 93, in pandas._libs.index.Int64Engine._check_type
    raise KeyError(val)
KeyError: 'All'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "pandas\_libs\index.pyx", line 705, in pandas._libs.index.BaseMultiIndexCodesEngine.get_loc
    indices = [0 if checknull(v) else lev.get_loc(v) + 1
  File "C:\git\pandas\pandas\core\indexes\base.py", line 2963, in get_loc
    casted_key = self._maybe_cast_indexer(key)
  File "C:\git\pandas\pandas\core\indexes\category.py", line 527, in _maybe_cast_indexer
    return self._data._unbox_scalar(key)
  File "C:\git\pandas\pandas\core\arrays\categorical.py", line 1728, in _unbox_scalar
    code = self.categories.get_loc(key)
  File "C:\git\pandas\pandas\core\indexes\base.py", line 2967, in get_loc
    raise KeyError(key) from err
KeyError: 'All'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "C:\git\pandas\pandas\core\generic.py", line 3776, in _set_item
    loc = self._info_axis.get_loc(key)
  File "C:\git\pandas\pandas\core\indexes\multi.py", line 2795, in get_loc
    return self._engine.get_loc(key)
  File "pandas\_libs\index.pyx", line 708, in pandas._libs.index.BaseMultiIndexCodesEngine.get_loc
    raise KeyError(key)
KeyError: ('__dummy__', 'All')
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "C:\Anaconda\lib\site-packages\IPython\core\interactiveshell.py", line 3417, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-3-ba4a70b3d31b>", line 7, in <module>
    pd.crosstab(a,b,margins=True,dropna=False)
  File "C:\git\pandas\pandas\core\reshape\pivot.py", line 623, in crosstab
    **kwargs,
  File "C:\git\pandas\pandas\core\frame.py", line 6972, in pivot_table
    observed=observed,
  File "C:\git\pandas\pandas\core\reshape\pivot.py", line 180, in pivot_table
    fill_value=fill_value,
  File "C:\git\pandas\pandas\core\reshape\pivot.py", line 242, in _add_margins
    table, data, values, rows, cols, aggfunc, observed, margins_name
  File "C:\git\pandas\pandas\core\reshape\pivot.py", line 331, in _generate_marginal_results
    piece[all_key] = margin[key]
  File "C:\git\pandas\pandas\core\frame.py", line 3083, in __setitem__
    self._set_item(key, value)
  File "C:\git\pandas\pandas\core\frame.py", line 3160, in _set_item
    NDFrame._set_item(self, key, value)
  File "C:\git\pandas\pandas\core\generic.py", line 3779, in _set_item
    self._mgr.insert(len(self._info_axis), key, value)
  File "C:\git\pandas\pandas\core\internals\managers.py", line 1173, in insert
    new_axis = self.items.insert(loc, item)
  File "C:\git\pandas\pandas\core\indexes\multi.py", line 3667, in insert
    level = level.insert(lev_loc, k)
  File "C:\git\pandas\pandas\core\indexes\category.py", line 695, in insert
    code = self._data._validate_insert_value(item)
  File "C:\git\pandas\pandas\core\arrays\categorical.py", line 1186, in _validate_insert_value
    return self._validate_fill_value(value)
  File "C:\git\pandas\pandas\core\arrays\categorical.py", line 1222, in _validate_fill_value
    f"'fill_value={fill_value}' is not present "
ValueError: 'fill_value=All' is not present in this Categorical's categories

Problem description

The same call previously produced the correct output, so this is a simple regression.
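A minimal sketch of how the regression could be checked, assuming a pandas build where the call succeeds (1.1.x, or one that includes the fix). It does not assert specific counts; it only checks that the All margins agree with the body of the table. The seed and the helper name check_margins are illustrative assumptions and are not part of the original report.

```python
import numpy as np
import pandas as pd


def check_margins(table: pd.DataFrame, n_obs: int) -> bool:
    """Return True if the 'All' row and column agree with the table body."""
    # Body of the table without the margin row and column.
    body = table.drop(index="All", columns="All")
    # Compare raw values to avoid any dependence on index alignment.
    rows_ok = np.array_equal(
        body.sum(axis=1).to_numpy(), table.loc[body.index, "All"].to_numpy()
    )
    cols_ok = np.array_equal(
        body.sum(axis=0).to_numpy(), table.loc["All", body.columns].to_numpy()
    )
    total_ok = table.loc["All", "All"] == n_obs
    return bool(rows_ok and cols_ok and total_ok)


g = np.random.default_rng(0)  # illustrative seed, not the one from the report
a = pd.Series(g.integers(0, 3, size=100)).astype("category")
b = pd.Series(g.integers(0, 2, size=100)).astype("category")

# On an affected build this line raises ValueError instead of returning.
table = pd.crosstab(a, b, margins=True, dropna=False)
margins_ok = check_margins(table, n_obs=100)
```

On an affected build the crosstab call itself raises, so the check never runs; on a working build margins_ok should be True.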

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

commit : d89331b
python : 3.7.9.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None
pandas : 1.2.0.dev0+948.gd89331b96
numpy : 1.19.2
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.4
setuptools : 50.3.0.post20201006
Cython : 0.29.21
pytest : 6.1.1
hypothesis : None
sphinx : 3.2.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.4.0
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.18.1
pandas_datareader: v0.10.0dev0+30.gadb67b2
bs4 : 4.8.0
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.3.2
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.5
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.5.2
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
numba : 0.51.2

@bashtage bashtage added the Bug, Regression, Categorical, and Reshaping labels Oct 28, 2020
bashtage added a commit to bashtage/pandas that referenced this issue Oct 28, 2020
Change catch types to reflect error changes

closes pandas-dev#37465
@jbrockmendel
Member

The exception message makes me think this is likely due to one of my recent PRs. I'll take a look.

@bashtage
Contributor Author

bashtage commented Oct 28, 2020

I suspect the wrong exception might be being raised. That said, ValueError does seem right when looking up a value that is missing from a Categorical index.

@jreback jreback modified the milestones: 1.1.5, 1.1.4 Oct 29, 2020
bashtage added a commit to bashtage/pandas that referenced this issue Oct 29, 2020
Change catch types to reflect error changes

closes pandas-dev#37465
@simonjayhawkins
Member

This works on 1.1.3. Will bisect shortly.

@jreback jreback modified the milestones: 1.1.4, 1.2 Oct 30, 2020
simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Oct 30, 2020
bashtage added a commit to bashtage/pandas that referenced this issue Oct 30, 2020
Change catch types to reflect error changes

closes pandas-dev#37465
@simonjayhawkins
Member

First bad commit: [e2785f7] REF: de-duplicate Categorical validators (#36558)

@bashtage
Contributor Author

I checked the 1.1.x branch and this has not been backported to it.

@bashtage
Contributor Author

It is pretty obvious that the error type was changed, so code that catches the old type no longer handles it.
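The failure mode can be illustrated without pandas. The sketch below is a toy model, not the pandas source: validate_old, validate_new, and insert_with_fallback are hypothetical names, and using TypeError for the old behaviour is an assumption (the thread only confirms that the new error is a ValueError). It shows a caller whose except clause matches the type a validator used to raise, how the fallback stops firing once the validator raises a different type, and how widening the clause restores it.

```python
def validate_old(categories, value):
    """Assumed pre-refactor behaviour: unknown labels raise TypeError."""
    if value not in categories:
        raise TypeError(f"{value!r} is not an existing category")


def validate_new(categories, value):
    """Behaviour matching the traceback: unknown labels raise ValueError."""
    if value not in categories:
        raise ValueError(f"'fill_value={value}' is not present in the categories")


def insert_with_fallback(categories, value, validate, catch):
    """Append `value`; if validation fails with `catch`, use the fallback path."""
    try:
        validate(categories, value)
        return list(categories) + [value], "strict"
    except catch:
        # Fallback: treat the labels as plain objects so any value can be added.
        return list(categories) + [value], "fallback"


cats = [0, 1]

# Validator and caller agree on TypeError: the fallback runs.
old_result = insert_with_fallback(cats, "All", validate_old, TypeError)

# Validator now raises ValueError but the caller still catches TypeError:
# the exception escapes, as in the traceback above.
try:
    insert_with_fallback(cats, "All", validate_new, TypeError)
    escaped = None
except ValueError as err:
    escaped = err

# Catching both types restores the fallback.
new_result = insert_with_fallback(cats, "All", validate_new, (TypeError, ValueError))
```

Catching both types, rather than swapping one for the other, keeps the caller working against either validator, which may matter if the change is backported selectively.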

bashtage added a commit to bashtage/pandas that referenced this issue Oct 30, 2020
Change catch types to reflect error changes

closes pandas-dev#37465
bashtage added a commit to bashtage/pandas that referenced this issue Oct 31, 2020
Change catch types to reflect error changes

closes pandas-dev#37465
4 participants