BUG: add_categories raises if the new category is included in the old #45638

douglas-raillard-arm · 2022-01-26T15:55:04Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

s = pd.Series(['a', 'b'], dtype='category')
print(s)

s.cat = s.cat.add_categories(['a'])

Issue Description

This code will raise

ValueError: new categories must not include old categories: {'a'}

Building reliable code on top of this primitive is made unnecessarily harder by this exception, as code that "seems to work" will stop working as soon as for one reason or another, the item being passed happens to already be in the category. The initial use case I had was to simply apply a fillna() on a Series for the cases where there is actually some NA values (not always):

import pandas as pd

default = 'b'
s = pd.Series(['a', None], dtype='category')
s.fillna(default)

This failed with:

TypeError: Cannot setitem on a Categorical with a new category (b), set the categories first

This is not very polymorphic-friendly as it leaks the fact that it's a categorical even though the only reason I used that is to save memory, but I can live with that.

Onto patching the category:

default = 'b'
s = pd.Series(['a', None], dtype='category')
s = s.cat.add_categories([default])
s.fillna(default)

Now the code seems to work. Only that it contains a landmine ready to blow as soon as the user would naively provide a default of 'a':

ValueError: new categories must not include old categories: {'a'}

Expected Behavior

add_categories() should be able Just Work ™ to allow building reliable libraries on top of pandas.
One might object that:

I should not blindly use fillna(). Yes, I could check s.isna().any(), but it would duplicate the NA detection which can be costly on large dataframes, as well as being unnecessary cruft.
I can work around by checking if the value is in the category. Yes I can, and I already have a module dedicated to combinators or essentially replacement of pandas functions that are not reliable and need extra care, and generally speaking functions that are dependently typed (e.g. DataFrame.groupby where the return type is T or tuple(T) depending on len(by) leading to similar bugs where an innocuous change in input can wreck havoc on the helper). The smallest this module is, the better.
I should read the documentation. Yes this behavior is documented, but this is an orthogonal concern to the other points.

On the bright side of things:

not raising an exception would probably be considered backward-compatible (not strictly speaking but it's unlikely someone really relied on that)
The change is actually trivial:
Current code:

    def add_categories(self, new_categories, inplace=no_default):
        ...
        if len(already_included) != 0:
            raise ValueError(
                f"new categories must not include old categories: {already_included}"
            )
        new_categories = list(self.dtype.categories) + list(new_categories)

New code:

    def add_categories(self, new_categories, inplace=no_default):
        ...
        old = set(self.dtype.categories)
        new_categories = list(self.dtype.categories) + list(x for x in new_categories if x not in old)

Installed Versions

Replace this line with the output of pd.show_versions() Traceback (most recent call last): File "testcat.py", line 3, in pd.show_versions() File ".venv-3.10/lib/python3.10/site-packages/pandas/util/_print_versions.py", line 109, in show_versions deps = _get_dependency_info() File ".venv-3.10/lib/python3.10/site-packages/pandas/util/_print_versions.py", line 88, in _get_dependency_info mod = import_optional_dependency(modname, errors="ignore") File ".venv-3.10/lib/python3.10/site-packages/pandas/compat/_optional.py", line 126, in import_optional_dependency module = importlib.import_module(name) File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module return _bootstrap._gcd_import(name[level:], package, level) File "", line 1050, in _gcd_import File "", line 1027, in _find_and_load File "", line 1006, in _find_and_load_unlocked File "", line 688, in _load_unlocked File "", line 883, in exec_module File "", line 241, in _call_with_frames_removed File ".venv-3.10/lib/python3.10/site-packages/setuptools/__init__.py", line 8, in import _distutils_hack.override # noqa: F401 File ".venv-3.10/lib/python3.10/site-packages/_distutils_hack/override.py", line 1, in __import__('_distutils_hack').do_override() File ".venv-3.10/lib/python3.10/site-packages/_distutils_hack/__init__.py", line 71, in do_override ensure_local_distutils() File ".venv-3.10/lib/python3.10/site-packages/_distutils_hack/__init__.py", line 59, in ensure_local_distutils assert '_distutils' in core.__file__, core.__file__ AssertionError: /usr/lib/python3.10/distutils/core.py

Pandas v1.4.0
Python 3.10

The text was updated successfully, but these errors were encountered:

mroeschke · 2022-02-10T23:20:10Z

Given that this behavior has been documented and tested for at least 2+ years (#30242), this would be an API change to not raise anymore.

mroeschke · 2024-09-04T16:56:11Z

Thanks for the report, but it appears there hasn't been much interest in changing this in years so closing

douglas-raillard-arm added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 26, 2022

mroeschke added Categorical Categorical Data Type Needs Discussion Requires discussion from core team before further action and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Feb 10, 2022

mroeschke closed this as completed Sep 4, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: add_categories raises if the new category is included in the old #45638

BUG: add_categories raises if the new category is included in the old #45638

douglas-raillard-arm commented Jan 26, 2022

mroeschke commented Feb 10, 2022

mroeschke commented Sep 4, 2024

BUG: add_categories raises if the new category is included in the old #45638

BUG: add_categories raises if the new category is included in the old #45638

Comments

douglas-raillard-arm commented Jan 26, 2022

Pandas version checks

Reproducible Example

Issue Description

Expected Behavior

Installed Versions

mroeschke commented Feb 10, 2022

mroeschke commented Sep 4, 2024