Skip to content

BUG: add_categories raises if the new category is included in the old #45638

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
3 tasks done
douglas-raillard-arm opened this issue Jan 26, 2022 · 2 comments
Closed
3 tasks done
Labels
Categorical Categorical Data Type Needs Discussion Requires discussion from core team before further action

Comments

@douglas-raillard-arm
Copy link

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

s = pd.Series(['a', 'b'], dtype='category')
print(s)

s.cat = s.cat.add_categories(['a'])

Issue Description

This code will raise

ValueError: new categories must not include old categories: {'a'}

Building reliable code on top of this primitive is made unnecessarily harder by this exception, as code that "seems to work" will stop working as soon as for one reason or another, the item being passed happens to already be in the category. The initial use case I had was to simply apply a fillna() on a Series for the cases where there is actually some NA values (not always):

import pandas as pd

default = 'b'
s = pd.Series(['a', None], dtype='category')
s.fillna(default)

This failed with:

TypeError: Cannot setitem on a Categorical with a new category (b), set the categories first

This is not very polymorphic-friendly as it leaks the fact that it's a categorical even though the only reason I used that is to save memory, but I can live with that.

Onto patching the category:

default = 'b'
s = pd.Series(['a', None], dtype='category')
s = s.cat.add_categories([default])
s.fillna(default)

Now the code seems to work. Only that it contains a landmine ready to blow as soon as the user would naively provide a default of 'a':

ValueError: new categories must not include old categories: {'a'}

Expected Behavior

add_categories() should be able Just Work ™ to allow building reliable libraries on top of pandas.
One might object that:

  • I should not blindly use fillna(). Yes, I could check s.isna().any(), but it would duplicate the NA detection which can be costly on large dataframes, as well as being unnecessary cruft.

  • I can work around by checking if the value is in the category. Yes I can, and I already have a module dedicated to combinators or essentially replacement of pandas functions that are not reliable and need extra care, and generally speaking functions that are dependently typed (e.g. DataFrame.groupby where the return type is T or tuple(T) depending on len(by) leading to similar bugs where an innocuous change in input can wreck havoc on the helper). The smallest this module is, the better.

  • I should read the documentation. Yes this behavior is documented, but this is an orthogonal concern to the other points.

On the bright side of things:

  • not raising an exception would probably be considered backward-compatible (not strictly speaking but it's unlikely someone really relied on that)
  • The change is actually trivial:
    Current code:
    def add_categories(self, new_categories, inplace=no_default):
        ...
        if len(already_included) != 0:
            raise ValueError(
                f"new categories must not include old categories: {already_included}"
            )
        new_categories = list(self.dtype.categories) + list(new_categories)

New code:

    def add_categories(self, new_categories, inplace=no_default):
        ...
        old = set(self.dtype.categories)
        new_categories = list(self.dtype.categories) + list(x for x in new_categories if x not in old)

Installed Versions

Replace this line with the output of pd.show_versions() Traceback (most recent call last): File "testcat.py", line 3, in pd.show_versions() File ".venv-3.10/lib/python3.10/site-packages/pandas/util/_print_versions.py", line 109, in show_versions deps = _get_dependency_info() File ".venv-3.10/lib/python3.10/site-packages/pandas/util/_print_versions.py", line 88, in _get_dependency_info mod = import_optional_dependency(modname, errors="ignore") File ".venv-3.10/lib/python3.10/site-packages/pandas/compat/_optional.py", line 126, in import_optional_dependency module = importlib.import_module(name) File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module return _bootstrap._gcd_import(name[level:], package, level) File "", line 1050, in _gcd_import File "", line 1027, in _find_and_load File "", line 1006, in _find_and_load_unlocked File "", line 688, in _load_unlocked File "", line 883, in exec_module File "", line 241, in _call_with_frames_removed File ".venv-3.10/lib/python3.10/site-packages/setuptools/__init__.py", line 8, in import _distutils_hack.override # noqa: F401 File ".venv-3.10/lib/python3.10/site-packages/_distutils_hack/override.py", line 1, in __import__('_distutils_hack').do_override() File ".venv-3.10/lib/python3.10/site-packages/_distutils_hack/__init__.py", line 71, in do_override ensure_local_distutils() File ".venv-3.10/lib/python3.10/site-packages/_distutils_hack/__init__.py", line 59, in ensure_local_distutils assert '_distutils' in core.__file__, core.__file__ AssertionError: /usr/lib/python3.10/distutils/core.py

Pandas v1.4.0
Python 3.10

@douglas-raillard-arm douglas-raillard-arm added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 26, 2022
@mroeschke
Copy link
Member

Given that this behavior has been documented and tested for at least 2+ years (#30242), this would be an API change to not raise anymore.

@mroeschke mroeschke added Categorical Categorical Data Type Needs Discussion Requires discussion from core team before further action and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Feb 10, 2022
@mroeschke
Copy link
Member

Thanks for the report, but it appears there hasn't been much interest in changing this in years so closing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type Needs Discussion Requires discussion from core team before further action
Projects
None yet
Development

No branches or pull requests

2 participants