BUG: add_categories raises if the new category is included in the old #45638
Labels
Categorical
Categorical Data Type
Needs Discussion
Requires discussion from core team before further action
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
This code will raise
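The original snippet is not preserved in this copy of the issue. A minimal sketch of the behavior described in the title, assuming a small categorical Series (the data and variable names are illustrative, not the reporter's), could look like this:

```python
import pandas as pd

# Illustrative data; not taken from the original report.
s = pd.Series(["a", "b", None], dtype="category")

# Adding a category that does not exist yet works fine.
extended = s.cat.add_categories(["c"])

# Adding a category that is already present raises ValueError.
error = None
try:
    s.cat.add_categories(["a"])
except ValueError as exc:
    error = exc
    print(f"ValueError: {exc}")
```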
This exception makes it unnecessarily hard to build reliable code on top of this primitive: code that "seems to work" will stop working as soon as, for one reason or another, the item being passed happens to already be among the categories. My initial use case was simply to apply fillna() on a Series in the cases where there are actually some NA values (not always):
This failed with:
This is not very polymorphic-friendly as it leaks the fact that it's a categorical even though the only reason I used that is to save memory, but I can live with that.
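The failing call and its traceback are missing from this copy. As a rough sketch under assumed data, filling a categorical Series with a value outside its categories fails, while the same call on an object-dtype Series succeeds. The exact exception type has varied between pandas versions, so the sketch catches both candidates:

```python
import pandas as pd

# Illustrative data; not taken from the original report.
values = ["a", None, "b"]
categorical = pd.Series(values, dtype="category")
plain = pd.Series(values, dtype="object")

# The object-dtype Series accepts any fill value.
plain_filled = plain.fillna("c")

# The categorical Series rejects a fill value that is not a known category.
failure = None
try:
    categorical.fillna("c")
except (TypeError, ValueError) as exc:
    failure = exc
    print(f"{type(exc).__name__}: {exc}")
```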
Onto patching the category:
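The patched code is also missing here. One plausible shape for it, using a hypothetical helper name `fillna_naive` and assumed data, is to register the default as a category unconditionally before filling:

```python
import pandas as pd


def fillna_naive(s: pd.Series, default):
    """Hypothetical helper: always add `default` as a category, then fill."""
    return s.cat.add_categories([default]).fillna(default)


# Illustrative data; not taken from the original report.
s = pd.Series(["a", None, "b"], dtype="category")
filled = fillna_naive(s, "c")
print(filled.tolist())
```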
Now the code seems to work, except that it contains a landmine ready to blow as soon as the user naively provides a default of `'a'`:
Expected Behavior
`add_categories()` should Just Work™ to allow building reliable libraries on top of pandas.
One might object that:
- I should not blindly use `fillna()`. Yes, I could check `s.isna().any()`, but it would duplicate the NA detection, which can be costly on large dataframes, as well as being unnecessary cruft.
- I can work around it by checking whether the value is in the category. Yes I can, and I already have a module dedicated to combinators, or essentially replacements for pandas functions that are not reliable and need extra care, and generally speaking functions that are dependently typed (e.g. `DataFrame.groupby`, where the return type is `T` or `tuple(T)` depending on `len(by)`, leading to similar bugs where an innocuous change in input can wreak havoc on the helper). The smaller this module is, the better.
- I should read the documentation. Yes, this behavior is documented, but this is an orthogonal concern to the other points.
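For reference, the membership-check workaround mentioned in the second objection might look like the sketch below. The helper name `fillna_categorical` and the data are assumptions made for illustration; this is the kind of wrapper the issue argues should not be necessary.

```python
import pandas as pd


def fillna_categorical(s: pd.Series, default):
    """Hypothetical helper: add `default` as a category only if it is missing."""
    if default not in s.cat.categories:
        s = s.cat.add_categories([default])
    return s.fillna(default)


# Illustrative data; not taken from the original report.
s = pd.Series(["a", None, "b"], dtype="category")

# Works whether or not the default is already a category.
with_existing = fillna_categorical(s, "a")
with_new = fillna_categorical(s, "c")
print(with_existing.tolist(), with_new.tolist())
```

The check on `s.cat.categories` only inspects the category index, so it avoids a second pass over the data for NA detection.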
On the bright side of things:
Current code:
New code:
Installed Versions
Pandas v1.4.0
Python 3.10