FIX: REGR: setting numeric value in Categorical Series with enlargement raises internal error #47751


Closed · wants to merge 12 commits
doc/source/whatsnew/v1.5.0.rst (2 changes: 1 addition & 1 deletion)
@@ -908,7 +908,7 @@ Indexing
- Bug in :meth:`NDFrame.xs`, :meth:`DataFrame.iterrows`, :meth:`DataFrame.loc` and :meth:`DataFrame.iloc` not always propagating metadata (:issue:`28283`)
- Bug in :meth:`DataFrame.sum` min_count changes dtype if input contains NaNs (:issue:`46947`)
- Bug in :class:`IntervalTree` that lead to an infinite recursion. (:issue:`46658`)
-
- Bug in :meth:`DataFrame.loc` when creating a new element on a :class:`Series` with dtype :class:`CategoricalDtype` (:issue:`47677`)
Member:
A release note is not needed if this is only a fix for an issue on main. If the behavior is changing from 1.4.3, as suggested in #47751 (comment), then a release note covering just that change will be needed.


Missing
^^^^^^^
pandas/core/dtypes/cast.py (6 changes: 6 additions & 0 deletions)
@@ -646,6 +646,12 @@ def _maybe_promote(dtype: np.dtype, fill_value=np.nan):

return np.dtype("object"), fill_value

elif isinstance(dtype, CategoricalDtype):
Member:
Could you move this before the isna check and add the Categorical handling of NaN values into this block? That keeps the Categorical-specific code together.

Otherwise this looks good.

if fill_value in dtype.categories:
return "category", fill_value
else:
return object, ensure_object(fill_value)

elif is_float(fill_value):
if issubclass(dtype.type, np.bool_):
dtype = np.dtype(np.object_)
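To make the intent of the new branch concrete, here is a minimal standalone sketch (hypothetical helper name, not the actual pandas internals, and returning the input dtype as the review discussion below ends up suggesting rather than the literal string "category"): a fill value that is already one of the categories keeps the categorical dtype, anything else falls back to object.

```python
import numpy as np
import pandas as pd

def promote_for_categorical(dtype: pd.CategoricalDtype, fill_value):
    # Mirrors the added elif branch: values already present in the categories
    # can be inserted without changing the dtype; anything else forces object.
    if fill_value in dtype.categories:
        return dtype, fill_value
    return np.dtype("object"), fill_value

dtype = pd.CategoricalDtype(["a", "b", "c"])
print(promote_for_categorical(dtype, "a"))  # (CategoricalDtype(categories=['a', 'b', 'c'], ...), 'a')
print(promote_for_categorical(dtype, 0))    # (dtype('O'), 0)
```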
pandas/core/indexing.py (10 changes: 9 additions & 1 deletion)
@@ -41,6 +41,7 @@
is_sequence,
)
from pandas.core.dtypes.concat import concat_compat
from pandas.core.dtypes.dtypes import CategoricalDtype
from pandas.core.dtypes.generic import (
ABCDataFrame,
ABCSeries,
@@ -2119,9 +2120,16 @@ def _setitem_with_indexer_missing(self, indexer, value):
new_values = Series([value], dtype=new_dtype)._values

if len(self.obj._values):
# GH#47677 handle enlarging with a scalar as a special case
@phofl (Member) commented on Jul 17, 2022:
This is not the right place for this. You have to create new_values correctly instead of avoiding concat_compat.

@CloseChoice (Member, Author) commented on Jul 17, 2022:
I do; here is the value of new_values:

(Pdb++) new_values
['a']
Categories (1, object): ['a']

The point of avoiding concat_compat is that it explicitly checks whether the dtypes are the same, and because the categories aren't the same the concatenated series is cast to object. If we changed that, it would also affect pd.concat([s, pd.Series(["a", "b"], dtype="category")]), or we would have to add a very specific special case to concat_compat that checks for length 1.
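To illustrate the object fallback described above (a sketch assuming stock pandas concat semantics, not code from this PR): concatenating two categorical Series whose category sets differ does not keep the categorical dtype.

```python
import pandas as pd

s1 = pd.Series(["a", "b", "c"], dtype="category")  # categories: ['a', 'b', 'c']
s2 = pd.Series(["a"], dtype="category")            # categories: ['a'] only

# The category sets differ, so the concatenated result falls back to object dtype.
print(pd.concat([s1, s2], ignore_index=True).dtype)  # object
```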

Member:
No, this is not what I was talking about. But let's start over:

If we want to handle this in maybe_promote, we should return the dtype we get as input, not the string "category". Returning the string loses all the information we already have, and then there is no need to special-case afterwards.

If we decide against handling this in maybe_promote, we have to handle it earlier, e.g. in line 2120 at the latest, not here. tolist() is never a good idea if you want to keep the dtype; it loses all precision information, e.g.

result = Series([1, 2, 3], dtype=pd.CategoricalDtype(categories=pd.Index([1, 2, 3], dtype="Int64")))
result.loc[3] = 2

this is converted into a CategoricalDtype with int64 categories rather than Int64. The same would go for int32; it would also get converted into int64.
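A short demonstration of that precision loss (a sketch assuming stock pandas behaviour, not code from this PR): rebuilding the values from tolist() re-infers the categories and silently drops the nullable Int64 dtype.

```python
import pandas as pd

cats = pd.Index([1, 2, 3], dtype="Int64")
ser = pd.Series([1, 2, 3], dtype=pd.CategoricalDtype(categories=cats))
print(ser.cat.categories.dtype)  # Int64 (nullable)

# Round-tripping through a plain Python list re-infers the categories.
rebuilt = pd.Series(ser.tolist() + [2], dtype="category")
print(rebuilt.cat.categories.dtype)  # int64 -- the nullable dtype information is lost
```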

Member (Author):
Thanks a lot. Of course you're right; just returning the given dtype works fine. I updated the PR. Handling this in maybe_promote feels right to me, since checking dtypes and returning them is already done there a lot.

if (
isinstance(self.obj.dtype, CategoricalDtype)
and new_dtype == "category"
):
new_values = Series(self.obj.tolist() + [value], dtype="category")
# GH#22717 handle casting compatibility that np.concatenate
# does incorrectly
new_values = concat_compat([self.obj._values, new_values])
else:
new_values = concat_compat([self.obj._values, new_values])
self.obj._mgr = self.obj._constructor(
new_values, index=new_index, name=self.obj.name
)._mgr
pandas/tests/indexing/test_indexing.py (14 changes: 14 additions & 0 deletions)
@@ -1019,3 +1019,17 @@ def test_ser_list_indexer_exceeds_dimensions(indexer_li):
    res = indexer_li(ser)[[0, 0]]
    exp = Series([10, 10], index=Index([0, 0]))
    tm.assert_series_equal(res, exp)


def test_additional_element_to_categorical_series_loc():
    result = Series(["a", "b", "c"], dtype="category")
    result.loc[3] = 0
    expected = Series(["a", "b", "c", 0], dtype="object")
    tm.assert_series_equal(result, expected)


def test_additional_categorical_element_loc():
    result = Series(["a", "b", "c"], dtype="category")
    result.loc[3] = "a"
    expected = Series(["a", "b", "c", "a"], dtype="category")
    tm.assert_series_equal(result, expected)
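
For context, a reproducer for the regression this PR targets (a sketch reconstructed from the issue title and the tests above; the exact pre-fix traceback is not shown in this thread): enlarging a categorical Series via .loc with a value that is not one of its categories raised an internal error instead of falling back to object dtype.

```python
import pandas as pd

ser = pd.Series(["a", "b", "c"], dtype="category")
ser.loc[3] = 0    # raised an internal error before the fix
print(ser.dtype)  # expected after the fix: object, matching the first test above
```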