Add fix to raise error when category value 'x' is not predefined but is assigned through df.loc[..]=x #34011

Closed
Changes from 30 commits
79 commits
4487bec
Add fix to raise error when category value is not predefined
chrispe May 5, 2020
10098ab
Fix linting
chrispe May 5, 2020
cb34580
Added new test
chrispe May 5, 2020
1622663
Merge branch 'master' of https://github.com/pandas-dev/pandas into fi…
chrispe May 16, 2020
c627fa6
Add test case for unused categories
chrispe May 16, 2020
ba3a751
Remove trailing whitespace
chrispe May 16, 2020
51dcdfe
Fix linting
chrispe May 16, 2020
9057b26
Fix linting
chrispe May 16, 2020
06fdc3e
Remove temporary fix from generic.py
chrispe May 23, 2020
8c8f794
Merge branch 'master' of https://github.com/pandas-dev/pandas into fi…
chrispe May 23, 2020
582c023
First fix try through indexing.py
chrispe May 24, 2020
730fc2b
Fix lint
chrispe May 24, 2020
c275eb9
Fix import ordering
chrispe May 24, 2020
944ae24
Fix Update
chrispe May 24, 2020
8372bdb
Fix lint
chrispe May 24, 2020
0e5e418
Include more related test cases
chrispe May 24, 2020
eea359a
Fix linting
chrispe May 24, 2020
5f72d4e
Update test_indexing.py
chrispe May 24, 2020
781322a
Merge branch 'master' of https://github.com/pandas-dev/pandas into fi…
chrispe Oct 8, 2020
26f474b
import missing dtypes function
chrispe Oct 8, 2020
215943e
Fix linting
chrispe Oct 8, 2020
5bacde9
Include requested changes
chrispe Oct 10, 2020
96e4318
Fix import ordering/format
chrispe Oct 10, 2020
993be66
Merge branch 'master' of https://github.com/pandas-dev/pandas into fi…
chrispe Oct 10, 2020
42d968b
Merge branch 'master' of https://github.com/pandas-dev/pandas into fi…
chrispe Oct 11, 2020
e7ce246
Update test_categorical.py
chrispe Oct 11, 2020
31ef609
Fix format
chrispe Oct 11, 2020
ce3f463
Remove commas
chrispe Oct 11, 2020
a825269
Update test_categorical.py
chrispe Oct 11, 2020
3816789
Merge branch 'master' of https://github.com/pandas-dev/pandas into fi…
chrispe Oct 12, 2020
8651d25
Merge branch 'master' of https://github.com/pandas-dev/pandas into fi…
chrispe Oct 17, 2020
72726a0
Update solution
chrispe Oct 17, 2020
0f738c3
Merge branch 'master' of https://github.com/pandas-dev/pandas into fi…
chrispe Oct 17, 2020
51e2032
Fix lint
chrispe Oct 17, 2020
7d64357
Fix format issues
chrispe Oct 17, 2020
d68f215
Update indexing.py
chrispe Oct 17, 2020
5ea8ab1
Update indexing.py
chrispe Oct 18, 2020
b08efc1
Update indexing.py
chrispe Oct 18, 2020
69f4e62
Update test_categorical.py
chrispe Oct 18, 2020
4c33040
Update concat.py
chrispe Oct 18, 2020
e936736
Update cast.py
chrispe Oct 18, 2020
8031f8f
Update cast.py
chrispe Oct 18, 2020
1e1c094
Merge branch 'master' of https://github.com/pandas-dev/pandas into fi…
chrispe Oct 18, 2020
c889b1b
Merge branch 'master' of https://github.com/pandas-dev/pandas into fi…
chrispe Nov 4, 2020
c08c6c0
Update test_categorical.py
chrispe Nov 4, 2020
e6c3a4c
Merge branch 'master' of https://github.com/pandas-dev/pandas into fi…
chrispe Nov 28, 2020
c862d99
Revert previous approach and include concat changes
chrispe Nov 28, 2020
5baa314
Remove non-required convertion
chrispe Nov 28, 2020
cb5d8e4
Update concat.py
chrispe Nov 28, 2020
7d7da20
Update concat.py
chrispe Nov 28, 2020
ecad50f
Update cast.py
chrispe Dec 6, 2020
ab5af93
Merge branch 'master' of https://github.com/pandas-dev/pandas into fi…
chrispe Dec 6, 2020
0c6b68a
Merge branch 'master' of https://github.com/pandas-dev/pandas into fi…
chrispe Jan 21, 2021
4197a74
Merge branch 'master' of https://github.com/pandas-dev/pandas into fi…
chrispe Feb 13, 2021
4bc05c6
Merge branch 'master' of https://github.com/pandas-dev/pandas into fi…
chrispe Feb 15, 2021
af5e141
Add new version with raise
chrispe Feb 15, 2021
6d45570
Add format fixes
chrispe Feb 15, 2021
31612ed
Update test_categorical.py
chrispe Feb 15, 2021
4a2a8e8
Merge branch 'master' of https://github.com/pandas-dev/pandas into fi…
chrispe Feb 17, 2021
dd7e3ca
Merge branch 'master' of https://github.com/pandas-dev/pandas into fi…
chrispe Feb 20, 2021
6d9e667
Update
chrispe Feb 20, 2021
e0da655
Use prio_cat_dtype only for EAs
chrispe Feb 20, 2021
92d1f14
Revert usage of first_ea
chrispe Feb 20, 2021
9b9b382
Fix mypy errors
chrispe Feb 20, 2021
d3df994
Use unique1d in _cast_to_common_type
chrispe Feb 20, 2021
41aa9e3
Fix isort error
chrispe Feb 20, 2021
ca0eb1f
Renamed input variable for find_common_type
chrispe Feb 20, 2021
e2cfb79
Merge branch 'master' of https://github.com/pandas-dev/pandas into fi…
chrispe Mar 7, 2021
931d6c8
Remove new argument in find_common_type
chrispe Mar 7, 2021
8065ddb
Add check to _get_common_dtype
chrispe Mar 13, 2021
5d533dd
Merge branch 'master' of https://github.com/pandas-dev/pandas into fi…
chrispe Mar 13, 2021
b21326b
Update dtypes.py
chrispe Mar 13, 2021
335fc06
Update dtypes.py
chrispe Mar 13, 2021
950dcc4
Update dtypes.py
chrispe Mar 13, 2021
2ee1df8
Update dtypes.py
chrispe Mar 13, 2021
17120f0
Test
chrispe Mar 13, 2021
439b49f
Add flag in get_common_type
chrispe Mar 13, 2021
c6e3435
Revert
chrispe Mar 13, 2021
fc40817
Update dtypes.py
chrispe Mar 13, 2021
32 changes: 29 additions & 3 deletions pandas/core/indexing.py
@@ -11,8 +11,10 @@
 from pandas.errors import AbstractMethodError, InvalidIndexError
 from pandas.util._decorators import doc
 
+from pandas.core.dtypes.cast import find_common_type
 from pandas.core.dtypes.common import (
     is_array_like,
+    is_categorical_dtype,
     is_hashable,
     is_integer,
     is_iterator,
@@ -1842,7 +1844,14 @@ def _setitem_with_indexer_missing(self, indexer, value):
         """
         Insert new row(s) or column(s) into the Series or DataFrame.
         """
-        from pandas import Series
+        from pandas import DataFrame, Series
+
+        def check_valid_categorical(new_values, obj_dtype):
+            if is_categorical_dtype(obj_dtype):
+                if (~np.in1d(new_values, obj_dtype.categories.values)).any():
+                    raise ValueError(
+                        "Cannot setitem on a Categorical with a new category"
+                    )
 
         # reindex the axis to the new value
         # and set inplace
@@ -1867,8 +1876,16 @@ def _setitem_with_indexer_missing(self, indexer, value):
                 # GH#22717 handle casting compatibility that np.concatenate
                 # does incorrectly
                 new_values = concat_compat([self.obj._values, new_values])
+                if is_object_dtype(new_values.dtype):
+                    dtype = None
+                else:
+                    dtype = find_common_type([self.obj.dtype, new_values.dtype])
+            else:
+                dtype = None
+
+            check_valid_categorical(new_values, self.obj.dtype)
             self.obj._mgr = self.obj._constructor(
-                new_values, index=new_index, name=self.obj.name
+                new_values, index=new_index, name=self.obj.name, dtype=dtype
             )._mgr
             self.obj._maybe_update_cacher(clear=True)

@@ -1893,7 +1910,16 @@ def _setitem_with_indexer_missing(self, indexer, value):
                     if len(value) != len(self.obj.columns):
                         raise ValueError("cannot set a row with mismatched columns")
 
-                value = Series(value, index=self.obj.columns, name=indexer)
+                if len(set(self.obj.dtypes)) > 1:
Contributor

this entire implementation should not be here, rather in Categorical indexing itself. Indexing is already very complicated and we are trying to remove things from core/indexing.py; this is dtype specific.

Contributor

pls address this comment
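
For context on the reviewer's point that this check belongs to Categorical itself: in-place assignment on a `Categorical` already rejects values outside its categories. The snippet below is a minimal sketch of that existing behaviour, not part of this PR. The exception type has differed between pandas versions, so the example deliberately catches both `TypeError` and `ValueError`.

```python
import pandas as pd

cat = pd.Categorical(["a", "b", "c"])

# Assigning a value that is already a category is accepted.
cat[0] = "b"

# Assigning an unknown category is rejected by Categorical itself.
# The exception type depends on the pandas version, so catch both.
try:
    cat[1] = "d"
    rejected = False
    message = ""
except (TypeError, ValueError) as err:
    rejected = True
    message = str(err)

print(rejected)
print(message)
```

The gap this PR targets is therefore setting with enlargement (`.loc` on a new label), which does not go through this in-place path.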

Contributor Author

chrispe Oct 18, 2020

I removed the dtype-specific code from core/indexing.py. Some changes still remain there, because I couldn't find an alternative place for them. With the latest changes the Categorical dtype is preserved when a row is added:

import pandas as pd

df = pd.DataFrame.from_dict(
    {
        "reg": [0, 1, 2],
        "cat": pd.Categorical(["a", "b", "b"], categories=["a", "b", "c", "d"]),
    }
)
print(df.dtypes)  # reg is int64, cat is categorical

df.loc[3] = (3, "c")  # add a row whose categorical value exists in the categories
print(df.dtypes)  # reg is int64, cat is still categorical (which is not the case in master)

Some additional points about the changes in this PR:

  • When an unknown category is appended to a pd.DataFrame through index expansion, that value is replaced with NaN.
In [1]: import pandas as pd
   ...: from pandas import Categorical
   ...: df = pd.DataFrame({"int": [0, 1], "cat": pd.Categorical(["a", "b"], categories=["a", "b"])})
   ...: df.loc[3] = [3.3, "d"]
   ...: df
Out[1]:
   int  cat
0  0.0    a
1  1.0    b
3  3.3  NaN
In [2]: df.dtypes                                                               
Out[2]: 
int     float64
cat    category
dtype: object

In [3]: df.dtypes[1]                                                            
Out[3]: CategoricalDtype(categories=['a', 'b'], ordered=False)
  • The same issue is not addressed for pd.Series in this PR, which means that the dtype of the Series changes to object when an unseen category is appended through index expansion.
In [1]: import pandas as pd
   ...: from pandas import Categorical
   ...: ser = pd.Series(Categorical(["a", "b", "c"]))
   ...: ser.loc[3] = "d"
   ...: ser
Out[1]: 
0    a
1    b
2    c
3    d
dtype: object
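
The NaN in the DataFrame case above is consistent with how pandas coerces a value to a fixed `CategoricalDtype`, which is what the `Series(data=[value[i]], dtype=self.obj.dtypes[i])` line in this diff does for each column. The following is a minimal sketch of that coercion outside the indexing code, using `pd.concat` in place of `.loc` enlargement. The variable names are illustrative only.

```python
import pandas as pd

dtype = pd.CategoricalDtype(categories=["a", "b"])
ser = pd.Series(["a", "b"], dtype=dtype)

# A value that is a known category keeps its label after coercion.
known = pd.Series(["b"], index=[2], dtype=dtype)

# A value outside the categories is coerced to NaN rather than raising.
unknown = pd.Series(["d"], index=[3], dtype=dtype)

# Concatenating pieces that share one categorical dtype preserves that dtype.
combined = pd.concat([ser, known, unknown])

print(unknown.isna().all())
print(combined.dtype)
```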

+                    value = list(value)
+                    for i in range(len(self.obj.columns)):
+                        value[i] = Series(data=[value[i]], dtype=self.obj.dtypes[i])
+                        check_valid_categorical(value[i], self.obj.dtypes[i])
+                    value = dict(zip(self.obj.columns, value))
+                    value = DataFrame(value)
+                    value.index = [indexer]
+                else:
+                    value = Series(value, index=self.obj.columns, name=indexer)
 
             self.obj._mgr = self.obj.append(value)._mgr
             self.obj._maybe_update_cacher(clear=True)
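
The membership test performed by `check_valid_categorical` in this diff can be tried in isolation. The sketch below is a standalone approximation under two assumptions that are not in the PR: it uses `Series.isin` rather than `np.in1d`, and it ignores missing values instead of flagging them as new categories.

```python
import pandas as pd


def check_valid_categorical(new_values, obj_dtype):
    """Raise if a non-missing value is absent from a categorical dtype."""
    if not isinstance(obj_dtype, pd.CategoricalDtype):
        return  # nothing to validate for non-categorical dtypes
    values = pd.Series(list(new_values), dtype=object)
    # Assumption: missing values are allowed, only real labels are checked.
    unknown = values.notna() & ~values.isin(obj_dtype.categories)
    if unknown.any():
        raise ValueError("Cannot setitem on a Categorical with a new category")


cat_dtype = pd.CategoricalDtype(categories=["a", "b", "c"])

check_valid_categorical(["a", "c"], cat_dtype)  # passes silently

try:
    check_valid_categorical(["a", "d"], cat_dtype)
    raised = False
except ValueError:
    raised = True

print(raised)
```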
54 changes: 54 additions & 0 deletions pandas/tests/series/test_categorical.py
@@ -0,0 +1,54 @@
+import pytest
+
+import pandas as pd
+from pandas import Categorical, Index
+import pandas._testing as tm
+
+
+class TestCategoricalSeries:
+    def test_loc_new_category_series_raises(self):
+        ser = pd.Series(Categorical(["a", "b", "c"]))
+        msg = "Cannot setitem on a Categorical with a new category"
+        with pytest.raises(ValueError, match=msg):
+            ser.loc[3] = "d"
+
+    def test_unused_category_retention(self):
+        # Init case
+        exp_cats = Index(["a", "b", "c", "d"])
+        ser = pd.Series(Categorical(["a", "b", "c"], categories=exp_cats))
+        tm.assert_index_equal(ser.cat.categories, exp_cats)
+
+        # Modify case
+        ser.loc[0] = "b"
+        expected = pd.Series(Categorical(["b", "b", "c"], categories=exp_cats))
+        tm.assert_index_equal(ser.cat.categories, exp_cats)
+        tm.assert_series_equal(ser, expected)
+
+    def test_loc_new_category_row_raises(self):
+        df = pd.DataFrame(
+            {
+                "int": [0, 1, 2],
+                "cat": Categorical(["a", "b", "c"], categories=["a", "b", "c"]),
+            }
+        )
+        msg = "Cannot setitem on a Categorical with a new category"
+        with pytest.raises(ValueError, match=msg):
+            df.loc[3] = [3, "d"]
+
+    def test_loc_new_row_category_dtype_retention(self):
+        df = pd.DataFrame(
+            {
+                "int": [0, 1, 2],
+                "cat": pd.Categorical(["a", "b", "c"], categories=["a", "b", "c"]),
+            }
+        )
+        df.loc[3] = [3, "c"]
+
+        expected = pd.DataFrame(
+            {
+                "int": [0, 1, 2, 3],
+                "cat": pd.Categorical(["a", "b", "c", "c"], categories=["a", "b", "c"]),
+            }
+        )
+
+        tm.assert_frame_equal(df, expected)