Add fix to raise error when category value 'x' is not predefined but is assigned through df.loc[..]=x #34011

Closed
Changes from 58 commits
79 commits
4487bec
Add fix to raise error when category value is not predefined
chrispe May 5, 2020
10098ab
Fix linting
chrispe May 5, 2020
cb34580
Added new test
chrispe May 5, 2020
1622663
Merge branch 'master' of https://github.com/pandas-dev/pandas into fi…
chrispe May 16, 2020
c627fa6
Add test case for unused categories
chrispe May 16, 2020
ba3a751
Remove trailing whitespace
chrispe May 16, 2020
51dcdfe
Fix linting
chrispe May 16, 2020
9057b26
Fix linting
chrispe May 16, 2020
06fdc3e
Remove temporary fix from generic.py
chrispe May 23, 2020
8c8f794
Merge branch 'master' of https://github.com/pandas-dev/pandas into fi…
chrispe May 23, 2020
582c023
First fix try through indexing.py
chrispe May 24, 2020
730fc2b
Fix lint
chrispe May 24, 2020
c275eb9
Fix import ordering
chrispe May 24, 2020
944ae24
Fix Update
chrispe May 24, 2020
8372bdb
Fix lint
chrispe May 24, 2020
0e5e418
Include more related test cases
chrispe May 24, 2020
eea359a
Fix linting
chrispe May 24, 2020
5f72d4e
Update test_indexing.py
chrispe May 24, 2020
781322a
Merge branch 'master' of https://github.com/pandas-dev/pandas into fi…
chrispe Oct 8, 2020
26f474b
import missing dtypes function
chrispe Oct 8, 2020
215943e
Fix linting
chrispe Oct 8, 2020
5bacde9
Include requested changes
chrispe Oct 10, 2020
96e4318
Fix import ordering/format
chrispe Oct 10, 2020
993be66
Merge branch 'master' of https://github.com/pandas-dev/pandas into fi…
chrispe Oct 10, 2020
42d968b
Merge branch 'master' of https://github.com/pandas-dev/pandas into fi…
chrispe Oct 11, 2020
e7ce246
Update test_categorical.py
chrispe Oct 11, 2020
31ef609
Fix format
chrispe Oct 11, 2020
ce3f463
Remove commas
chrispe Oct 11, 2020
a825269
Update test_categorical.py
chrispe Oct 11, 2020
3816789
Merge branch 'master' of https://github.com/pandas-dev/pandas into fi…
chrispe Oct 12, 2020
8651d25
Merge branch 'master' of https://github.com/pandas-dev/pandas into fi…
chrispe Oct 17, 2020
72726a0
Update solution
chrispe Oct 17, 2020
0f738c3
Merge branch 'master' of https://github.com/pandas-dev/pandas into fi…
chrispe Oct 17, 2020
51e2032
Fix lint
chrispe Oct 17, 2020
7d64357
Fix format issues
chrispe Oct 17, 2020
d68f215
Update indexing.py
chrispe Oct 17, 2020
5ea8ab1
Update indexing.py
chrispe Oct 18, 2020
b08efc1
Update indexing.py
chrispe Oct 18, 2020
69f4e62
Update test_categorical.py
chrispe Oct 18, 2020
4c33040
Update concat.py
chrispe Oct 18, 2020
e936736
Update cast.py
chrispe Oct 18, 2020
8031f8f
Update cast.py
chrispe Oct 18, 2020
1e1c094
Merge branch 'master' of https://github.com/pandas-dev/pandas into fi…
chrispe Oct 18, 2020
c889b1b
Merge branch 'master' of https://github.com/pandas-dev/pandas into fi…
chrispe Nov 4, 2020
c08c6c0
Update test_categorical.py
chrispe Nov 4, 2020
e6c3a4c
Merge branch 'master' of https://github.com/pandas-dev/pandas into fi…
chrispe Nov 28, 2020
c862d99
Revert previous approach and include concat changes
chrispe Nov 28, 2020
5baa314
Remove non-required convertion
chrispe Nov 28, 2020
cb5d8e4
Update concat.py
chrispe Nov 28, 2020
7d7da20
Update concat.py
chrispe Nov 28, 2020
ecad50f
Update cast.py
chrispe Dec 6, 2020
ab5af93
Merge branch 'master' of https://github.com/pandas-dev/pandas into fi…
chrispe Dec 6, 2020
0c6b68a
Merge branch 'master' of https://github.com/pandas-dev/pandas into fi…
chrispe Jan 21, 2021
4197a74
Merge branch 'master' of https://github.com/pandas-dev/pandas into fi…
chrispe Feb 13, 2021
4bc05c6
Merge branch 'master' of https://github.com/pandas-dev/pandas into fi…
chrispe Feb 15, 2021
af5e141
Add new version with raise
chrispe Feb 15, 2021
6d45570
Add format fixes
chrispe Feb 15, 2021
31612ed
Update test_categorical.py
chrispe Feb 15, 2021
4a2a8e8
Merge branch 'master' of https://github.com/pandas-dev/pandas into fi…
chrispe Feb 17, 2021
dd7e3ca
Merge branch 'master' of https://github.com/pandas-dev/pandas into fi…
chrispe Feb 20, 2021
6d9e667
Update
chrispe Feb 20, 2021
e0da655
Use prio_cat_dtype only for EAs
chrispe Feb 20, 2021
92d1f14
Revert usage of first_ea
chrispe Feb 20, 2021
9b9b382
Fix mypy errors
chrispe Feb 20, 2021
d3df994
Use unique1d in _cast_to_common_type
chrispe Feb 20, 2021
41aa9e3
Fix isort error
chrispe Feb 20, 2021
ca0eb1f
Renamed input variable for find_common_type
chrispe Feb 20, 2021
e2cfb79
Merge branch 'master' of https://github.com/pandas-dev/pandas into fi…
chrispe Mar 7, 2021
931d6c8
Remove new argument in find_common_type
chrispe Mar 7, 2021
8065ddb
Add check to _get_common_dtype
chrispe Mar 13, 2021
5d533dd
Merge branch 'master' of https://github.com/pandas-dev/pandas into fi…
chrispe Mar 13, 2021
b21326b
Update dtypes.py
chrispe Mar 13, 2021
335fc06
Update dtypes.py
chrispe Mar 13, 2021
950dcc4
Update dtypes.py
chrispe Mar 13, 2021
2ee1df8
Update dtypes.py
chrispe Mar 13, 2021
17120f0
Test
chrispe Mar 13, 2021
439b49f
Add flag in get_common_type
chrispe Mar 13, 2021
c6e3435
Revert
chrispe Mar 13, 2021
fc40817
Update dtypes.py
chrispe Mar 13, 2021
77 changes: 76 additions & 1 deletion pandas/core/dtypes/concat.py
@@ -15,6 +15,7 @@
     is_sparse,
 )
 from pandas.core.dtypes.generic import ABCCategoricalIndex, ABCSeries
+from pandas.core.dtypes.missing import isna

 from pandas.core.arrays import ExtensionArray
 from pandas.core.arrays.sparse import SparseArray
@@ -61,6 +62,70 @@ def _cast_to_common_type(arr: ArrayLike, dtype: DtypeObj) -> ArrayLike:
     return arr.astype(dtype, copy=False)


+def _can_cast_to_categorical(to_cast):
+    """
+    Evaluates if a list of arrays can be cast to a single categorical dtype.
+    The categorical dtype to cast to is determined by any of the arrays which
+    is already of categorical dtype. If no such array exists, or if the
+    categorical dtypes found differ from each other, then it will return False.
+    If the other arrays contain values that are not categories of the first
+    array's dtype, a ValueError is raised.
+
+    Parameters
+    ----------
+    to_cast : array of arrays
+
+    Returns
+    -------
+    True if possible to cast to a single categorical dtype, False otherwise.
+    """
+    if len(to_cast) == 0:
+        raise ValueError("No arrays to cast")
+
+    types = [x.dtype for x in to_cast]
+
+    # If any of the arrays is of categorical dtype, then we will use it as a reference.
+    # If no such array exists, then we just return.
+    if any(is_categorical_dtype(t) for t in types):
+        cat_dtypes = []
+        for t in types:
+            if (
+                is_categorical_dtype(t)
+                and len(t.categories.values) > 0
+                and any(~isna(t.categories.values))
+            ):
+                categorical_values_dtype = t.categories.values.dtype
+                if all(
+                    is_categorical_dtype(x) or np.can_cast(categorical_values_dtype, x)
+                    for x in types
+                ):
+                    cat_dtypes.append(t)
+        if len(cat_dtypes) == 0 or any(
+            not is_dtype_equal(dtype, cat_dtypes[0]) for dtype in cat_dtypes[1:]
+        ):
+            return False
+    else:
+        return False
+
+    def categorical_contains_values(categorical_dtype, x):
+        unique_values = np.unique(x[~isna(x)])
+        if any(
+            val not in categorical_dtype.categories for val in unique_values.tolist()
+        ):
+            return False
+        return True
+
+    if not all(
+        categorical_contains_values(to_cast[0].dtype, other) or len(other) == 0
+        for other in to_cast[1:]
+    ):
+        raise ValueError(
+            "Cannot concat on a Categorical with a new category, "
+            "set the categories first"
+        )
+
+    return True
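
The membership test at the core of `_can_cast_to_categorical` can be approximated with public pandas API. What follows is a minimal sketch, not the PR's implementation: `values_fit_categories` is a hypothetical helper name, and it covers only the "are all non-missing values existing categories" part, not the dtype comparison.

```python
import pandas as pd


def values_fit_categories(cat_dtype, values):
    """Return True if every non-missing value is an existing category.

    Hypothetical helper for illustration only; it is not part of pandas.
    """
    # Use object dtype so that mixed or missing values are kept as given.
    values = pd.Series(values, dtype=object)
    present = values[values.notna()]
    # An empty selection yields True, matching the "nothing to reject" case.
    return bool(present.isin(cat_dtype.categories).all())


dtype = pd.CategoricalDtype(["a", "b", "c"])
fits = values_fit_categories(dtype, ["a", "a", "b", None])
does_not_fit = values_fit_categories(dtype, ["d", "e"])
```

Missing values are skipped before the check, which mirrors the `x[~isna(x)]` filtering in the diff above.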


 def concat_compat(to_concat, axis: int = 0, ea_compat_axis: bool = False):
     """
     provide concatenation of an array of arrays each of which is a single
@@ -108,7 +173,17 @@ def is_nonempty(x) -> bool:
         # we ignore axis here, as internally concatting with EAs is always
         # for axis=0
         if not single_dtype:
-            target_dtype = find_common_type([x.dtype for x in to_concat])
+            # Special case for handling concat with categorical series.
+            # We need to make sure that categorical dtype is preserved
+            # when an array of valid values is given (GH#25383)
+            if (
+                isinstance(to_concat[0], ExtensionArray)
+                and all(x.shape[0] == 1 for x in to_concat[1:])
Contributor Author

I couldn't find a better way to detect when the concat_compat function is called through index expansion, so in cases like this:

ser = pd.Series(Categorical(["a", "b", "c"]))
ser.loc[3] = "c"

With the latest commit we raise a ValueError when an invalid value is added to the categorical through index expansion. It also enables index expansion of a categorical of any dtype.

+                and _can_cast_to_categorical(to_concat)
Contributor

This is a very complicated implementation. This should all be in `find_common_type`, but should be much simpler than this. Either the dtypes are the same or they are not; changing them is not in scope for this issue.

Contributor Author

Ok, scope is more clear to me now. I will revert back to the previous approach and adapt it to raise on dtype mismatch.

+            ):
+                target_dtype = to_concat[0].dtype
+            else:
+                target_dtype = find_common_type([x.dtype for x in to_concat])
             to_concat = [_cast_to_common_type(arr, target_dtype) for arr in to_concat]

         if isinstance(to_concat[0], ExtensionArray):
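
For context on the "set the categories first" wording discussed in this thread: at the array level, a `Categorical` already rejects in-place assignment of a value that is not one of its categories. The sketch below only illustrates that existing behavior and is not code from this PR. It assumes the exception type may differ between pandas releases, so both `ValueError` and `TypeError` are caught.

```python
import pandas as pd

cat = pd.Categorical(["a", "b", "c"])

# Assigning a value that is already a category works in place.
cat[0] = "c"

# Assigning an unknown value is rejected. The exception type is treated as
# an assumption here, hence the broad except clause.
try:
    cat[1] = "d"
    rejected = False
except (ValueError, TypeError):
    rejected = True

# "Set the categories first": widen the categories, then assign.
widened = cat.add_categories(["d"])
widened[1] = "d"
```

`add_categories` returns a new `Categorical`, so `cat` itself keeps its original three categories.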
116 changes: 116 additions & 0 deletions pandas/tests/series/test_categorical.py
@@ -0,0 +1,116 @@
+import pytest
+
+from pandas.core.dtypes.concat import _can_cast_to_categorical
+
+import pandas as pd
+from pandas import Categorical
+import pandas._testing as tm
+
+
+class TestCategoricalSeries:
+    def test_setitem_undefined_category_raises(self):
+        ser = pd.Series(Categorical(["a", "b", "c"]))
+        msg = (
+            "Cannot setitem on a Categorical with a new category, "
+            "set the categories first"
+        )
+        with pytest.raises(ValueError, match=msg):
+            ser.loc[2] = "d"
+
+    def test_concat_undefined_category_raises(self):
+        ser = pd.Series(Categorical(["a", "b", "c"]))
+        msg = (
+            "Cannot concat on a Categorical with a new category, "
+            "set the categories first"
+        )
+        with pytest.raises(ValueError, match=msg):
+            ser.loc[3] = "d"
+
+    def test_loc_category_dtype_retention(self):
+        # Case 1
+        df = pd.DataFrame(
+            {
+                "int": [0, 1, 2],
+                "cat": Categorical(["a", "b", "c"], categories=["a", "b", "c"]),
+            }
+        )
+        df.loc[3] = [3, "c"]
+        expected = pd.DataFrame(
+            {
+                "int": [0, 1, 2, 3],
+                "cat": Categorical(["a", "b", "c", "c"], categories=["a", "b", "c"]),
+            }
+        )
+        tm.assert_frame_equal(df, expected)
+
+        # Case 2
+        ser = pd.Series(Categorical(["a", "b", "c"]))
+        ser.loc[3] = "c"
+        expected = pd.Series(Categorical(["a", "b", "c", "c"]))
+        tm.assert_series_equal(ser, expected)
+
+        # Case 3
+        ser = pd.Series(Categorical([1, 2, 3]))
+        ser.loc[3] = 3
+        expected = pd.Series(Categorical([1, 2, 3, 3]))
+        tm.assert_series_equal(ser, expected)
+
+        # Case 4
+        ser = pd.Series(Categorical([1, 2, 3]))
+        ser.loc[3] = pd.NA
+        expected = pd.Series(Categorical([1, 2, 3, pd.NA]))
+        tm.assert_series_equal(ser, expected)

+    def test_can_cast_to_categorical(self):
+        # Case 1:
+        # Series of identical categorical dtype should
+        # be able to concat to categorical
+        ser1 = pd.Series(Categorical(["a", "b", "c"]))
+        ser2 = pd.Series(Categorical(["a", "b", "c"]))
+        arr = [ser1, ser2]
+        assert _can_cast_to_categorical(arr) is True
+
+        # Case 2:
+        # Series of non-identical categorical dtype should
+        # not be able to concat to categorical
+        ser1 = pd.Series(Categorical(["a", "b", "c"]))
+        ser2 = pd.Series(Categorical(["a", "b", "d"]))
+        arr = [ser1, ser2]
+        assert _can_cast_to_categorical(arr) is False
+
+        # Concat of a categorical series with a series
+        # containing only values identical to the
+        # categorical values should be possible
+
+        # Case 3: For string categorical values
+        ser1 = pd.Series(Categorical(["a", "b", "c"]))
+        ser2 = pd.Series(["a", "a", "b"])
+        arr = [ser1, ser2]
+        assert _can_cast_to_categorical(arr) is True
+
+        # Case 4: For int categorical values
+        ser1 = pd.Series(Categorical([1, 2, 3]))
+        ser2 = pd.Series([1, 2])
+        arr = [ser1, ser2]
+        assert _can_cast_to_categorical(arr) is True
+
+        # The rest should raise because not all values
+        # are present in the categorical.
+
+        # Case 5
+        ser1 = pd.Series(Categorical([1, 2, 3]))
+        ser2 = pd.Series([3, 4])
+        arr = [ser1, ser2]
+        msg = (
+            "Cannot concat on a Categorical with a new category, "
+            "set the categories first"
+        )
+        with pytest.raises(ValueError, match=msg):
+            _can_cast_to_categorical(arr)
+
+        # Case 6
+        ser1 = pd.Series(Categorical(["a", "b", "c"]))
+        ser2 = pd.Series(["d", "e"])
+        arr = [ser1, ser2]
+        with pytest.raises(ValueError, match=msg):
+            _can_cast_to_categorical(arr)
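
Cases 1 and 2 of `test_can_cast_to_categorical` correspond to behavior that can also be observed through the public `pd.concat` API. The sketch below is an illustration under that assumption, not part of the PR: identical categorical dtypes are preserved, differing ones fall back to a non-categorical dtype, and `union_categoricals` is one documented way to combine them explicitly. The variable names are made up for the example.

```python
import pandas as pd
from pandas.api.types import union_categoricals

same_a = pd.Series(pd.Categorical(["a", "b", "c"]))
same_b = pd.Series(pd.Categorical(["a", "b", "c"]))
other = pd.Series(pd.Categorical(["a", "b", "d"]))

# Identical categories: the categorical dtype is kept.
kept = pd.concat([same_a, same_b], ignore_index=True)

# Different categories: the result is no longer categorical.
fell_back = pd.concat([same_a, other], ignore_index=True)

# Explicitly merging the category sets keeps a categorical result.
combined = union_categoricals([same_a, other])
```

The exact fallback dtype of `fell_back` can depend on the pandas version, so the example only relies on it not being categorical.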