API/ENH: union Categorical #13361
Changes from 3 commits
ccaeb76
7b37c34
77e7963
4499cda
17209f9
568784f
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -3943,6 +3943,38 @@ def f(): | |
'category', categories=list('cab'))}) | ||
tm.assert_frame_equal(result, expected) | ||
|
||
def test_union(self): | ||
Review comment: does that belong to the concat tests or the categoricals? The API is in
Reply: yeah, it could go there |
||
from pandas.types.concat import union_categoricals | ||
|
||
s = Categorical(list('abc')) | ||
s2 = Categorical(list('abd')) | ||
result = union_categoricals([s, s2]) | ||
expected = Categorical(list('abcabd')) | ||
tm.assert_categorical_equal(result, expected, ignore_order=True) | ||
|
||
s = Categorical([0,1,2]) | ||
s2 = Categorical([2,3,4]) | ||
result = union_categoricals([s, s2]) | ||
expected = Categorical([0,1,2,2,3,4]) | ||
tm.assert_categorical_equal(result, expected, ignore_order=True) | ||
|
||
s = Categorical([0,1.2,2]) | ||
s2 = Categorical([2,3.4,4]) | ||
result = union_categoricals([s, s2]) | ||
expected = Categorical([0,1.2,2,2,3.4,4]) | ||
tm.assert_categorical_equal(result, expected, ignore_order=True) | ||
|
||
# can't be ordered | ||
s = Categorical([0,1.2,2], ordered=True) | ||
with tm.assertRaises(TypeError): | ||
union_categoricals([s, s2]) | ||
|
||
# must exactly match types | ||
s = Categorical([0,1.2,2]) | ||
s2 = Categorical([2,3,4]) | ||
with tm.assertRaises(TypeError): | ||
union_categoricals([s, s2]) | ||
|
||
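The expected values in `test_union` are easier to read with the categories/codes representation in mind. The following is a minimal pure-Python sketch of that representation, not pandas code: `encode` and `decode` are hypothetical helpers, and sorting the unique values is an assumption that models how categories are inferred when none are given.

```python
def encode(values):
    """Model Categorical(values): sorted unique categories plus integer codes."""
    categories = sorted(set(values))
    position = {cat: i for i, cat in enumerate(categories)}
    codes = [position[v] for v in values]
    return categories, codes


def decode(categories, codes):
    """Map integer codes back to the values they stand for."""
    return [categories[code] for code in codes]


# The expected result of the first test case, Categorical(list('abcabd')).
categories, codes = encode(list("abcabd"))
print(categories)  # ['a', 'b', 'c', 'd']
print(codes)       # [0, 1, 2, 0, 1, 3]
```

Each input in the test has only three categories, so the union has to produce a category list covering both inputs and codes that index into it.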
def test_categorical_index_preserver(self): | ||
|
||
a = Series(np.arange(6, dtype='int64')) | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -201,6 +201,46 @@ def convert_categorical(x): | |
return Categorical(concatted, rawcats) | ||
|
||
|
||
def union_categoricals(to_union): | ||
Review comment: Add a
if not ignore_order and any(c.ordered for c in to_union):
    raise TypeError("Can only combine unordered Categoricals")
It would still return an unordered cat, of course. |
||
""" | ||
Review comment: add a versionadded tag |
||
Combine list-like of Categoricals, unioning categories. All | ||
must have the same dtype, and none can be ordered. | ||
|
||
Parameters | ||
---------- | ||
to_union : list like of Categorical | ||
|
||
Review comment: add a Raises section (and list when that happens) |
||
Returns | ||
------- | ||
Categorical | ||
A single array, categories will be ordered as they | ||
appear in the list | ||
""" | ||
from pandas import Index, Categorical | ||
|
||
if any(c.ordered for c in to_union): | ||
raise TypeError("Can only combine unordered Categoricals") | ||
|
||
first = to_union[0] | ||
Review comment: You should gracefully handle the case where the list of categoricals is empty. |
||
if not all(com.is_dtype_equal(c.categories, first.categories) | ||
for c in to_union): | ||
raise TypeError("dtype of categories must be the same") | ||
|
||
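The up-front checks can be sketched in isolation. This is a pure-Python model under stated assumptions, not the PR's code: `Cat` is a hypothetical stand-in for `Categorical`, its `dtype` string models the dtype of the categories, and raising `ValueError` for an empty input is only one possible way to act on the review suggestion about empty lists.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Cat:
    """Hypothetical stand-in for Categorical; `dtype` models categories.dtype."""
    categories: List
    codes: List[int]
    dtype: str
    ordered: bool = False


def validate(to_union: Sequence[Cat]) -> None:
    # Assumption: an empty input raises ValueError. The diff does not handle
    # this case and would fail on to_union[0] instead.
    if len(to_union) == 0:
        raise ValueError("No Categoricals to union")
    # Same rule as the diff: ordered inputs are rejected.
    if any(c.ordered for c in to_union):
        raise TypeError("Can only combine unordered Categoricals")
    # Same rule as the diff: all category dtypes must match the first one.
    first = to_union[0]
    if not all(c.dtype == first.dtype for c in to_union):
        raise TypeError("dtype of categories must be the same")
```

Checking for the empty case first keeps the error message specific instead of surfacing an `IndexError` from the `to_union[0]` lookup.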
for i, c in enumerate(to_union): | ||
if i == 0: | ||
cats = c.categories.tolist() | ||
Review comment: Use numpy arrays (or pandas indexes) and concatenate them all at once rather than converting things into lists. Something like:
In the worst case, suppose you have
My version takes linear time (and space), whereas your version will require quadratic time --
Reply: your solution won't preserve order, though with a small mod I think it could, need to
Reply: though to be honest I can't imagine this is a perf issue. The entire point of this is that categories is << codes. But I suppose if it can be done with no additional complexity then it should.
Reply: Thanks @shoyer. I didn't know
In [12]: pd.Index(['b','c']).append(pd.Index(['a'])).unique()
Out[12]: array(['b', 'c', 'a'], dtype=object)
Reply: @chris-b1 yes it IS a guarantee (and that's why it's much faster than |
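The trade-off discussed in this thread can be illustrated without pandas. The sketch below is a pure-Python analogue, not the code either reviewer posted: `union_incremental` and `union_single_pass` are hypothetical names, plain lists stand in for the category indexes, and `dict.fromkeys` plays the role of an order-preserving `unique()`.

```python
from itertools import chain


def union_incremental(category_lists):
    """Grow the result one input at a time, scanning it for every new item."""
    cats = []
    for categories in category_lists:
        # `c not in cats` is a linear scan, so the total work is quadratic
        # in the number of categories in the worst case.
        cats = cats + [c for c in categories if c not in cats]
    return cats


def union_single_pass(category_lists):
    """Concatenate once, then drop duplicates while keeping first-seen order."""
    # dict preserves insertion order (Python 3.7+), so this is linear time.
    return list(dict.fromkeys(chain.from_iterable(category_lists)))


inputs = [["b", "c"], ["a", "c"], ["b", "d"]]
print(union_incremental(inputs))   # ['b', 'c', 'a', 'd']
print(union_single_pass(inputs))   # ['b', 'c', 'a', 'd']
```

Both functions return categories in order of first appearance; only the single-pass version avoids rescanning the accumulated list.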
||
else: | ||
cats = cats + c.categories.difference(Index(cats)).tolist() | ||
|
||
cats = Index(cats) | ||
new_codes = [] | ||
for c in to_union: | ||
indexer = cats.get_indexer(c.categories) | ||
new_codes.append(indexer.take(c.codes)) | ||
codes = np.concatenate(new_codes) | ||
return Categorical.from_codes(codes, cats) | ||
Review comment: This still does a validation, maybe change to use the
Reply: maybe we should add a
Reply: Basically you would just end up doing two |
||
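The recoding step at the end of `union_categoricals` (`get_indexer` followed by `take`) can be sketched in pure Python. `remap_codes` is a hypothetical helper, and passing `-1` through unchanged is an assumption of this sketch about how missing values should be treated; the diff itself does not special-case it.

```python
def remap_codes(old_categories, old_codes, new_categories):
    """Translate codes that index old_categories into codes for new_categories."""
    position = {cat: i for i, cat in enumerate(new_categories)}
    # indexer[k] is where old category k sits in the combined list,
    # the role played by cats.get_indexer(c.categories) in the diff.
    indexer = [position[cat] for cat in old_categories]
    # Equivalent of indexer.take(c.codes), except that -1 (assumed here to
    # mark a missing value) is kept as -1 rather than used as an index.
    return [-1 if code == -1 else indexer[code] for code in old_codes]


combined = ["a", "b", "c", "d"]
first = remap_codes(["a", "b", "c"], [0, 1, 2], combined)
second = remap_codes(["a", "b", "d"], [0, 1, 2], combined)
print(first + second)  # [0, 1, 2, 0, 1, 3]
```

Only the small per-input indexer is computed from the categories; the potentially much longer codes are translated by a single lookup each.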
|
||
|
||
def _concat_datetime(to_concat, axis=0, typs=None): | ||
""" | ||
provide concatenation of an datetimelike array of arrays each of which is a | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -963,12 +963,17 @@ def assertNotIsInstance(obj, cls, msg=''): | |
|
||
|
||
def assert_categorical_equal(left, right, check_dtype=True, | ||
- obj='Categorical'): | ||
+ obj='Categorical', ignore_order=False): | ||
assertIsInstance(left, pd.Categorical, '[Categorical] ') | ||
Review comment: can you add a doc-string |
||
assertIsInstance(right, pd.Categorical, '[Categorical] ') | ||
|
||
- assert_index_equal(left.categories, right.categories, | ||
- obj='{0}.categories'.format(obj)) | ||
+ if ignore_order: | ||
+ assert_index_equal(left.categories.sort_values(), | ||
+ right.categories.sort_values(), | ||
+ obj='{0}.categories'.format(obj)) | ||
+ else: | ||
+ assert_index_equal(left.categories, right.categories, | ||
+ obj='{0}.categories'.format(obj)) | ||
assert_numpy_array_equal(left.codes, right.codes, check_dtype=check_dtype, | ||
obj='{0}.codes'.format(obj)) | ||
|
||
|
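Codes index into the category list, so two Categoricals holding the same values with differently ordered categories will generally have different codes. One way to make an order-insensitive comparison robust is to compare decoded values rather than raw codes. The sketch below is a pure-Python illustration of that idea, not the helper in this diff: `equal_ignoring_category_order` is a hypothetical name, `(categories, codes)` tuples stand in for Categoricals, and missing values are assumed absent.

```python
def equal_ignoring_category_order(left, right):
    """Compare two (categories, codes) pairs without regard to category order."""
    left_cats, left_codes = left
    right_cats, right_codes = right
    # The same set of categories must be present on both sides.
    if sorted(left_cats) != sorted(right_cats):
        return False
    # Decode before comparing, since a code's meaning depends on the
    # position of its category. Assumes no -1 (missing) codes.
    left_values = [left_cats[c] for c in left_codes]
    right_values = [right_cats[c] for c in right_codes]
    return left_values == right_values


a = (["a", "b", "c"], [0, 1, 2])
b = (["c", "b", "a"], [2, 1, 0])  # same values, categories reversed
print(equal_ignoring_category_order(a, b))  # True
```

In this example the raw codes differ (`[0, 1, 2]` versus `[2, 1, 0]`) even though both sides decode to `['a', 'b', 'c']`.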
Review comment: versionadded tag here