ENH: add sort_categories argument to union_categoricals #13846
Closed
5 commits
c559662 ENH: add sort_categories argument to union_categoricals (sinhrks)
eea1777 skip r-esort when possible on fastpath (chris-b1)
ecb2ae9 more tests; handle sorth with ordered (chris-b1)
ff0bb5e add follow-up PRs to whatsnew (chris-b1)
3a710f0 lint fix (chris-b1)
@@ -211,29 +211,31 @@ def convert_categorical(x):
     return Categorical(concatted, rawcats)
 
 
-def union_categoricals(to_union):
+def union_categoricals(to_union, sort_categories=False):
     """
     Combine list-like of Categoricals, unioning categories. All
-    must have the same dtype, and none can be ordered.
+    categories must have the same dtype.
 
     .. versionadded:: 0.19.0
 
     Parameters
     ----------
     to_union : list-like of Categoricals
+    sort_categories : boolean, default False
+        If true, resulting categories will be lexsorted, otherwise
+        they will be ordered as they appear in the data.
 
     Returns
     -------
-    Categorical
-        A single array, categories will be ordered as they
-        appear in the list
+    result : Categorical
 
     Raises
     ------
     TypeError
         - all inputs do not have the same dtype
         - all inputs do not have the same ordered property
         - all inputs are ordered and their categories are not identical
+        - sort_categories=True and Categoricals are ordered
     ValueError
         Empty list of categoricals passed
     """
@@ -244,41 +246,51 @@ def union_categoricals(to_union):
 
     first = to_union[0]
 
-    if not all(is_dtype_equal(c.categories.dtype, first.categories.dtype)
-               for c in to_union):
+    if not all(is_dtype_equal(other.categories.dtype, first.categories.dtype)
+               for other in to_union[1:]):
         raise TypeError("dtype of categories must be the same")
 
+    ordered = False
     if all(first.is_dtype_equal(other) for other in to_union[1:]):
-        return Categorical(np.concatenate([c.codes for c in to_union]),
-                           categories=first.categories, ordered=first.ordered,
-                           fastpath=True)
+        # identical categories - fastpath
+        categories = first.categories
+        ordered = first.ordered
+        new_codes = np.concatenate([c.codes for c in to_union])
+
+        if sort_categories and ordered:
+            raise TypeError("Cannot use sort_categories=True with "
+                            "ordered Categoricals")
+
+        if sort_categories and not categories.is_monotonic_increasing:
Review comment: do not sort
+            categories = categories.sort_values()
Review comment: I think sort can be skipped if categories is monotonic-increasing.
+            indexer = first.categories.get_indexer(categories)
+            new_codes = take_1d(indexer, new_codes, fill_value=-1)
     elif all(not c.ordered for c in to_union):
-        # not ordered
-        pass
+        # different categories - union and recode
+        cats = first.categories.append([c.categories for c in to_union[1:]])
+        categories = Index(cats.unique())
+        if sort_categories:
+            categories = categories.sort_values()
+
+        new_codes = []
+        for c in to_union:
+            if len(c.categories) > 0:
+                indexer = categories.get_indexer(c.categories)
+                new_codes.append(take_1d(indexer, c.codes, fill_value=-1))
+            else:
+                # must be all NaN
+                new_codes.append(c.codes)
+        new_codes = np.concatenate(new_codes)
     else:
-        # to show a proper error message
+        # ordered - to show a proper error message
         if all(c.ordered for c in to_union):
             msg = ("to union ordered Categoricals, "
                    "all categories must be the same")
             raise TypeError(msg)
         else:
             raise TypeError('Categorical.ordered must be the same')
 
-    cats = first.categories
-    unique_cats = cats.append([c.categories for c in to_union[1:]]).unique()
-    categories = Index(unique_cats)
-
-    new_codes = []
-    for c in to_union:
-        if len(c.categories) > 0:
-            indexer = categories.get_indexer(c.categories)
-            new_codes.append(take_1d(indexer, c.codes, fill_value=-1))
-        else:
-            # must be all NaN
-            new_codes.append(c.codes)
-
-    new_codes = np.concatenate(new_codes)
-    return Categorical(new_codes, categories=categories, ordered=False,
+    return Categorical(new_codes, categories=categories, ordered=ordered,
                        fastpath=True)
 
 
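For readers following the "different categories - union and recode" branch of the diff, the idea can be modelled in plain Python. This is only a simplified sketch, not the pandas implementation: the function name `union_and_recode` and the `(codes, categories)` tuple representation are assumptions made for illustration, with `-1` standing in for a missing value as it does in Categorical codes.

```python
def union_and_recode(inputs, sort_categories=False):
    """Combine (codes, categories) pairs into one (codes, categories) pair.

    Hypothetical helper for illustration only; -1 marks a missing value.
    """
    # Union the categories, keeping the order of first appearance.
    categories = []
    seen = set()
    for _, cats in inputs:
        for cat in cats:
            if cat not in seen:
                seen.add(cat)
                categories.append(cat)

    # Optionally sort the unioned categories.
    if sort_categories:
        categories = sorted(categories)

    # Map every old code to the position of its category in the union.
    position = {cat: i for i, cat in enumerate(categories)}
    new_codes = []
    for codes, cats in inputs:
        remap = [position[cat] for cat in cats]
        new_codes.extend(remap[c] if c != -1 else -1 for c in codes)
    return new_codes, categories


inputs = [([0, 1], ["b", "c"]), ([0, 1, -1], ["a", "b"])]
appearance = union_and_recode(inputs)
lexsorted = union_and_recode(inputs, sort_categories=True)
print(appearance)  # ([0, 1, 2, 0, -1], ['b', 'c', 'a'])
print(lexsorted)   # ([1, 2, 0, 1, -1], ['a', 'b', 'c'])
```

In both calls the decoded values are the same (b, c, a, b, missing); only the category order and therefore the integer codes differ.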
Review comment: can you put companion tests with sort_categories=False
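A rough sketch of what such paired checks could look like is given here. It assumes a recent pandas release in which the function is importable from `pandas.api.types`; the import path at the time of this PR may have been different, and the variable names are illustrative rather than taken from the PR's test suite.

```python
import pandas as pd
from pandas.api.types import union_categoricals

a = pd.Categorical(["b", "c"])
b = pd.Categorical(["a", "b"])

# sort_categories=False (the default): categories keep their order of appearance.
unsorted_result = union_categoricals([a, b], sort_categories=False)

# sort_categories=True: same values, but the categories are sorted.
sorted_result = union_categoricals([a, b], sort_categories=True)

# Ordered inputs with identical categories can be combined,
# but asking for sorted categories is expected to raise TypeError.
ordered = pd.Categorical(["b", "a"], categories=["b", "a"], ordered=True)
try:
    union_categoricals([ordered, ordered], sort_categories=True)
    raised = False
except TypeError:
    raised = True

print(list(unsorted_result.categories))  # ['b', 'c', 'a']
print(list(sorted_result.categories))    # ['a', 'b', 'c']
```

Running each scenario with both flag values makes it clear that the flag only affects the category order, never the values.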