-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Categorical.unique() preserves categories #15439
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
you are changing |
pandas/tests/test_categorical.py
Outdated
self.assert_index_equal(res.categories, exp) | ||
tm.assert_categorical_equal(res, Categorical(exp)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nothing should be changing in this file
tm.assert_index_equal(df.groupby('A', sort=False).first().index, index) | ||
|
||
# ordered=False | ||
df = DataFrame({'A': pd.Categorical(list('ba'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC from the issue there are 4 cases? (product of sort=True/False, ordered=True/False)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
groupby
above and below is called twice (sort=True/False
) for each case of ordered
. There are four assertions.
0d0d0bf
to
8aea45f
Compare
tm.assert_index_equal(df.groupby('A', sort=False).first().index, index) | ||
|
||
# ordered=False | ||
df = DataFrame({'A': pd.Categorical(list('ba'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
groupby
above and below is called twice (sort=True/False
) for each case of ordered
. There are four assertions.
pandas/core/groupby.py
Outdated
@@ -2315,6 +2315,9 @@ def __init__(self, index, grouper=None, obj=None, name=None, level=None, | |||
# groupby | |||
else: | |||
cat = self.grouper.unique() | |||
cat.add_categories([c for c in self.grouper.categories | |||
if c not in cat.categories], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How strongly would self.grouper.categories[~self.grouper.categories.isin(cat.categories)]
be preferred?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i like that better! (also vectorized).
as an aside, prob should add an asv for categorical grouping.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kind of like this? I don't know what I'm doing. 8) flake8 diff surely doesn't pass now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can do:
git diff master | flake8 --diff
for flake8 checking locally :>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, but the asv additions cannot.
bafd99d
to
81a2183
Compare
doc/source/whatsnew/v0.20.0.txt
Outdated
@@ -539,6 +539,7 @@ Bug Fixes | |||
|
|||
- Bug in ``resample``, where a non-string ```loffset`` argument would not be applied when resampling a timeseries (:issue:`13218`) | |||
|
|||
- Bug in ``.groupby`` where ```.groupby(categorical, sort=False)`` would raise ``ValueError`` due to non-matching categories (:issue:`13179`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need a subsection for this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A whole subsection for an obvious bug? What would the title be?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how is this an obvious bug? a sub-section is not very long just showing what it did before and what it does now. Your whatsnew is so non-obvious now (and just changing wording is not going to help). An example is worth 1000 words.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant obvious as in the combination of sort=False
and a missing category in the data causing:
ValueError: items in new_categories are not the same as in old categories
There is no other change in behavior except this now works as expected?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
its not obvious at all. you can simply use the test example, show the old behavior and the new
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Is this, by chance, a subsection of Bug Fixes? It doesn't feel like api_breaking nor a particularly mentionable enhancement ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is this?
pandas/core/categorical.py
Outdated
@@ -1851,7 +1851,7 @@ def unique(self): | |||
""" | |||
|
|||
# unlike np.unique, unique1d does not sort | |||
unique_codes = unique1d(self.codes) | |||
unique_codes = unique1d(self.codes).astype(self.codes.dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use .astype(copy=False)
this is actually a bug (separate)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs a ticket?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually you don't need this change. This is just for taking (which needs to be int64 anyhow; it will upcast it later if its not).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In case of groupby it will be upcast, but in general, .unique() should be type preserving?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, never mind. :)
pandas/core/groupby.py
Outdated
@@ -2315,6 +2315,11 @@ def __init__(self, index, grouper=None, obj=None, name=None, level=None, | |||
# groupby | |||
else: | |||
cat = self.grouper.unique() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm I am wondering why we are doing any of this. isn't self.grouper.categories
the end result that we want anyway?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but in case of sort=False
, some expect the result categories sorted in the order of encounter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ahh, ok for that case then (though try to use vectorized operations here). In fact probably create a method on Categorical
, call it ._codes_for_grouping(sort=True/False)
where you can handle this in a method (rather than clutter it up here).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is vectorized, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.grouper
can also be a CategoricalIndex
. How do I get one from the other?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep that is fine (though return a new object, never use inplace
).
add the method to a CI
as well (which just calls it on the .values
), its a pass thru
Codecov Report
@@ Coverage Diff @@
## master #15439 +/- ##
==========================================
+ Coverage 90.37% 90.37% +<.01%
==========================================
Files 135 135
Lines 49476 49481 +5
==========================================
+ Hits 44713 44720 +7
+ Misses 4763 4761 -2
Continue to review full report at Codecov.
|
8f47bd9
to
a7cd756
Compare
pandas/core/categorical.py
Outdated
@@ -602,6 +602,21 @@ def _get_categories(self): | |||
categories = property(fget=_get_categories, fset=_set_categories, | |||
doc=_categories_doc) | |||
|
|||
def _codes_for_groupby(self, sort=False): | |||
""" Return a Categorical adjusted for groupby """ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you can add a doc-string (Parameters, Returns), just like its a publick method.
a7cd756
to
52c1bc8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks pretty good
just like to give the docs another pass
pandas/core/categorical.py
Outdated
def _codes_for_groupby(self, sort): | ||
""" | ||
Return a Categorical adjusted for groupby | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
adjusted is not very clear ; can you make this a bit more clear (and the comments below)
pretend you have never seen this and see it from a new reader perspective
4842bc7
to
82ff3d4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good!
I only have one question: should we have different behaviour of sort=False
depending on whether the categorical is ordered or not?
Because now the sort=False
seems to be ignored in case of an ordered categorical?
doc/source/whatsnew/v0.20.0.txt
Outdated
GroupBy on Categoricals | ||
^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
In previous version, ``.groupby(..., sort=False)`` would fail with a ``ValueError`` when grouping on a categorical series with some categories not appearing in the data. (:issue:`13179`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
previous version -> previous versions
82ff3d4
to
55733b8
Compare
hmm, @kernc can you take a look at the original issue (in the reference from the code) and see if you can divine what is happening? |
Indeed, the order is retained by Semantically, I don't think the order should be lost if the categorical is ordered. The relationships This lgtm; please edit to your liking. |
thanks @kernc nice PR! keep em coming! |
closes pandas-dev#13179 Author: Kernc <[email protected]> Closes pandas-dev#15439 from kernc/Categorical.unique-nostrip-unused and squashes the following commits: 55733b8 [Kernc] fixup! BUG: Fix .groupby(categorical, sort=False) failing 2aec326 [Kernc] fixup! BUG: Fix .groupby(categorical, sort=False) failing c813146 [Kernc] PERF: add asv for categorical grouping 0c550e6 [Kernc] BUG: Fix .groupby(categorical, sort=False) failing
git diff upstream/master | flake8 --diff