-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: fix categorical comparison with missing values (#26504 ) #26514
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hello @yanglinlee! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2019-06-01 00:32:57 UTC |
Codecov Report
@@ Coverage Diff @@
## master #26514 +/- ##
==========================================
- Coverage 91.76% 91.75% -0.01%
==========================================
Files 174 174
Lines 50679 50684 +5
==========================================
Hits 46504 46504
- Misses 4175 4180 +5
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #26514 +/- ##
==========================================
- Coverage 91.84% 91.83% -0.01%
==========================================
Files 174 174
Lines 50644 50647 +3
==========================================
- Hits 46515 46514 -1
- Misses 4129 4133 +4
Continue to review full report at Codecov.
|
@yanglinlee : Thanks for submitting this! Can you also add a |
Sure! Done~ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So to be clear this has nothing to do with the None
value right? Isn't this generally applicable to values that don't fall into a category and if so can you update whatsnew / commentary and tests to reflect that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am puzzled of the purpose of this PR. We don't' allow any comparison of categorical except for eq / ne for non-ordered. I would be ok with simply raising the same TypeError message that we already do. But what is the purpose of this expansion?
Just want to get some clarification:
If we allow missing values in the array, then the result should not be raising errors. Scenario 2: Values that don't fall into a category as WillAyd suggested.
Currently it doesn't raise TypeError. If you want to raise TypeError for both scenarios, I can change the behavior of the comparison with scalar. Let me know your suggestions. Thanks. |
if the above scenarios compare with a listlike they raise i believe |
I tested comparison with listlike, and it didn't raise TypeError. It only takes comparison with None as False.
In this case, do we need to change the behavior of comparison to listlike? So in both cases, the comparison to None will raise Error. |
The purpose is quite clear to me: fixing a bug. Which is clearly described in the issue: #26504 The bug is that when doing a comparison (which is allowed for ordered categoricals), missing values always evaluate to True or False (depending on which operations, because they are encoded as -1 and we use that value in the comparison). To show this in another way:
The object dtype has the correct behaviour: a missing value gives False in the result.
This is not about "None", or also not about values that don't fall within the category, but is is about missing values (but both None as non-existent categories when constructing the categorical lead to missing values in the actual categorical) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally looking good to me, thanks for the PR!
Can you address the comment of @gfyoung about the whatsnew message?
this needs to raise for non ordered categoricals that is not addressed in this PR as there are no tests for this furthermore the behavior for listlikes must also be the same (raise for non ordered) |
I think the issue was about ordered Categoricals. unordered does indeed raise In [3]: import pandas as pd
...: pd.Categorical(["1", "2", "3", None], categories=["1", "2", "3"]) <= "2"
...:
...:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-3-1804d8ae703c> in <module>
1 import pandas as pd
----> 2 pd.Categorical(["1", "2", "3", None], categories=["1", "2", "3"]) <= "2"
3
~/sandbox/pandas/pandas/core/arrays/categorical.py in f(self, other)
64 if not self.ordered:
65 if op in ['__lt__', '__gt__', '__le__', '__ge__']:
---> 66 raise TypeError("Unordered Categoricals can only compare "
67 "equality or not")
68 if isinstance(other, Categorical):
TypeError: Unordered Categoricals can only compare equality or not |
This does already raise, that is not the issue.
This does already work for listlikes. Actually this PR basically copied the code from the code branch for list-likes to the code branch of scalars. |
Gotcha - so in both in the original issue and here in the whatsnew we keep referring to @yanglinlee looks like you might have some other changes to the whatsnew but can you account for that on next edit? Ping back if you are stuck on wording |
Yes, it was by using "None" instead of "missing values" that caused some confusion I think. I already changed the title of this PR. |
Thank everyone for the helpful discussions! This is great suggestion! I will update the PR after work tonight. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yanglinlee Thanks for the updates. Added a few more comments.
Co-Authored-By: Joris Van den Bossche <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm outside of @jreback comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what happened with the last CI run. Try merging master and pushing again.
@yanglinlee CI is being fixed in #26586. Merge master & repush once that's in. |
thanks @yanglinlee |
git diff upstream/master -u -- "*.py" | flake8 --diff
This PR fixes issues comparison of ordered categorical with missing values evaluates to True(#26504 ). Now making comparison with missing values always gives False. The missing values could be introduced when constructing a Categorical with values that don't fall in with the categories provided.