BUG: Fix construction of Categorical from pd.NA #31939
can you add another value after nulls_fixture here to show that we are actually getting the correct codes
I'm not sure I understand this request. I did just realize though that pd.NA is not even part of nulls_fixture, so I will need to change this somehow.
it is, rebase on master
values = ['a', nulls_fixture, 'b']
Am I looking at the wrong fixture? I wasn't seeing pd.NA here: https://github.com/pandas-dev/pandas/blob/master/pandas/conftest.py#L444
Will be added in #31799
I am not sure this is a very good test. It checks that lists and object arrays give the same result (which is useful anyway, as those should be consistent), but it does not test how they are now constructed (e.g., it won't "preserve" pd.NA, and this is also not tested).
@dsaxton can you parameterize this on klass (np.array and list), then hard-code the results in a categorical (meaning use _from_codes and an explicit list of categories)
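A minimal sketch of what such a check could look like, assuming a pandas version that includes this fix. The helper names (`as_list`, `as_object_array`, `check_categorical_from_nulls`) and the `NULLS` list are hypothetical stand-ins for the real fixture and test, and the public `pd.Categorical.from_codes` constructor is used to hard-code the expected result:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the values provided by nulls_fixture.
NULLS = [None, np.nan, pd.NA]


def as_list(values):
    # "klass" variant 1: a plain Python list.
    return list(values)


def as_object_array(values):
    # "klass" variant 2: a NumPy array with object dtype.
    return np.array(values, dtype=object)


def check_categorical_from_nulls():
    # Hard-coded expectation: 'a' -> 0, missing -> -1, 'b' -> 1.
    expected = pd.Categorical.from_codes([0, -1, 1], categories=["a", "b"])
    for klass in (as_list, as_object_array):
        for null in NULLS:
            result = pd.Categorical(klass(["a", null, "b"]))
            assert result.codes.tolist() == expected.codes.tolist()
            assert list(result.categories) == list(expected.categories)


check_categorical_from_nulls()
```

Placing "b" after the missing value is what makes the codes informative: a wrong encoding of the null would shift or change the code assigned to "b".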
wait, pd.NA is actually converted to np.nan here?
Yes, probably not ideal (but better than an error). If merged would a follow-up issue to make sure pd.NA is used make sense? Or I could mark that the referenced issue is not actually closed and comment there.
no, pls do it here
followups are ok, but for relatively small things just fixing it in the same PR is better
I'd have to look more closely, but I'm not sure if having it return pd.NA instead of np.nan is an easy fix; this is already how it behaves for list input (which seems to be the documented behavior).
I think having it so that we at least get the same output and not an error for a numpy array with object dtype is still an improvement though? What are your thoughts @WillAyd ?
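As a rough illustration of the consistency being argued for here (not the snippet from the original comment), the sketch below builds a Categorical from a list and from an object-dtype NumPy array. It assumes a pandas version that includes this fix; the variable names are illustrative only:

```python
import numpy as np
import pandas as pd

from_list = pd.Categorical(["a", pd.NA])
from_array = pd.Categorical(np.array(["a", pd.NA], dtype=object))

# In both cases the missing entry is encoded with the sentinel code -1
# and is not added to the categories.
print(from_list.codes.tolist(), list(from_list.categories))
print(from_array.codes.tolist(), list(from_array.categories))
```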
Agree it would be nice to maintain pd.NA - do you know the extra effort involved to do so?
I'm not sure, I'd need to investigate a bit more. The logic doesn't seem too obvious though; should pd.NA be the default for missing values, or only when it's explicitly encountered? How should mixed missing value types get handled during construction (e.g., pd.Categorical(["a", np.nan, pd.NA]))? Personally I think having pd.NA as the default makes sense, but that seems like a large change.
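To make the mixed-missing-value question concrete, here is a small sketch of what the construction reports today, assuming a pandas version that includes this fix. It only inspects the codes and the missing-value mask, since which missing-value object is returned on element access is exactly the open question in this thread:

```python
import numpy as np
import pandas as pd

cat = pd.Categorical(["a", np.nan, pd.NA])

# Both np.nan and pd.NA are treated as missing: they share the
# sentinel code -1 and neither becomes a category.
codes = cat.codes.tolist()
missing_mask = cat.isna().tolist()
print(codes, missing_mask, list(cat.categories))
```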
Curious if @jorisvandenbossche has a preference? Always using pd.NA seems logical, probably just a question of "when" to implement that (as a lot of people are likely using Categorical and expecting to see NaN).