-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Fix construction of Categorical from pd.NA #31939
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
@@ -458,6 +458,14 @@ def test_constructor_with_categorical_categories(self): | |||
result = Categorical(["a", "b"], categories=CategoricalIndex(["a", "b", "c"])) | |||
tm.assert_categorical_equal(result, expected) | |||
|
|||
def test_construction_with_null(self, nulls_fixture): | |||
# https://github.com/pandas-dev/pandas/issues/31927 | |||
values = ["a", nulls_fixture] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add another value after nulls_fixture here to show that we are actually getting the correct codes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand this request. I did just realize though that pd.NA
is not even part of nulls_fixture
so will need to change this somehow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is, rebase on master
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
values = ['a', nulls_fixture, 'b']
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Am I looking at the wrong fixture? I wasn't seeing pd.NA here: https://github.com/pandas-dev/pandas/blob/master/pandas/conftest.py#L444
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will be added in #31799
def test_multiindex_from_product_contains_na(self): | ||
# https://github.com/pandas-dev/pandas/issues/31883 | ||
values1 = [np.array([0.0, pd.NA], dtype="object"), ["a", "b"]] | ||
values2 = [np.array([0.0, np.nan], dtype="object"), ["a", "b"]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wait, pd.NA is actually converted to np.nan here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, probably not ideal (but better than an error). If merged would a follow-up issue to make sure pd.NA is used make sense? Or I could mark that the referenced issue is not actually closed and comment there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, pls do it here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
followups are ok, but for relatively small things just fixing it in the same PR is better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd have to look more closely, but I'm not sure if having it return pd.NA instead of np.nan is an easy fix; this is already how it behaves for list input (which seems to be the documented behavior):
In [1]: import pandas as pd
...:
...: values = ["a", pd.NA]
...:
...: pd.Categorical(values)
...:
Out[1]:
[a, NaN]
Categories (1, object): [a]
In [2]: pd.__version__
Out[2]: '1.0.1'
I think having it so that we at least get the same output and not an error for a numpy array with object dtype is still an improvement though? What are your thoughts @WillAyd ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree it would be nice to maintain pd.NA - do you know the extra effort involved to do so?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure, I'd need to investigate a bit more. The logic doesn't seem too obvious though; should pd.NA
be the default for missing values, or only when it's explicitly encountered? How should mixed missing value types get handled during construction (e.g., pd.Categorical(["a", np.nan, pd.NA])
)? Personally I think having pd.NA
as the default makes sense, but that seems like a large change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious if @jorisvandenbossche has a preference? Always using pd.NA
seems logical, probably just a question of "when" to implement that (as a lot of people are likely using Categorical
and expecting to see NaN
).
also pls run appropriate asvs this is a very performance sensitive path |
Not sure if others have encountered this, but I get a strange error trying to run the benchmark suite:
|
This reverts commit 14a737d.
tuples = [(0.0, "a"), (0.0, "b"), (np.nan, "a"), (np.nan, "b")] | ||
|
||
result = pd.MultiIndex.from_product(values) | ||
expected = pd.MultiIndex.from_tuples(tuples) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not correct, pd.NA should be preserved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this might be a non-trivial patch and i would separate it from this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove this issue from the PR as this is not correct.
BTW (short reaction, will look in more detail at the PR tomorrow), I don't we can preserve pd.NA in Categorical: because we can't yet have a Cateogorical with string dtype categories (you can't yet have an Index with nullable dtype). That's also why I said in the issue that I was not sure it should actually work. Although I assume it is nice that it works and falls back to the normal object dtype backed categorical. But that means we will change behaviour when we can actually have a string dtype backed Categorical. |
# https://github.com/pandas-dev/pandas/issues/31927 | ||
values = ["a", nulls_fixture, "b"] | ||
result = Categorical(np.array(values, dtype=object)) | ||
expected = Categorical(values) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure this is a very good test. I mean: it is testing that lists vs object array are giving the same result (which is useful anyhow, as those should be consistent), but it is not testing how they are now constructed (eg it won't "preserve" pd.NA, and this is also not tested)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dsaxton can you parameterize this on klass (np.array and list), then hard code the results in a categorical (meaning use _from_codes and an explict list of categories)
Co-Authored-By: Joris Van den Bossche <[email protected]>
tuples = [(0.0, "a"), (0.0, "b"), (np.nan, "a"), (np.nan, "b")] | ||
|
||
result = pd.MultiIndex.from_product(values) | ||
expected = pd.MultiIndex.from_tuples(tuples) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove this issue from the PR as this is not correct.
Not sure why that test needed to be removed, as it still tests the behaviour that this PR is enabling. See my comment above (#31939 (comment)) about that it is currently not possible to have pd.NA in Categoricals or MultiIndex. So either we are fine with the error you get now (and so this PR is not needed) or either we are fine with it being converted to np.nan (for now), but then we should also test that? |
# https://github.com/pandas-dev/pandas/issues/31927 | ||
values = ["a", nulls_fixture, "b"] | ||
result = Categorical(np.array(values, dtype=object)) | ||
expected = Categorical(values) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dsaxton can you parameterize this on klass (np.array and list), then hard code the results in a categorical (meaning use _from_codes and an explict list of categories)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @dsaxton, comment about a followup issue for the changes to use checknull.
@meeseeksdev backport to 1.0.x |
…32200) Co-authored-by: Daniel Saxton <[email protected]>
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff