BUG: Fix construction of Categorical from pd.NA #31939

dsaxton · 2020-02-12T21:17:04Z

closes Construction of Categorical from array with pd.NA failing #31927
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

pandas/_libs/hashtable_class_helper.pxi.in

pandas/tests/arrays/categorical/test_constructors.py

pandas/_libs/hashtable_class_helper.pxi.in

WillAyd

lgtm

doc/source/whatsnew/v1.0.2.rst

pandas/_libs/hashtable_class_helper.pxi.in

jreback · 2020-02-16T14:59:54Z

pandas/tests/arrays/categorical/test_constructors.py

@@ -458,6 +458,14 @@ def test_constructor_with_categorical_categories(self):
        result = Categorical(["a", "b"], categories=CategoricalIndex(["a", "b", "c"]))
        tm.assert_categorical_equal(result, expected)

+    def test_construction_with_null(self, nulls_fixture):
+        # https://github.com/pandas-dev/pandas/issues/31927
+        values = ["a", nulls_fixture]


can you add another value after nulls_fixture here to show that we are actually getting the correct codes

I'm not sure I understand this request. I did just realize though that pd.NA is not even part of nulls_fixture so will need to change this somehow.

it is, rebase on master

values = ['a', nulls_fixture, 'b']

Am I looking at the wrong fixture? I wasn't seeing pd.NA here: https://github.com/pandas-dev/pandas/blob/master/pandas/conftest.py#L444

Will be added in #31799
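The reason for asking for a value after the null can be illustrated with the codes themselves. The following is a minimal sketch, assuming a pandas version that includes this fix (1.0.2 or later); the variable names are illustrative and not taken from the PR.

```python
import numpy as np
import pandas as pd

# Object-dtype array with a missing value between two real values.
values = np.array(["a", pd.NA, "b"], dtype=object)
cat = pd.Categorical(values)

# Missing entries are stored with the sentinel code -1, and the value
# after the null should still receive the next regular code.
codes = cat.codes.tolist()
# Missing values never become a category themselves.
categories = list(cat.categories)

print(codes)
print(categories)
```

With only `["a", null]` the test could not show that values following the null are still coded correctly.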

jreback · 2020-02-16T15:00:49Z

pandas/tests/indexing/multiindex/test_multiindex.py

+    def test_multiindex_from_product_contains_na(self):
+        # https://github.com/pandas-dev/pandas/issues/31883
+        values1 = [np.array([0.0, pd.NA], dtype="object"), ["a", "b"]]
+        values2 = [np.array([0.0, np.nan], dtype="object"), ["a", "b"]]


wait, pd.NA is actually converted to np.nan here?

Yes, probably not ideal (but better than an error). If merged would a follow-up issue to make sure pd.NA is used make sense? Or I could mark that the referenced issue is not actually closed and comment there.

no, pls do it here

followups are ok, but for relatively small things just fixing it in the same PR is better

I'd have to look more closely, but I'm not sure if having it return pd.NA instead of np.nan is an easy fix; this is already how it behaves for list input (which seems to be the documented behavior):

    In [1]: import pandas as pd
       ...:
       ...: values = ["a", pd.NA]
       ...:
       ...: pd.Categorical(values)
       ...:
    Out[1]:
    [a, NaN]
    Categories (1, object): [a]

    In [2]: pd.__version__
    Out[2]: '1.0.1'

I think having it so that we at least get the same output and not an error for a numpy array with object dtype is still an improvement though? What are your thoughts @WillAyd ?

Agree it would be nice to maintain pd.NA - do you know the extra effort involved to do so?

I'm not sure, I'd need to investigate a bit more. The logic doesn't seem too obvious though; should pd.NA be the default for missing values, or only when it's explicitly encountered? How should mixed missing value types get handled during construction (e.g., pd.Categorical(["a", np.nan, pd.NA]))? Personally I think having pd.NA as the default makes sense, but that seems like a large change.

Curious if @jorisvandenbossche has a preference? Always using pd.NA seems logical, probably just a question of "when" to implement that (as a lot of people are likely using Categorical and expecting to see NaN).
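To make the mixed-missing-value question concrete, here is a small sketch of the behaviour discussed in this thread, assuming a pandas version that includes this fix. It is not a statement about how the behaviour should look in the future; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Several different missing-value markers in one input.
values = ["a", np.nan, pd.NA, None]

from_list = pd.Categorical(values)
from_array = pd.Categorical(np.array(values, dtype=object))

# Every missing marker collapses to the same sentinel code.
list_codes = from_list.codes.tolist()
array_codes = from_array.codes.tolist()

# Reading a missing element back gives NaN rather than pd.NA.
missing = from_list[2]

print(list_codes, array_codes, missing)
```

The original kind of missing value is not recorded anywhere in the result, so the input marker cannot be recovered after construction.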

pandas/tests/indexing/multiindex/test_multiindex.py

pandas/_libs/hashtable_class_helper.pxi.in

jreback · 2020-02-17T18:19:28Z

also pls run appropriate asvs this is a very performance sensitive path

pandas/_libs/missing.pyx

dsaxton · 2020-02-17T18:45:57Z

> also pls run appropriate asvs this is a very performance sensitive path

Not sure if others have encountered this, but I get a strange error trying to run the benchmark suite:

(pandas-dev) danielsaxton> asv continuous -f 1.1 upstream/master na-cat
Traceback (most recent call last):
  File "/Users/danielsaxton/opt/miniconda3/envs/pandas-dev/bin/asv", line 5, in <module>
    from asv.main import main
  File "/Users/danielsaxton/opt/miniconda3/envs/pandas-dev/lib/python3.7/site-packages/asv/__init__.py", line 32, in <module>
    from . import plugin_manager
  File "/Users/danielsaxton/opt/miniconda3/envs/pandas-dev/lib/python3.7/site-packages/asv/plugin_manager.py", line 63, in <module>
    commands.__doc__ = commands._make_docstring()
  File "/Users/danielsaxton/opt/miniconda3/envs/pandas-dev/lib/python3.7/site-packages/asv/commands/__init__.py", line 96, in _make_docstring
    parser, subparsers = make_argparser()
  File "/Users/danielsaxton/opt/miniconda3/envs/pandas-dev/lib/python3.7/site-packages/asv/commands/__init__.py", line 84, in make_argparser
    subparser = commands[str(command)].setup_arguments(subparsers)
  File "/Users/danielsaxton/opt/miniconda3/envs/pandas-dev/lib/python3.7/site-packages/asv/commands/machine.py", line 25, in setup_arguments
    defaults = machine.Machine.get_defaults()
  File "/Users/danielsaxton/opt/miniconda3/envs/pandas-dev/lib/python3.7/site-packages/asv/machine.py", line 146, in get_defaults
    cpu = util.get_cpu_info()
  File "/Users/danielsaxton/opt/miniconda3/envs/pandas-dev/lib/python3.7/site-packages/asv/util.py", line 996, in get_cpu_info
    sysctl = which('sysctl')
  File "/Users/danielsaxton/opt/miniconda3/envs/pandas-dev/lib/python3.7/site-packages/asv/util.py", line 361, in which
    raise IOError("Could not find '{0}' in {1}".format(filename, loc_info))
OSError: Could not find 'sysctl' in PATH
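
The traceback shows asv failing while looking up `sysctl` on `PATH`. A minimal diagnostic sketch follows, assuming (this is not confirmed anywhere in the thread) that the binary exists in a system directory such as `/usr/sbin` that is missing from `PATH` in the active environment. `find_tool` and `extra_dirs` are hypothetical names, not part of asv.

```python
import os
import shutil


def find_tool(name, extra_dirs=("/usr/sbin", "/sbin")):
    """Return the path to `name`, also searching a few system directories.

    `find_tool` and `extra_dirs` are illustrative and not part of asv.
    """
    found = shutil.which(name)
    if found is not None:
        return found
    # Retry with the extra directories appended to the current PATH.
    search_path = os.pathsep.join([os.environ.get("PATH", ""), *extra_dirs])
    return shutil.which(name, path=search_path)


location = find_tool("sysctl")
print(location or "sysctl not found; check PATH in the active environment")
```

If the second lookup succeeds where the first fails, adding that directory to `PATH` before invoking `asv` would be one thing to try.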

This reverts commit 14a737d.

jreback · 2020-02-17T20:12:40Z

pandas/tests/indexing/multiindex/test_multiindex.py

+        tuples = [(0.0, "a"), (0.0, "b"), (np.nan, "a"), (np.nan, "b")]
+
+        result = pd.MultiIndex.from_product(values)
+        expected = pd.MultiIndex.from_tuples(tuples)


this is not correct, pd.NA should be preserved.

this might be a non-trivial patch and i would separate it from this PR.

remove this issue from the PR as this is not correct.

jorisvandenbossche · 2020-02-17T21:56:59Z

BTW (short reaction, will look in more detail at the PR tomorrow), I don't think we can preserve pd.NA in Categorical, because we can't yet have a Categorical with string dtype categories (you can't yet have an Index with nullable dtype).

That's also why I said in the issue that I was not sure it should actually work. Although I assume it is nice that it works and falls back to the normal object dtype backed categorical. But that means we will change behaviour when we can actually have a string dtype backed Categorical.

doc/source/whatsnew/v1.0.2.rst

pandas/_libs/hashtable_class_helper.pxi.in

jorisvandenbossche · 2020-02-18T08:02:08Z

pandas/tests/arrays/categorical/test_constructors.py

+        # https://github.com/pandas-dev/pandas/issues/31927
+        values = ["a", nulls_fixture, "b"]
+        result = Categorical(np.array(values, dtype=object))
+        expected = Categorical(values)


I am not sure this is a very good test. I mean: it is testing that lists vs object arrays give the same result (which is useful anyhow, as those should be consistent), but it is not testing how they are now constructed (e.g. it won't "preserve" pd.NA, and this is also not tested).

@dsaxton can you parametrize this on klass (np.array and list), then hard-code the results in a categorical (meaning use _from_codes and an explicit list of categories)
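
A sketch of what such a test might look like, written as a plain function rather than with pytest so that it runs standalone. It uses the public `Categorical.from_codes` and a small hand-picked set of null values in place of `nulls_fixture`. The helper names are illustrative, and this is not necessarily the test that was merged.

```python
import numpy as np
import pandas as pd


def to_object_array(x):
    # Stand-in for the np.array "klass": build an object-dtype ndarray.
    return np.array(x, dtype=object)


def check_construction_with_null(klass, null):
    values = klass(["a", null, "b"])
    result = pd.Categorical(values)

    # Hard-coded expectation: the null gets code -1 and is not a category.
    expected = pd.Categorical.from_codes([0, -1, 1], categories=["a", "b"])

    assert result.codes.tolist() == expected.codes.tolist()
    assert list(result.categories) == list(expected.categories)
    return result


results = [
    check_construction_with_null(klass, null)
    for klass in (list, to_object_array)
    for null in (None, np.nan, pd.NA)
]
```

Hard-coding the expected codes and categories means the test pins down the constructed result itself, not just agreement between the two input types.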

Co-Authored-By: Joris Van den Bossche <[email protected]>

pandas/_libs/hashtable_class_helper.pxi.in

jreback · 2020-02-18T23:57:02Z

pandas/tests/indexing/multiindex/test_multiindex.py

+        tuples = [(0.0, "a"), (0.0, "b"), (np.nan, "a"), (np.nan, "b")]
+
+        result = pd.MultiIndex.from_product(values)
+        expected = pd.MultiIndex.from_tuples(tuples)


remove this issue from the PR as this is not correct.

jorisvandenbossche · 2020-02-19T08:30:18Z

> remove this issue from the PR as this is not correct.

Not sure why that test needed to be removed, as it still tests the behaviour that this PR is enabling.

See my comment above (#31939 (comment)) explaining that it is currently not possible to have pd.NA in Categoricals or MultiIndex. So either we are fine with the error you get now (and so this PR is not needed), or we are fine with it being converted to np.nan (for now), but then we should also test that?

jreback · 2020-02-22T15:35:28Z

pandas/tests/arrays/categorical/test_constructors.py

+        # https://github.com/pandas-dev/pandas/issues/31927
+        values = ["a", nulls_fixture, "b"]
+        result = Categorical(np.array(values, dtype=object))
+        expected = Categorical(values)


@dsaxton can you parametrize this on klass (np.array and list), then hard-code the results in a categorical (meaning use _from_codes and an explicit list of categories)

jreback

thanks @dsaxton, comment about a followup issue for the changes to use checknull.

pandas/_libs/hashtable_class_helper.pxi.in

jreback · 2020-02-23T15:00:05Z

@meeseeksdev backport to 1.0.x


Daniel Saxton added 6 commits February 12, 2020 10:29

Add test with NA

99dbff4

Add GH issue

81516a6

Check for NA

78d62f9

Update whatsnew

38fede6

Add MultiIndex test

52466ab

whatsnew for MultiIndex

1a71728

WillAyd requested changes Feb 12, 2020

pandas/_libs/hashtable_class_helper.pxi.in (outdated)

pandas/tests/arrays/categorical/test_constructors.py (outdated)

Daniel Saxton added 2 commits February 12, 2020 17:25

Parametrize test

bad5be3

Use pd.NA

b051bf0

WillAyd requested changes Feb 13, 2020

pandas/_libs/hashtable_class_helper.pxi.in (outdated)

Import C_NA

563b673

WillAyd approved these changes Feb 13, 2020


WillAyd added the Categorical, MultiIndex, and NA - MaskedArrays labels Feb 13, 2020

simonjayhawkins modified the milestones: 1.1, 1.0.2 Feb 15, 2020

Merge branch 'master' into na-cat

9066789

jreback requested changes Feb 16, 2020


Daniel Saxton added 5 commits February 16, 2020 09:39

Edit release note

7da4e44

Align conditions

d1a953b

Construct from tuples in test

baab1d5

Merge branch 'master' into na-cat

062f5f7

Add a string

2d45b21

jreback requested changes Feb 17, 2020

pandas/_libs/hashtable_class_helper.pxi.in

Use checknull

14a737d

Merge branch 'master' into na-cat

f0eb9f3

dsaxton commented Feb 17, 2020

pandas/_libs/missing.pyx (outdated)

Revert "Use checknull"

a54fe0d

This reverts commit 14a737d.

jreback requested changes Feb 17, 2020


Merge branch 'master' into na-cat

17de660

jorisvandenbossche reviewed Feb 18, 2020


Update doc/source/whatsnew/v1.0.2.rst

78e38ec

Co-Authored-By: Joris Van den Bossche <[email protected]>

jreback requested changes Feb 18, 2020


Daniel Saxton and others added 2 commits February 18, 2020 18:40

Take out MultiIndex stuff

0efcdb0

Merge branch 'master' into na-cat

a04df9b

jreback requested changes Feb 22, 2020


dsaxton added 2 commits February 22, 2020 21:57

Merge remote-tracking branch 'upstream/master' into na-cat

3c5082e

Parametrize and hard code

d50f963

jreback approved these changes Feb 23, 2020

pandas/_libs/hashtable_class_helper.pxi.in

jreback merged commit 41bc226 into pandas-dev:master Feb 23, 2020

jreback added the Still Needs Manual Backport label Feb 23, 2020

meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Feb 23, 2020

Backport PR pandas-dev#31939: BUG: Fix construction of Categorical fr…

b02b819

…om pd.NA

meeseeksmachine mentioned this pull request Feb 23, 2020

Backport PR #31939 on branch 1.0.x (BUG: Fix construction of Categorical from pd.NA) #32200

Merged

jreback removed the Still Needs Manual Backport label Feb 23, 2020

dsaxton deleted the na-cat branch February 23, 2020 15:10

jreback pushed a commit that referenced this pull request Feb 23, 2020

Backport PR #31939: BUG: Fix construction of Categorical from pd.NA (#…

791c7fc

…32200) Co-authored-by: Daniel Saxton <[email protected]>

dsaxton mentioned this pull request Feb 23, 2020

Clean up null checking #32206

Closed

roberthdevries pushed a commit to roberthdevries/pandas that referenced this pull request Mar 2, 2020

BUG: Fix construction of Categorical from pd.NA (pandas-dev#31939)

e740263

TomAugspurger mentioned this pull request Mar 9, 2020

Add test for MultiIndex Construction with pd.NA #31883

Closed

simonjayhawkins mentioned this pull request Apr 23, 2020

pd.NA TypeError in drop_duplicates with object dtype #32992

Closed
