REGR: Categorical with np.str_ categories #31528

jbrockmendel · 2020-02-01T03:40:31Z

closes Pandas 1.0 no longer handles numpy.str_s as catgories #31499
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

TomAugspurger

LGTM. Is this something we want to support long-term?

doc/source/whatsnew/v1.0.1.rst

jreback · 2020-02-01T15:06:53Z

pandas/tests/arrays/categorical/test_constructors.py

@@ -408,6 +408,11 @@ def test_constructor_str_unknown(self):
        with pytest.raises(ValueError, match="Unknown dtype"):
            Categorical([1, 2], dtype="foo")

+    def test_constructor_np_strs(self):
+        # GH#31499 Hastable.map_locations needs to work on np.str_ objects
+        cat = pd.Categorical(["1", "0", "1"], [np.str_("0"), np.str_("1")])


Actually, I think we should be sanitizing these inputs on Index construction; we already do this for Series IIRC (deep in the block manager, I think). Can you see if that's better (as it's a more maintainable long-term solution)? Not averse to your change, but it kicks in after we have already saved the inputs.

jreback

i would prefer to actually fix this rather than work around np.str_ which is not used anywhere else.

TomAugspurger · 2020-02-03T12:37:42Z

What's the proposed fix? Convert these to regular strings early on?

IMO, we should deprecate the old behavior first if it isn't too costly. This PR doesn't seem too bad, though I don't really understand all its implications.

jorisvandenbossche · 2020-02-03T12:42:11Z

I think we should be sanitizing these inputs on Index construction; we already do this for Series IIRC

I thought so as well, but that doesn't seem to be the case:

In [1]: s = pd.Series([np.str_("0"), np.str_("1")])  

In [2]: s[0]        
Out[2]: '0'

In [3]: type(s[0])  
Out[3]: numpy.str_

I would keep a possible sanitization for a separate issue / discussion. This PR seems like an easy fix to address the regression.

jreback · 2020-02-03T13:12:10Z

this is just an unsupportable bandaid. this is handling a leak of np.str_ into the internals, which is really bad. I don't think it's worth trying to fix this for 1.0.1 like this; I would rather address it with a real, systematic fix.

This is unsupportable because it hides the issue in one particular place, rather than actually handling it (by converting np.str_ to str on construction).

jorisvandenbossche · 2020-02-03T13:46:53Z

this is just an unsupportable bandaid. this is handling a leak of np.str_ into the internals, which is really bad.

It is what we did before for years. So I don't think it is that unsupportable.

Since we can still store numpy strings in a Series, and since we supported converting those to a Categorical before, I think this is a good fix.
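A minimal sketch of the behavior described above, assuming a pandas version that includes this fix. The explicit `dtype=object` and the variable names are illustrative choices, not taken from the PR.

```python
import numpy as np
import pandas as pd

# A Series can hold numpy string scalars; dtype=object is passed explicitly
# here (an assumption of this sketch) so no string-dtype inference applies.
s = pd.Series([np.str_("0"), np.str_("1")], dtype=object)

# Converting that Series to a Categorical: one code per element.
cat_from_series = pd.Categorical(s)

# Passing np.str_ objects as the categories, as in the test added by this PR.
cat = pd.Categorical(["1", "0", "1"], categories=[np.str_("0"), np.str_("1")])

# Each value is matched against the categories by hash and equality,
# so builtin "1" is located at the position of np.str_("1").
codes = list(cat.codes)
```

On a version with the regression, the second constructor call is the one reported as failing in #31499.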

For 1.1, we can discuss further if we want to keep supporting this, or want to deprecate it, or want to sanitize on input (to avoid needing to support it).

TomAugspurger · 2020-02-03T13:57:30Z

For 1.1, we can discuss further if we want to keep supporting this, or want to deprecate it, or want to sanitize on input (to avoid needing to support it).

Agreed.

jreback · 2020-02-03T15:12:59Z

These last-minute patches are just causing more and more issues. Do what you will with this.

TomAugspurger · 2020-02-03T18:41:36Z

@jbrockmendel are you aware of any maintenance burdens or ambiguities this might cause?

At least on 0.25.3, we seem to treat str and np.str_ equivalently in methods like get_loc and unique
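A short sketch of why the two types tend to behave the same in hash-based lookups: np.str_ subclasses the builtin str, so equality and hashing agree. The object-dtype Index and the variable names are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

plain = "0"
np_scalar = np.str_("0")

# np.str_ is a subclass of str, so it compares and hashes like one.
is_subclass = isinstance(np_scalar, str)
same_value = np_scalar == plain
same_hash = hash(np_scalar) == hash(plain)

# An object-dtype Index looks up labels by hash and equality,
# so either scalar type resolves to the same position.
idx = pd.Index(["0", "1"], dtype=object)
loc_plain = idx.get_loc(plain)
loc_np = idx.get_loc(np_scalar)

# unique() also deduplicates by hash and equality, so a mix of both
# scalar types collapses to a single value.
mixed = pd.Series(["0", np.str_("0")], dtype=object)
n_unique = len(mixed.unique())
```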

jbrockmendel · 2020-02-03T20:15:33Z

are you aware of any maintenance burdens or ambiguities this might cause?

Not especially. If we want to root np.str_ out entirely, that would be a pretty big endeavor.

TomAugspurger · 2020-02-04T16:23:43Z

OK. Naively, I agree that rooting out np.str_ objects in our constructors sounds difficult. At the least, it would require a scan over the values and a couple isinstance checks on each value, which I'd like to avoid if it's not causing problems elsewhere. I'm sure we'll get reports if it is.
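To make the cost concrete, here is a rough sketch of the kind of per-value scan being discussed. `sanitize_np_strs` is a hypothetical helper written for this illustration; it is not a pandas function and not what this PR implements.

```python
import numpy as np


def sanitize_np_strs(values):
    """Hypothetical helper: replace np.str_ scalars with builtin str.

    Every element is visited once and pays an isinstance check,
    which is the overhead described in the comment above.
    """
    cleaned = []
    for val in values:
        # np.str_ subclasses str, so test for the numpy type specifically.
        if isinstance(val, np.str_):
            val = str(val)  # str() on an np.str_ yields a builtin str
        cleaned.append(val)
    return cleaned


cleaned = sanitize_np_strs([np.str_("0"), "1", 2])
cleaned_types = [type(v) for v in cleaned]
```

The scan is linear in the number of values, so on large object arrays it would add work to every construction, even when no np.str_ objects are present.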

TomAugspurger · 2020-02-04T16:24:04Z

Thanks @jbrockmendel.

jbrockmendel added 2 commits January 31, 2020 19:38

REGR: Categorical with np.str_ categories

d096fd5

whatsnew

d106d29

TomAugspurger approved these changes Feb 1, 2020

doc/source/whatsnew/v1.0.1.rst

jreback added this to the 1.0.1 milestone Feb 1, 2020

jreback added Categorical Categorical Data Type Dtype Conversions Unexpected or buggy dtype conversions labels Feb 1, 2020

jreback approved these changes Feb 1, 2020

jreback requested changes Feb 1, 2020

Update doc/source/whatsnew/v1.0.1.rst

b8b8e7c

Co-Authored-By: Tom Augspurger <[email protected]>

jreback requested changes Feb 1, 2020

jbrockmendel added 3 commits February 3, 2020 08:41

Merge branch 'master' of https://github.com/pandas-dev/pandas into npstr

da6acc3

rebase fixup

70553d8

Merge branch 'npstr' of github.com:jbrockmendel/pandas into npstr

8238cd8

move whatsnew

e78ad5f

jorisvandenbossche approved these changes Feb 4, 2020

TomAugspurger merged commit 01582c4 into pandas-dev:master Feb 4, 2020

meeseeksmachine mentioned this pull request Feb 4, 2020

Backport PR #31528 on branch 1.0.x (REGR: Categorical with np.str_ categories) #31654

Merged

meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Feb 4, 2020

Backport PR pandas-dev#31528: REGR: Categorical with np.str_ categories

bd6780b

jbrockmendel deleted the npstr branch February 4, 2020 16:31

TomAugspurger pushed a commit that referenced this pull request Feb 4, 2020

Backport PR #31528: REGR: Categorical with np.str_ categories (#31654)

4923fd3

Co-authored-by: jbrockmendel <[email protected]>
