Skip to content

REGR: Categorical with np.str_ categories #31528

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Feb 4, 2020

Conversation

jbrockmendel
Copy link
Member

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Is this something we want to support long-term?

@jreback jreback added this to the 1.0.1 milestone Feb 1, 2020
@jreback jreback added Categorical Categorical Data Type Dtype Conversions Unexpected or buggy dtype conversions labels Feb 1, 2020
@@ -408,6 +408,11 @@ def test_constructor_str_unknown(self):
with pytest.raises(ValueError, match="Unknown dtype"):
Categorical([1, 2], dtype="foo")

def test_constructor_np_strs(self):
# GH#31499 Hastable.map_locations needs to work on np.str_ objects
cat = pd.Categorical(["1", "0", "1"], [np.str_("0"), np.str_("1")])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, I think we should be santizing these inputs on Index construction; we already do this for Series IIRC (deep in block manager i think). Can you see if that's better (as its a more maintainable long term soln). Not averse to your change, but that's after we already saved the inputs.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i would prefer to actually fix this rather than work around np.str_ which is not used anywhere else.

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Feb 3, 2020

What's the proposed fix? Convert these to regular strings early on?

IMO, we should deprecate the old behavior first if it isn't too costly. This PR doesn't seem too bad, though I don't really understand all its implications.

@jorisvandenbossche
Copy link
Member

I think we should be santizing these inputs on Index construction; we already do this for Series IIRC

I thought as well, but that doesn't seem to be the case:

In [1]: s = pd.Series([np.str_("0"), np.str_("1")])  

In [2]: s[0]        
Out[2]: '0'

In [3]: type(s[0])  
Out[3]: numpy.str_

I would keep a possible sanitation for a separate issue / discussion. This PRs seems an easy fix to address the regression.

@jreback
Copy link
Contributor

jreback commented Feb 3, 2020

this is just an unsupportable bandaid. this is handling a leak of np.str_ into the internals, which is really bad. I don't think its worth trying to fix this for 1.0.1 like this, rather address a systematic real fix.

The reason this is unsupportable is that now this hides this issue in one particular place, rather than actually handling it (by converting np.str_ to str on construction).

@jorisvandenbossche
Copy link
Member

this is just an unsupportable bandaid. this is handling a leak of np.str_ into the internals, which is really bad.

It is what we did before for years. So I don't think it is that unsupportable.

Since we can still store numpy strings in a Series, and since we supported converting those to a Categorical before, I think this is a good fix.

For 1.1, we can discuss further if we want to keep supporting this, or want to deprecate it, or want to sanitize on input (to avoid needing to support it).

@TomAugspurger
Copy link
Contributor

For 1.1, we can discuss further if we want to keep supporting this, or want to deprecate it, or want to sanitize on input (to avoid needing to support it).

Agreed.

@jreback
Copy link
Contributor

jreback commented Feb 3, 2020

these last minute patches are just causing more and more issues. do what you will this.

@TomAugspurger
Copy link
Contributor

@jbrockmendel are you aware of any maintenance burdens or ambiguities this might cause?

At least on 0.25.3, we seem to treat str and np.str_ equivalently in methods like get_loc and unique

@jbrockmendel
Copy link
Member Author

are you aware of any maintenance burdens or ambiguities this might cause?

Not especially. If we want to root np.str_ out entirely, that would be a pretty big endeavor.

@TomAugspurger
Copy link
Contributor

OK. Naively, I agree that rooting out np.str_ objects in our constructors sounds difficult. At the least, it would require a scan over the values and a couple isinstance checks on each value, which I'd like to avoid if it's not causing problems elsewhere. I'm sure we'll get reports if it is.

@TomAugspurger TomAugspurger merged commit 01582c4 into pandas-dev:master Feb 4, 2020
@TomAugspurger
Copy link
Contributor

Thanks @jbrockmendel.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type Dtype Conversions Unexpected or buggy dtype conversions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pandas 1.0 no longer handles numpy.str_s as catgories
4 participants