-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
REGR: Categorical with np.str_ categories #31528
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Is this something we want to support long-term?
@@ -408,6 +408,11 @@ def test_constructor_str_unknown(self): | |||
with pytest.raises(ValueError, match="Unknown dtype"): | |||
Categorical([1, 2], dtype="foo") | |||
|
|||
def test_constructor_np_strs(self): | |||
# GH#31499 Hastable.map_locations needs to work on np.str_ objects | |||
cat = pd.Categorical(["1", "0", "1"], [np.str_("0"), np.str_("1")]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually, I think we should be santizing these inputs on Index construction; we already do this for Series IIRC (deep in block manager i think). Can you see if that's better (as its a more maintainable long term soln). Not averse to your change, but that's after we already saved the inputs.
Co-Authored-By: Tom Augspurger <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i would prefer to actually fix this rather than work around np.str_ which is not used anywhere else.
What's the proposed fix? Convert these to regular strings early on? IMO, we should deprecate the old behavior first if it isn't too costly. This PR doesn't seem too bad, though I don't really understand all its implications. |
I thought as well, but that doesn't seem to be the case:
I would keep a possible sanitation for a separate issue / discussion. This PRs seems an easy fix to address the regression. |
this is just an unsupportable bandaid. this is handling a leak of np.str_ into the internals, which is really bad. I don't think its worth trying to fix this for 1.0.1 like this, rather address a systematic real fix. The reason this is unsupportable is that now this hides this issue in one particular place, rather than actually handling it (by converting np.str_ to str on construction). |
It is what we did before for years. So I don't think it is that unsupportable. Since we can still store numpy strings in a Series, and since we supported converting those to a Categorical before, I think this is a good fix. For 1.1, we can discuss further if we want to keep supporting this, or want to deprecate it, or want to sanitize on input (to avoid needing to support it). |
Agreed. |
these last minute patches are just causing more and more issues. do what you will this. |
@jbrockmendel are you aware of any maintenance burdens or ambiguities this might cause? At least on 0.25.3, we seem to treat |
Not especially. If we want to root np.str_ out entirely, that would be a pretty big endeavor. |
OK. Naively, I agree that rooting out |
Thanks @jbrockmendel. |
Co-authored-by: jbrockmendel <[email protected]>
numpy.str_
s as catgories #31499black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff