Skip to content

BUG: Categorical.remove_categories(np.nan) fails when underlying dtype is float #10304

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jun 9, 2015

Conversation

evanpw
Copy link
Contributor

@evanpw evanpw commented Jun 6, 2015

Fixes GH #10156. This also makes different null values indistinguishable inside of remove_categories, but they're already indistinguishable in most other contexts:

>>> pd.Categorical([], categories=[np.nan, None])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pandas/core/categorical.py", line 289, in __init__
    categories = self._validate_categories(categories)
  File "pandas/core/categorical.py", line 447, in _validate_categories
    raise ValueError('Categorical categories must be unique')
ValueError: Categorical categories must be unique

@jreback jreback added Bug Categorical Categorical Data Type labels Jun 7, 2015
@jreback jreback added this to the 0.16.2 milestone Jun 7, 2015
result = result.remove_categories(np.nan)
expected = Categorical([], categories=[1.0, 2.0])
self.assert_categorical_equal(result, expected)

def test_remove_unused_categories(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a test for this as well >>> pd.Categorical([], categories=[np.nan, None])
also add in using None in the .remove_categories. Pls test on an object dtype as well as floating-point.
Bonus points if you can make this work for datetimelike (e.g. using pd.NaT.

@jreback
Copy link
Contributor

jreback commented Jun 9, 2015

can you update

@evanpw
Copy link
Contributor Author

evanpw commented Jun 9, 2015

I think these tests are what you were asking for. Let me know if otherwise.

@jreback
Copy link
Contributor

jreback commented Jun 9, 2015

yep, ping when green.

@evanpw
Copy link
Contributor Author

evanpw commented Jun 9, 2015

It turns out that this already works for datetimelike categoricals. I added some tests and squashed.

jreback added a commit that referenced this pull request Jun 9, 2015
BUG: Categorical.remove_categories(np.nan) fails when underlying dtype is float
@jreback jreback merged commit 2619889 into pandas-dev:master Jun 9, 2015
@jreback
Copy link
Contributor

jreback commented Jun 9, 2015

awesome @evanpw ! keep em coming

@wikiped
Copy link

wikiped commented Jun 10, 2015

Thanks very much for fixing this. I am not sure if I got this right but would not it make sense to reorder the code slightly to avoid trying removing categories twice when nan is present? So use this code:

    # GH 10156
    if any(isnull(removals)):
        not_included = [x for x in not_included if notnull(x)]
        new_categories = [x for x in new_categories if notnull(x)]
    else:
        removal_set = set(list(removals))
        not_included = removal_set - set(self._categories)
        new_categories = [ c for c in self._categories if c not in removal_set ]

Instead of this:

    removal_set = set(list(removals))
    not_included = removal_set - set(self._categories)
    new_categories = [ c for c in self._categories if c not in removal_set ]

    # GH 10156
    if any(isnull(removals)):
        not_included = [x for x in not_included if notnull(x)]
        new_categories = [x for x in new_categories if notnull(x)]

@evanpw
Copy link
Contributor Author

evanpw commented Jun 10, 2015

In your if branch, what's the original definition of not_included and new_categories? It looks like they're defined in terms of themselves.

@wikiped
Copy link

wikiped commented Jun 10, 2015

Fair point ;)
Thanks again for the fix.

@evanpw evanpw deleted the remove_cat_nan branch June 10, 2015 13:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Categorical Categorical Data Type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants