BUG: Categorical.remove_categories(np.nan) fails when underlying dtype is float #10304

evanpw · 2015-06-06T18:53:16Z

Fixes GH #10156. This also makes different null values indistinguishable inside of remove_categories, but they're already indistinguishable in most other contexts:

>>> pd.Categorical([], categories=[np.nan, None])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pandas/core/categorical.py", line 289, in __init__
    categories = self._validate_categories(categories)
  File "pandas/core/categorical.py", line 447, in _validate_categories
    raise ValueError('Categorical categories must be unique')
ValueError: Categorical categories must be unique

jreback · 2015-06-07T21:21:57Z

pandas/tests/test_categorical.py

+        result = result.remove_categories(np.nan)
+        expected = Categorical([], categories=[1.0, 2.0])
+        self.assert_categorical_equal(result, expected)
+
    def test_remove_unused_categories(self):


Can you add a test for this as well: >>> pd.Categorical([], categories=[np.nan, None])
Also add a test using None in .remove_categories. Please test on an object dtype as well as floating-point.
Bonus points if you can make this work for datetimelike (e.g. using pd.NaT).

jreback · 2015-06-09T10:49:48Z

can you update

evanpw · 2015-06-09T15:43:43Z

I think these tests are what you were asking for. Let me know if otherwise.

jreback · 2015-06-09T16:00:11Z

yep, ping when green.

evanpw · 2015-06-09T22:45:31Z

It turns out that this already works for datetimelike categoricals. I added some tests and squashed.

jreback · 2015-06-09T23:46:07Z

awesome @evanpw ! keep em coming

wikiped · 2015-06-10T03:33:02Z

Thanks very much for fixing this. I am not sure I got this right, but wouldn't it make sense to reorder the code slightly to avoid trying to remove categories twice when nan is present? So use this code:

    # GH 10156
    if any(isnull(removals)):
        not_included = [x for x in not_included if notnull(x)]
        new_categories = [x for x in new_categories if notnull(x)]
    else:
        removal_set = set(list(removals))
        not_included = removal_set - set(self._categories)
        new_categories = [ c for c in self._categories if c not in removal_set ]

Instead of this:

    removal_set = set(list(removals))
    not_included = removal_set - set(self._categories)
    new_categories = [ c for c in self._categories if c not in removal_set ]

    # GH 10156
    if any(isnull(removals)):
        not_included = [x for x in not_included if notnull(x)]
        new_categories = [x for x in new_categories if notnull(x)]
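
For readers following along outside the pandas source, here is a minimal standalone sketch of the merged ordering shown just above. It is an illustration under stated assumptions, not the pandas implementation: `is_null` and `remove_categories_sketch` are hypothetical names, `is_null` only approximates `isnull` for scalars (None and float NaN), and the final `ValueError` and its message are assumed rather than quoted from this thread.

```python
import math


def is_null(x):
    # Hypothetical stand-in for pandas' isnull on scalars: None or a float NaN.
    return x is None or (isinstance(x, float) and math.isnan(x))


def remove_categories_sketch(categories, removals):
    # Same order as the merged snippet: set-based removal first.
    removal_set = set(removals)
    not_included = removal_set - set(categories)
    new_categories = [c for c in categories if c not in removal_set]

    # GH 10156: NaN != NaN, so a NaN in removals may fail to match a NaN in
    # categories. If any removal is null, drop every null from both results.
    if any(is_null(r) for r in removals):
        not_included = [x for x in not_included if not is_null(x)]
        new_categories = [x for x in new_categories if not is_null(x)]

    # Assumed error handling for removals that are not existing categories.
    if len(not_included) != 0:
        raise ValueError("removals must all be in old categories: %s"
                         % str(not_included))
    return new_categories


cats = [1.0, 2.0, float("nan")]
print(remove_categories_sketch(cats, [float("nan")]))
print(remove_categories_sketch(cats, [None]))
```

Because the null filter runs after the set-based step, both names already exist when it is applied, and any null removal (NaN or None) clears all null categories.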

evanpw · 2015-06-10T12:46:59Z

In your if branch, what's the original definition of not_included and new_categories? It looks like they're defined in terms of themselves.
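
A small sketch of the problem with the reordered version, assuming the same hypothetical `is_null` helper in place of `isnull`/`notnull` and a hypothetical function name `reordered_sketch`. In the null branch, `not_included` is read before anything has been assigned to it, so Python raises `UnboundLocalError`:

```python
import math


def is_null(x):
    # Hypothetical stand-in for pandas' isnull on scalars: None or a float NaN.
    return x is None or (isinstance(x, float) and math.isnan(x))


def reordered_sketch(categories, removals):
    if any(is_null(r) for r in removals):
        # not_included and new_categories have no value yet on this path.
        not_included = [x for x in not_included if not is_null(x)]
        new_categories = [x for x in new_categories if not is_null(x)]
    else:
        removal_set = set(removals)
        not_included = removal_set - set(categories)
        new_categories = [c for c in categories if c not in removal_set]
    return new_categories


try:
    reordered_sketch([1.0, 2.0, float("nan")], [float("nan")])
except UnboundLocalError as exc:
    print("fails in the null branch:", exc)
```

The non-null branch still works, which is why the set-based step has to run first and the null filter second.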

wikiped · 2015-06-10T13:15:28Z

Fair point ;)
Thanks again for the fix.

jreback added the Bug and Categorical (Categorical Data Type) labels Jun 7, 2015

jreback added this to the 0.16.2 milestone Jun 7, 2015

jreback reviewed Jun 7, 2015
BUG: Categorical.remove_categories(np.nan) fails when underlying dtype is float (GH pandas-dev#10156)

e462c34

evanpw force-pushed the remove_cat_nan branch from 2499704 to e462c34 Compare June 9, 2015 19:10

jreback added a commit that referenced this pull request Jun 9, 2015

Merge pull request #10304 from evanpw/remove_cat_nan

2619889

BUG: Categorical.remove_categories(np.nan) fails when underlying dtype is float

jreback merged commit 2619889 into pandas-dev:master Jun 9, 2015

jreback mentioned this pull request Jun 9, 2015

.remove_category(np.nan) fails on Categorical with floats #10156

Closed

evanpw deleted the remove_cat_nan branch June 10, 2015 13:40
