DTYPE: use a Categorical for bool dtypes #15888

jreback · 2017-04-04T13:44:45Z

This may seem counterintuitive, but it would allow us to store nulls as well (efficiently, rather than falling back to object dtype or casting to floats).

I am not sure if this would really be possible w/o some API breaks, so will have a look.
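
To make the trade-off concrete, here is a minimal sketch of the three storage options mentioned above. The variable names are only illustrative, and the exact dtype of the categories Index may differ between pandas versions.

```python
import numpy as np
import pandas as pd

# A plain bool column with a missing value falls back to object dtype.
as_object = pd.Series([True, None, False])

# Casting to float keeps the NaN but loses the bool dtype.
as_float = pd.Series([True, np.nan, False], dtype="float64")

# A Categorical stores small integer codes, with -1 marking the missing entry.
as_cat = pd.Categorical([True, None, False])

print(as_object.dtype)           # object
print(as_float.tolist())         # [1.0, nan, 0.0]
print(list(as_cat.categories))   # [False, True]
print(as_cat.codes.tolist())     # [1, -1, 0]
print(as_cat.codes.dtype)        # int8
```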

chris-b1 · 2017-04-04T13:52:22Z

Nice idea, xref #15751 - I was sort of backing into the same concept

jorisvandenbossche · 2017-04-04T13:53:07Z

Not sure that will work with boolean indexing

jreback · 2017-04-04T14:00:14Z

It works, but would need some internal modifications to make this compat

In [9]: df = pd.DataFrame({'A':[1,2,3],'B':pd.Categorical([True, None, False])})

In [10]: df
Out[10]: 
   A      B
0  1   True
1  2    NaN
2  3  False

In [11]: df[np.array(df.B.fillna(False))]
Out[11]: 
   A     B
0  1  True

In [12]: df[df.B]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-12-1757d22fed69> in <module>()
----> 1 df[df.B]

/Users/jreback/pandas/pandas/core/frame.py in __getitem__(self, key)
   2061         if isinstance(key, (Series, np.ndarray, Index, list)):
   2062             # either boolean or fancy integer index
-> 2063             return self._getitem_array(key)
   2064         elif isinstance(key, DataFrame):
   2065             return self._getitem_frame(key)

/Users/jreback/pandas/pandas/core/frame.py in _getitem_array(self, key)
   2105             return self.take(indexer, axis=0, convert=False)
   2106         else:
-> 2107             indexer = self.loc._convert_to_indexer(key, axis=1)
   2108             return self.take(indexer, axis=1, convert=True)
   2109 

/Users/jreback/pandas/pandas/core/indexing.py in _convert_to_indexer(self, obj, axis, is_setter)
   1228                 mask = check == -1
   1229                 if mask.any():
-> 1230                     raise KeyError('%s not in index' % objarr[mask])
   1231 
   1232                 return _values_from_object(indexer)

KeyError: '[True nan False] not in index'

jorisvandenbossche · 2017-04-04T14:05:24Z

I think you are using the wrong indexer (df[df.B] tries to select columns). Eg

In [26]: df = pd.DataFrame({'A':[1,2,3],'B':pd.Categorical([True, None, False])})

In [27]: df.A[df.B]
Out[27]: 
B
True     2.0
NaN      NaN
False    1.0
Name: A, dtype: float64

does not work as you want (this should actually raise I think because you shouldn't do boolean indexing with NaN values)

jorisvandenbossche · 2017-04-04T14:10:23Z

Sorry, I was confused myself, of course something like df[df.B] should do boolean indexing if column B is of bool dtype. But the example above is still something that does not work (but can be solved in our indexing code of course).
But a problem is eg that (currently at least) np.asarray(boolean_cat) gives you an object array instead of boolean array, and numpy complains about that in certain cases.
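
A possible workaround for the object-array issue, sketched here under the assumption that missing entries should count as False: build the boolean mask explicitly instead of relying on np.asarray. The names `values`, `missing` and `mask` are just for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": pd.Categorical([True, None, False])})

cat = df["B"].values        # the underlying Categorical
values = np.asarray(cat)    # may come back as object dtype, not bool
missing = cat.isna()        # True where the code is -1

# Build an explicit boolean mask, treating missing entries as False.
mask = np.zeros(len(cat), dtype=bool)
mask[~missing] = values[~missing].astype(bool)

print(mask.tolist())            # [True, False, False]
print(df[mask]["A"].tolist())   # [1]
```

Whether missing values should silently become False or raise instead is a design choice; the sketch picks the former only to match the fillna(False) example earlier in the thread.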

jreback · 2017-04-04T14:10:43Z

@jorisvandenbossche yeah not saying this should work when there are NaN's embedded anyhow :)

but yes this works now

In [17]: df = pd.DataFrame({'A':[1,2],'B':pd.Categorical([True, False])})

In [18]: df
Out[18]: 
   A      B
0  1   True
1  2  False

In [19]: df.loc[df.B]
Out[19]: 
       A      B
True   2  False
False  1   True

In [20]: df.A[df.B]
Out[20]: 
B
True     2
False    1
Name: A, dtype: int64

Though I think we should actually have a BooleanIndex first :>

In [30]: df.B.cat.categories
Out[30]: Index([False, True], dtype='object')

jreback · 2017-04-04T14:11:22Z

> But a problem is eg that (currently at least) np.asarray(boolean_cat) gives you an object array instead of boolean array, and numpy complains about that in certain cases.

This is related to not having a BooleanIndex

jorisvandenbossche · 2020-01-28T09:48:56Z

Given that we now have a BooleanDtype that supports missing values (although with another approach of an additional mask instead of -1 for the missing values), I don't think this is still something we want to pursue?
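
For reference, a small sketch of the same data with the nullable "boolean" dtype (this assumes pandas 1.0 or later; the frame and column names mirror the earlier examples):

```python
import pandas as pd

arr = pd.array([True, None, False], dtype="boolean")
df = pd.DataFrame({"A": [1, 2, 3], "B": arr})

print(df["B"].dtype)          # boolean
print(arr.isna().tolist())    # [False, True, False]

# Fill the missing entry explicitly before using the column as a row mask.
selected = df[df["B"].fillna(False)]
print(selected["A"].tolist())  # [1]
```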

jreback · 2020-01-28T10:01:58Z

yeah this is not necessary anymore, though it’s slightly more efficient in memory compared to the current (and future implementation with a bit mask)

but not worth the effort
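
A rough back-of-the-envelope comparison of the memory claim, using plain NumPy arrays as stand-ins for the three layouts. This is only a sketch: it ignores the (small, constant) cost of the categories themselves and any per-object overhead.

```python
import numpy as np

n = 1_000_000

# Categorical-style storage: one int8 code per element (-1 = missing).
codes = np.zeros(n, dtype=np.int8)

# Mask-based storage: one bool for the value plus one bool for the mask.
data = np.zeros(n, dtype=bool)
byte_mask = np.zeros(n, dtype=bool)

# Bit mask: the same flags packed eight to a byte.
bit_mask = np.packbits(byte_mask)

print(codes.nbytes)                     # 1000000
print(data.nbytes + byte_mask.nbytes)   # 2000000
print(data.nbytes + bit_mask.nbytes)    # 1125000
```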

jreback added the API Design and Dtype Conversions (unexpected or buggy dtype conversions) labels Apr 4, 2017

jreback self-assigned this Apr 4, 2017

jreback mentioned this issue Apr 4, 2017

API: support a BooleanIndex #15890

Closed

jreback closed this as completed Jan 28, 2020
