DTYPE: use a Categorical for bool dtypes #15888

Closed
jreback opened this issue Apr 4, 2017 · 9 comments
Assignees
Labels
API Design Dtype Conversions Unexpected or buggy dtype conversions

Comments

@jreback
Contributor

jreback commented Apr 4, 2017

This may seem counterintuitive, but it would allow us to store nulls as well, and to do so efficiently rather than falling back to object dtype or casting to floats.

I am not sure this would really be possible w/o some API breaks, so I will have a look.
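
To make the storage idea concrete, here is a minimal sketch (not taken from the issue itself) of how a Categorical built from booleans plus a missing value is laid out. It only inspects `categories` and `codes`; the variable names are illustrative.

```python
import pandas as pd

# A categorical built from booleans plus one missing value.
cat = pd.Categorical([True, None, False])

# The categories hold only the distinct non-missing values.
categories = list(cat.categories)

# Each element is stored as a small integer code pointing into the
# categories; a missing entry is stored as the sentinel code -1.
codes = cat.codes.tolist()
code_dtype = str(cat.codes.dtype)

print(categories)
print(codes)
print(code_dtype)
```

The missing value costs no extra storage here: it is just another value of the one-byte code, which is the efficiency argument made above.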

@jreback jreback added API Design Dtype Conversions Unexpected or buggy dtype conversions labels Apr 4, 2017
@jreback jreback self-assigned this Apr 4, 2017
@chris-b1
Contributor

chris-b1 commented Apr 4, 2017

Nice idea, xref #15751 - I was sort of backing into the same concept

@jorisvandenbossche
Member

Not sure that will work with boolean indexing

@jreback
Contributor Author

jreback commented Apr 4, 2017

It works, but it would need some internal modifications to make this compatible:

In [9]: df = pd.DataFrame({'A':[1,2,3],'B':pd.Categorical([True, None, False])})

In [10]: df
Out[10]: 
   A      B
0  1   True
1  2    NaN
2  3  False

In [11]: df[np.array(df.B.fillna(False))]
Out[11]: 
   A     B
0  1  True

In [12]: df[df.B]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-12-1757d22fed69> in <module>()
----> 1 df[df.B]

/Users/jreback/pandas/pandas/core/frame.py in __getitem__(self, key)
   2061         if isinstance(key, (Series, np.ndarray, Index, list)):
   2062             # either boolean or fancy integer index
-> 2063             return self._getitem_array(key)
   2064         elif isinstance(key, DataFrame):
   2065             return self._getitem_frame(key)

/Users/jreback/pandas/pandas/core/frame.py in _getitem_array(self, key)
   2105             return self.take(indexer, axis=0, convert=False)
   2106         else:
-> 2107             indexer = self.loc._convert_to_indexer(key, axis=1)
   2108             return self.take(indexer, axis=1, convert=True)
   2109 

/Users/jreback/pandas/pandas/core/indexing.py in _convert_to_indexer(self, obj, axis, is_setter)
   1228                 mask = check == -1
   1229                 if mask.any():
-> 1230                     raise KeyError('%s not in index' % objarr[mask])
   1231 
   1232                 return _values_from_object(indexer)

KeyError: '[True nan False] not in index'

@jorisvandenbossche
Member

I think you are using the wrong indexer (df[df.B] tries to select columns). For example:

In [26]: df = pd.DataFrame({'A':[1,2,3],'B':pd.Categorical([True, None, False])})

In [27]: df.A[df.B]
Out[27]: 
B
True     2.0
NaN      NaN
False    1.0
Name: A, dtype: float64

This does not work as you want (I think it should actually raise, because you shouldn't do boolean indexing with NaN values).

@jorisvandenbossche
Member

Sorry, I was confused myself. Of course something like df[df.B] should do boolean indexing if column B is of bool dtype. But the example above is still something that does not work (although it can be solved in our indexing code, of course).
Another problem is that (currently at least) np.asarray(boolean_cat) gives you an object array instead of a boolean array, and numpy complains about that in certain cases.
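
One way to sidestep the array-dtype question, sketched here as an illustration rather than as a proposed fix, is to build the mask explicitly: fill the missing entries and request a boolean array. Treating missing values as False is an assumption of this sketch, mirroring the `fillna(False)` workaround shown earlier in the thread.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": pd.Categorical([True, None, False])})

# Fill missing entries with False (an assumption of this sketch), then
# ask NumPy for a real boolean array instead of relying on the default
# conversion of the categorical.
mask = np.asarray(df["B"].fillna(False), dtype=bool)

# A plain boolean ndarray selects rows positionally.
selected = df[mask]
print(mask)
print(selected)
```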

@jreback
Contributor Author

jreback commented Apr 4, 2017

@jorisvandenbossche yeah, I'm not saying this should work when there are NaNs embedded anyhow :)

but yes, this works now:

In [17]: df = pd.DataFrame({'A':[1,2],'B':pd.Categorical([True, False])})

In [18]: df
Out[18]: 
   A      B
0  1   True
1  2  False

In [19]: df.loc[df.B]
Out[19]: 
       A      B
True   2  False
False  1   True

In [20]: df.A[df.B]
Out[20]: 
B
True     2
False    1
Name: A, dtype: int64

Though I think we should actually have a BooleanIndex first :>

In [30]: df.B.cat.categories
Out[30]: Index([False, True], dtype='object')

@jreback
Contributor Author

jreback commented Apr 4, 2017

But a problem is eg that (currently at least) np.asarray(boolean_cat) gives you an object array instead of boolean array, and numpy complains about that in certain cases.

This is related to not having a BooleanIndex.

@jorisvandenbossche
Member

Given that we now have a BooleanDtype that supports missing values (although with another approach of an additional mask instead of -1 for the missing values), I don't think this is still something we want to pursue?
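
For readers comparing the two approaches, a minimal sketch of the nullable boolean dtype mentioned here, using the same data as the earlier examples. Filling missing values with False before indexing is a choice made for this sketch, since how missing values in a boolean mask are handled has differed between pandas versions.

```python
import pandas as pd

# The nullable boolean extension dtype keeps missing values without
# falling back to object dtype or floats.
arr = pd.array([True, None, False], dtype="boolean")
df = pd.DataFrame({"A": [1, 2, 3], "B": arr})

# Missingness is tracked separately from the True/False values.
na_flags = df["B"].isna().tolist()

# Resolve missing entries explicitly before using the column as a mask.
mask = df["B"].fillna(False)
selected = df[mask]
print(df.dtypes)
print(selected)
```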

@jreback
Contributor Author

jreback commented Jan 28, 2020

yeah, this is not necessary anymore, though it’s slightly more efficient in memory compared to the current implementation (and to a future implementation with a bit mask)

but not worth the effort
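
The memory remark can be checked roughly with the sketch below. It compares only the per-element buffers and assumes that `nbytes` on the masked array counts both the values buffer and the byte mask, as it does in current pandas; the small fixed cost of the Categorical's categories is ignored.

```python
import pandas as pd

values = [True, None, False] * 1000
n = len(values)

cat = pd.Categorical(values)
masked = pd.array(values, dtype="boolean")

# Categorical: one int8 code per element, with -1 marking missing.
cat_code_bytes = cat.codes.nbytes

# Masked boolean array: one byte of data plus one byte of mask per element.
masked_bytes = masked.nbytes

print(n, cat_code_bytes, masked_bytes)
```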

@jreback jreback closed this as completed Jan 28, 2020
3 participants