Skip to content

EA: BoolArray #25415

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from
Closed

EA: BoolArray #25415

wants to merge 2 commits into from

Conversation

jreback
Copy link
Contributor

@jreback jreback commented Feb 22, 2019

this builds on #22226 and #22238 by @WillAyd and the ArrowBoolArray by @TomAugspurger to create a first class bool array that is backed by a numpy EA array or a pyarrow array if available. Uses this MaskArray as the mask inside IntegerArray. The is a POC to see how feasible this is generally. Note that I did this with bitarray as well, but this is more general purpose / forward looking.

Some decent memory savings as the bits are packed by arrow.

PR:

In [2]: s = pd.Series(np.arange(100000), dtype='Int64')                                                                                                                                                                                 

In [3]: s.nbytes                                                                                                                                                                                                                        
Out[3]: 812500

In [4]: s._values._mask.nbytes                                                                                                                                                                                                          
Out[4]: 12500

0.24.1 / master

In [2]: s = pd.Series(np.arange(100000), dtype='Int64')                                                                                                                                                                                 

In [3]: s.nbytes                                                                                                                                                                                                                        
Out[3]: 900000

In [4]: s._values._mask.nbytes                                                                                                                                                                                                          
Out[4]: 100000

TODO:

  • pyarrow uses pandas as a dependency, so we have a hoop to jump to actually have this import in a sane way
  • we need a pd.bool_ dtype to avoid exposing the pyarrow dtypes directly (mainly because we have this as an optional dep and now BoolArray can either be np.bool_ or arrow_bool so need a single type to expose to end users; this is not hard just not done).
  • The BoolArray allows NaN's when its an ArrowBoolArray but not for a NumpyBoolArray; I avoided this entirely but just using this as a masking array (so its guaranteed as only bool types), but eventually want to allow this
  • A number of missing methods on ArrowBoolArray that we should push upstream (implemented now by simply coercing and using numpy)
  • Should just rename MaskArray to BoolArray I think. This is now internal only, BoolArray will use this.

cc @pandas-dev/pandas-core

@jreback jreback added Dtype Conversions Unexpected or buggy dtype conversions ExtensionArray Extending pandas with custom dtypes or arrays. labels Feb 22, 2019
@pep8speaks
Copy link

pep8speaks commented Feb 22, 2019

Hello @jreback! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-04-21 17:30:58 UTC

@codecov
Copy link

codecov bot commented Feb 22, 2019

Codecov Report

Merging #25415 into master will decrease coverage by 50.08%.
The diff coverage is 45.73%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master   #25415       +/-   ##
===========================================
- Coverage   91.73%   41.64%   -50.09%     
===========================================
  Files         173      175        +2     
  Lines       52845    53041      +196     
===========================================
- Hits        48479    22091    -26388     
- Misses       4366    30950    +26584
Flag Coverage Δ
#multiple ?
#single 41.64% <45.73%> (-0.09%) ⬇️
Impacted Files Coverage Δ
pandas/core/common.py 67.2% <100%> (-31.19%) ⬇️
pandas/core/arrays/integer.py 41.33% <100%> (-54.99%) ⬇️
pandas/core/dtypes/generic.py 100% <100%> (ø) ⬆️
pandas/core/arrays/bool.py 10.25% <15.9%> (ø)
pandas/core/arrays/_mask.py 57.33% <57.33%> (ø)
pandas/core/indexing.py 52.34% <66.66%> (-39.95%) ⬇️
pandas/io/formats/latex.py 0% <0%> (-100%) ⬇️
pandas/core/categorical.py 0% <0%> (-100%) ⬇️
pandas/io/sas/sas_constants.py 0% <0%> (-100%) ⬇️
pandas/tseries/plotting.py 0% <0%> (-100%) ⬇️
... and 138 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b673188...d539290. Read the comment docs.

@codecov
Copy link

codecov bot commented Feb 22, 2019

Codecov Report

Merging #25415 into master will decrease coverage by 51.26%.
The diff coverage is 44.39%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master   #25415       +/-   ##
===========================================
- Coverage   91.98%   40.72%   -51.27%     
===========================================
  Files         175      179        +4     
  Lines       52371    52764      +393     
===========================================
- Hits        48175    21488    -26687     
- Misses       4196    31276    +27080
Flag Coverage Δ
#multiple ?
#single 40.72% <44.39%> (-0.14%) ⬇️
Impacted Files Coverage Δ
pandas/core/groupby/groupby.py 23.9% <0%> (-73.33%) ⬇️
pandas/core/common.py 67.74% <100%> (-30.65%) ⬇️
pandas/core/nanops.py 31.52% <100%> (-62.31%) ⬇️
pandas/core/dtypes/generic.py 100% <100%> (ø) ⬆️
pandas/core/internals/blocks.py 51.06% <16.66%> (-43.03%) ⬇️
pandas/core/arrays/mask/_pyarrow.py 21.15% <21.15%> (ø)
pandas/core/arrays/mask/_base.py 44.18% <44.18%> (ø)
pandas/core/arrays/integer.py 43.93% <50%> (-52.42%) ⬇️
pandas/core/arrays/mask/_numpy.py 55.81% <55.81%> (ø)
pandas/core/indexing.py 52.25% <66.66%> (-38.64%) ⬇️
... and 150 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 61edc76...d030c57. Read the comment docs.

@jreback
Copy link
Contributor Author

jreback commented Feb 22, 2019

so this could be simplfiied if we make pyarrow a hard dependency :-D, then we don't need the numpy compat layer.


class BoolArray(ExtensionArray):
"""Common baseclass for both pyarrow and numpy masked arrays"""
_typ = "maskarray"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this make isinstance(BoolArray(), ABCExtensionArray) false?

return len(self)

def __eq__(self, other):
return np.array(self, copy=False) == np.array(other, copy=False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be good to keep a list of things requiring a cast to NumPy at the top of this file.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And perhaps make JIRAs for them as well.

This, isnull(), unary / binops, reductions.

@jreback jreback force-pushed the mask branch 2 times, most recently from a1e8d74 to b193658 Compare February 24, 2019 03:14
Copy link
Contributor

@h-vetinari h-vetinari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have much to say on the implementation, but noticed a confusing fixture name...

return hash(self) == hash(other)


class MaskArray(ExtensionArray):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More of a question than anything else but is there a reason for defining a lot of the methods in this base class rather than in subclasses? Not terribly familiar with pyarrow yet but would the goal not be to decouple that from numpy here in the long run?

@jreback jreback force-pushed the mask branch 2 times, most recently from beff364 to b87e320 Compare March 26, 2019 01:24
@jreback
Copy link
Contributor Author

jreback commented Mar 26, 2019

fixes a few buggies.

@jreback jreback changed the title WIP/EA: BoolArray EA: BoolArray Mar 26, 2019
@jreback jreback force-pushed the mask branch 2 times, most recently from 9afeee7 to dbb3876 Compare April 5, 2019 02:34
WillAyd and others added 2 commits April 21, 2019 12:56
Shared methods to simplify implementation

Added try...except for non-ndarray Series _set_with

Minor fixups

Reverted changes; created new module for mask

Reverted test from master

Revert unnecessary changes from master

Added iter to mask

Added more dunders

More properties / methods
@TomAugspurger
Copy link
Contributor

Do we want to do this for 0.25?

@jreback
Copy link
Contributor Author

jreback commented May 21, 2019

this really needs pyarrow as a hard dep
but that needs some discussion

@TomAugspurger
Copy link
Contributor

TomAugspurger commented May 21, 2019 via email

@jorisvandenbossche
Copy link
Member

It's also not fully clear to me what the goal / scope is of this PR. It seems to introduce a MaskArray / MaskDtype, but is that the way we want to expose this to the user? (instead of BooleanArray?) Or is it not meant for general usage, but only for the mask of the IntegerArray? But then why make this into a full ExtensionArray?

@jreback
Copy link
Contributor Author

jreback commented May 23, 2019

so this PR serves a couple of cases:

  • formalize using a MaskType backed by pyarrow / numpy boolean (non-nullable) types (based off of @TomAugspurger original example); basically seeing what indexing things pop up when we can return a mask type that is not a numpy boolean array (it works pretty reasonably actually)
  • use the MaskType as a backing for the mask in Integer implementation to save memory (1 bit per vs 1 byte now); this also works reasonably well
  • see if I could do the above optionally using pyarrow. The implementation does work, but I had to jump thru quite a few hoops. I think we should simply adopt pyarrow as a formal dependency and this (and other things) become quite a bit easier to do and we can then start using pyarrow kernels and other machinery (hash tables perhaps) in the pandas internals.

cc @pandas-dev/pandas-core

@TomAugspurger
Copy link
Contributor

TomAugspurger commented May 23, 2019 via email

@jreback
Copy link
Contributor Author

jreback commented Jun 27, 2019

closing for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dtype Conversions Unexpected or buggy dtype conversions ExtensionArray Extending pandas with custom dtypes or arrays.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants