-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
EA: BoolArray #25415
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
EA: BoolArray #25415
Conversation
Codecov Report
@@ Coverage Diff @@
## master #25415 +/- ##
===========================================
- Coverage 91.73% 41.64% -50.09%
===========================================
Files 173 175 +2
Lines 52845 53041 +196
===========================================
- Hits 48479 22091 -26388
- Misses 4366 30950 +26584
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #25415 +/- ##
===========================================
- Coverage 91.98% 40.72% -51.27%
===========================================
Files 175 179 +4
Lines 52371 52764 +393
===========================================
- Hits 48175 21488 -26687
- Misses 4196 31276 +27080
Continue to review full report at Codecov.
|
so this could be simplfiied if we make pyarrow a hard dependency :-D, then we don't need the numpy compat layer. |
|
||
class BoolArray(ExtensionArray): | ||
"""Common baseclass for both pyarrow and numpy masked arrays""" | ||
_typ = "maskarray" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this make isinstance(BoolArray(), ABCExtensionArray)
false?
return len(self) | ||
|
||
def __eq__(self, other): | ||
return np.array(self, copy=False) == np.array(other, copy=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May be good to keep a list of things requiring a cast to NumPy at the top of this file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And perhaps make JIRAs for them as well.
This, isnull()
, unary / binops, reductions.
a1e8d74
to
b193658
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have much to say on the implementation, but noticed a confusing fixture name...
return hash(self) == hash(other) | ||
|
||
|
||
class MaskArray(ExtensionArray): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More of a question than anything else but is there a reason for defining a lot of the methods in this base class rather than in subclasses? Not terribly familiar with pyarrow yet but would the goal not be to decouple that from numpy here in the long run?
beff364
to
b87e320
Compare
fixes a few buggies. |
9afeee7
to
dbb3876
Compare
Shared methods to simplify implementation Added try...except for non-ndarray Series _set_with Minor fixups Reverted changes; created new module for mask Reverted test from master Revert unnecessary changes from master Added iter to mask Added more dunders More properties / methods
Do we want to do this for 0.25? |
this really needs pyarrow as a hard dep |
Ah yeah I forgot about that complication. OK with pushing for now then.
…On Tue, May 21, 2019 at 12:03 PM Jeff Reback ***@***.***> wrote:
this really needs pyarrow as a hard dep
but that needs some discussion
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#25415?email_source=notifications&email_token=AAKAOIUFEY7DGZCDEPEGYG3PWQTMPA5CNFSM4GZIJUCKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODV4RRBQ#issuecomment-494475398>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAKAOIRDO263BNLMTFYTVBDPWQTMPANCNFSM4GZIJUCA>
.
|
It's also not fully clear to me what the goal / scope is of this PR. It seems to introduce a MaskArray / MaskDtype, but is that the way we want to expose this to the user? (instead of BooleanArray?) Or is it not meant for general usage, but only for the mask of the IntegerArray? But then why make this into a full ExtensionArray? |
so this PR serves a couple of cases:
cc @pandas-dev/pandas-core |
Thoughts on when we would want to take on pyarrow as a required dependency?
After 1.0?
…On Thu, May 23, 2019 at 7:29 AM Jeff Reback ***@***.***> wrote:
so this PR serves a couple of cases:
- formalize using a MaskType backed by pyarrow / numpy boolean
(non-nullable) types (based off of @TomAugspurger
<https://github.com/TomAugspurger> original example); basically seeing
what indexing things pop up when we can return a mask type that is not a
numpy boolean array (it works pretty reasonably actually)
- use the MaskType as a backing for the mask in Integer implementation
to save memory (1 bit per vs 1 byte now); this also works reasonably well
- see if I could do the above optionally using pyarrow. The
implementation does work, but I had to jump thru quite a few hoops. I think
we should simply adopt pyarrow as a formal dependency and this (and
other things) become quite a bit easier to do and we can then start using
pyarrow kernels and other machinery (hash tables perhaps) in the pandas
internals.
cc @pandas-dev/pandas-core
<https://github.com/orgs/pandas-dev/teams/pandas-core>
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#25415?email_source=notifications&email_token=AAKAOIX2Q64HEFKWHGAYDXDPW2E2XA5CNFSM4GZIJUCKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODWCB6OY#issuecomment-495198011>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAKAOITLIHXYYEFQVBBX45TPW2E2XANCNFSM4GZIJUCA>
.
|
closing for now. |
this builds on #22226 and #22238 by @WillAyd and the
ArrowBoolArray
by @TomAugspurger to create a first class bool array that is backed by a numpy EA array or a pyarrow array if available. Uses this MaskArray as the mask inside IntegerArray. The is a POC to see how feasible this is generally. Note that I did this withbitarray
as well, but this is more general purpose / forward looking.Some decent memory savings as the bits are packed by arrow.
PR:
0.24.1 / master
TODO:
pyarrow
usespandas
as a dependency, so we have a hoop to jump to actually have this import in a sane waypd.bool_
dtype to avoid exposing the pyarrow dtypes directly (mainly because we have this as an optional dep and now BoolArray can either benp.bool_
orarrow_bool
so need a single type to expose to end users; this is not hard just not done).BoolArray
allows NaN's when its an ArrowBoolArray but not for a NumpyBoolArray; I avoided this entirely but just using this as a masking array (so its guaranteed as only bool types), but eventually want to allow thisShould just renameThis is now internal only,MaskArray
toBoolArray
I think.BoolArray
will use this.cc @pandas-dev/pandas-core