-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
WIP: Bitarray backed Int EAs #22238
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: Bitarray backed Int EAs #22238
Conversation
Hello @WillAyd! Thanks for updating the PR.
Comment last updated on December 28, 2018 at 21:50 Hours UTC |
How active is bitarray development? Would it make sense to just vendor it since it won’t be exposed to users? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Latest commit should be a lot cleaner than before. Again this will fail on CI due to missing dependency, but curious what others think about this change and if it's worth making bit array a dependency for the Int EAs (and assumedly bool going forward).
Memory savings here can be pretty significant
pandas/core/arrays/integer.py
Outdated
@@ -195,6 +198,23 @@ def coerce_to_array(values, dtype, mask=None, copy=False): | |||
return values, mask | |||
|
|||
|
|||
def _numpy_to_bitarray(arr): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This and the method below maybe belong in a shared module, but placing here now for better change visibility
pandas/util/testing.py
Outdated
@@ -1172,6 +1173,13 @@ def assert_extension_array_equal(left, right): | |||
assert left.dtype == right.dtype | |||
left_na = left.isna() | |||
right_na = right.isna() | |||
|
|||
# TODO - maybe generate dedicated method for bitarray comparison? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment says it all - we maybe need a dedicated method for comparing bit arrays. At the very least there's got to be a more robust way of handling this in the extension check
pandas/core/series.py
Outdated
@@ -988,7 +988,10 @@ def _set_with(self, key, value): | |||
else: | |||
return self._set_values(key, value) | |||
elif key_type == 'boolean': | |||
self._set_values(key.astype(np.bool_), value) | |||
try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quite a few ways to do this but this was raising when any non-NumPy arrays were hitting this branch. Not directly related to change because it would fail for lists getting here but came to light as bitarrays were hitting this condition.
nice idea, this would have to be an optional dep i think, so makes supporting this a bit more tricky. closing as stale. |
@WillAyd I was watching ur talk about this PR. Any chance this come to life soon? |
Haven't put much into it lately but if you'd like to pick up where I left off @eyadsibai would certainly be welcome! |
Reopening as I should be able to put more focus in on this one |
Just merged master though not quite ready for review (providing in case someone else is interested). @jreback was thinking about the best way to make this optional and was looking at what we do with bottleneck as inspiration. IIUC we have a wrapper for that which intercepts method calls and replaces them with the bottleneck equivalent where applicable. Was thinking we could do relatively the same thing here, though not sure if you had other thoughts on maybe creating some kind of mask mixin since this could be utilized by perhaps any EA that needs a NA mask |
yeah you could make a wrapper class that abstracts this (meaning it would use bitarray if available or a numpy uint_8 if not)( eventually maybe using a pyarrow as another alternative) you could put in pandas/core/arrays/_mask.py i think it wouldn’t be a full extension array itself but just expose a way to create a suitable array and work with it then this can just be a drop in replacement for it in integer_na i would opt for composition here rather than a mixin |
Nowhere near ready for merge and causes tons of failures. Updated title to show as WIP. Latest commit provided just for any feedback on overall structure |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would use a base class which is numpy and s subclass for a BitArrayMask; then creat via factory function (module level) which does the imports (see how we do this in parquet.py)
I am not sure when conversion to numpy could / should happen - how expensive is this?
Closing again for now; will re-open once I can get some more time to actively invest, though certainly if anyone else wants to pick this up open to that! |
First pass at #21839 and most likely a precursor to #22226.
Very hacky at the moment but was able to get tests to pass locally (will fail CI given new Bitarray depdencendy). Going to work through a refactor of this but sharing as POC in case anyone has feedback.
Generally the hard part here was working around places in the code base that expect an ndarray (ex: via
sum
method calls) and figuring out how / when to best manage the conversions between bit arrays and ndarrays