Skip to content

WIP: Bitarray backed Int EAs #22238

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 14 commits into from
Closed

Conversation

WillAyd
Copy link
Member

@WillAyd WillAyd commented Aug 7, 2018

First pass at #21839 and most likely a precursor to #22226.

Very hacky at the moment but was able to get tests to pass locally (will fail CI given new Bitarray depdencendy). Going to work through a refactor of this but sharing as POC in case anyone has feedback.

Generally the hard part here was working around places in the code base that expect an ndarray (ex: via sum method calls) and figuring out how / when to best manage the conversions between bit arrays and ndarrays

@pep8speaks
Copy link

pep8speaks commented Aug 7, 2018

Hello @WillAyd! Thanks for updating the PR.

Line 1295:1: W293 blank line contains whitespace

Comment last updated on December 28, 2018 at 21:50 Hours UTC

@WillAyd WillAyd added the ExtensionArray Extending pandas with custom dtypes or arrays. label Aug 7, 2018
@TomAugspurger
Copy link
Contributor

How active is bitarray development? Would it make sense to just vendor it since it won’t be exposed to users?

@WillAyd
Copy link
Member Author

WillAyd commented Aug 8, 2018

Based on it's pulse it has arguably been more active as of late, though at the same time it does mostly look backed by one person:

image

I'm not opposed to vendoring if we feel like it's easier to support that way

Copy link
Member Author

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Latest commit should be a lot cleaner than before. Again this will fail on CI due to missing dependency, but curious what others think about this change and if it's worth making bit array a dependency for the Int EAs (and assumedly bool going forward).

Memory savings here can be pretty significant

@@ -195,6 +198,23 @@ def coerce_to_array(values, dtype, mask=None, copy=False):
return values, mask


def _numpy_to_bitarray(arr):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This and the method below maybe belong in a shared module, but placing here now for better change visibility

@@ -1172,6 +1173,13 @@ def assert_extension_array_equal(left, right):
assert left.dtype == right.dtype
left_na = left.isna()
right_na = right.isna()

# TODO - maybe generate dedicated method for bitarray comparison?
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment says it all - we maybe need a dedicated method for comparing bit arrays. At the very least there's got to be a more robust way of handling this in the extension check

@@ -988,7 +988,10 @@ def _set_with(self, key, value):
else:
return self._set_values(key, value)
elif key_type == 'boolean':
self._set_values(key.astype(np.bool_), value)
try:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quite a few ways to do this but this was raising when any non-NumPy arrays were hitting this branch. Not directly related to change because it would fail for lists getting here but came to light as bitarrays were hitting this condition.

@WillAyd WillAyd changed the title WIP - Bitarray backed Int EAs Bitarray backed Int EAs Aug 11, 2018
@jreback
Copy link
Contributor

jreback commented Nov 21, 2018

nice idea, this would have to be an optional dep i think, so makes supporting this a bit more tricky. closing as stale.

@jreback jreback closed this Nov 21, 2018
@eyadsibai
Copy link

@WillAyd I was watching ur talk about this PR. Any chance this come to life soon?

@WillAyd
Copy link
Member Author

WillAyd commented Dec 27, 2018

Haven't put much into it lately but if you'd like to pick up where I left off @eyadsibai would certainly be welcome!

@WillAyd
Copy link
Member Author

WillAyd commented Dec 28, 2018

Reopening as I should be able to put more focus in on this one

@WillAyd WillAyd reopened this Dec 28, 2018
@WillAyd
Copy link
Member Author

WillAyd commented Dec 28, 2018

Just merged master though not quite ready for review (providing in case someone else is interested).

@jreback was thinking about the best way to make this optional and was looking at what we do with bottleneck as inspiration. IIUC we have a wrapper for that which intercepts method calls and replaces them with the bottleneck equivalent where applicable.

Was thinking we could do relatively the same thing here, though not sure if you had other thoughts on maybe creating some kind of mask mixin since this could be utilized by perhaps any EA that needs a NA mask

@jreback
Copy link
Contributor

jreback commented Dec 28, 2018

yeah you could make a wrapper class that abstracts this (meaning it would use bitarray if available or a numpy uint_8 if not)( eventually maybe using a pyarrow as another alternative)

you could put in pandas/core/arrays/_mask.py i think

it wouldn’t be a full extension array itself but just expose a way to create a suitable array and work with it

then this can just be a drop in replacement for it in integer_na

i would opt for composition here rather than a mixin

@WillAyd WillAyd changed the title Bitarray backed Int EAs WIP: Bitarray backed Int EAs Jan 4, 2019
@WillAyd
Copy link
Member Author

WillAyd commented Jan 4, 2019

Nowhere near ready for merge and causes tons of failures. Updated title to show as WIP.

Latest commit provided just for any feedback on overall structure

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would use a base class which is numpy and s subclass for a BitArrayMask; then creat via factory function (module level) which does the imports (see how we do this in parquet.py)

I am not sure when conversion to numpy could / should happen - how expensive is this?

@WillAyd
Copy link
Member Author

WillAyd commented Jan 31, 2019

Closing again for now; will re-open once I can get some more time to actively invest, though certainly if anyone else wants to pick this up open to that!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExtensionArray Extending pandas with custom dtypes or arrays.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants