-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: Add StringArray.__arrow_array__ for conversion to Arrow #29182
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Add StringArray.__arrow_array__ for conversion to Arrow #29182
Conversation
doc/source/whatsnew/v1.0.0.rst
Outdated
0.15.0), which means that it is supported in writing to the Parquet file | ||
format when using the ``pyarrow`` engine. It is currently not yet supported | ||
when converting back to pandas (so it will become an integer or float | ||
(depending on the presence of missing data) or object dtype column). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maya have an extra parenthesis here.
@@ -182,6 +182,16 @@ def _from_sequence(cls, scalars, dtype=None, copy=False): | |||
def _from_sequence_of_strings(cls, strings, dtype=None, copy=False): | |||
return cls._from_sequence(strings, dtype=dtype, copy=copy) | |||
|
|||
def __arrow_array__(self, type=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't we make a new base EA mixin class for this? e.g. this is pretty generic code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and have String/Integer array use this? (maybe actually just put in the EA base class itself?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now, this one is different than the one in IntegerArray, so it cannot directly be shared with that. I thought about putting this in the PandasArray class, but the StringArray is the only subclass of it that is actually used. So therefore I kept it here. But open to other options.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe actually just put in the EA base class itself?
That's not possible, because it depends on the internals (eg StringArray uses _ndarray
for the data, IntegerArray
uses _data
and _mask
). The only thing that could be put in the base EA class is the signature with a NotImplementedError, but I need to check how that interacts with pyarrow (as the code there does a hasattr check, so if the method is present, it expects it to work)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok cool. yeah this can certainly be handled in a followup; but i think we DO want to generally enable these types of conversions.
The failing check is codecov, so ignoring that |
However, we should probably ensure we are testing somewhere pyarrow 0.15 / latest release (from a quick look that doesn't seem to be included on any build). Will open an issue for that. |
Similar as #28368 (which did this for IntegerArray)