ENH: add basic DataFrame.from_arrow class method for importing through Arrow PyCapsule interface #59696
base: main
Changes from all commits
@@ -205,6 +205,8 @@

```python
    AnyAll,
    AnyArrayLike,
    ArrayLike,
    ArrowArrayExportable,
    ArrowStreamExportable,
    Axes,
    Axis,
    AxisInt,
```
@@ -1746,6 +1748,54 @@ def __rmatmul__(self, other) -> DataFrame:

```python
    # ----------------------------------------------------------------------
    # IO methods (to / from other formats)

    @classmethod
    def from_arrow(
        cls, data: ArrowArrayExportable | ArrowStreamExportable
    ) -> DataFrame:
        """
        Construct a DataFrame from a tabular Arrow object.

        This function accepts any tabular Arrow object implementing
        the `Arrow PyCapsule Protocol`_ (i.e. having an ``__arrow_c_array__``
        or ``__arrow_c_stream__`` method).

        This function currently relies on ``pyarrow`` to convert the tabular
        object in Arrow format to pandas.

        .. _Arrow PyCapsule Protocol: https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html

        .. versionadded:: 3.0

        Parameters
        ----------
        data : pyarrow.Table or Arrow-compatible table
            Any tabular object implementing the Arrow PyCapsule Protocol
            (i.e. has an ``__arrow_c_array__`` or ``__arrow_c_stream__``
            method).

        Returns
        -------
        DataFrame
        """
        pa = import_optional_dependency("pyarrow", min_version="14.0.0")
        if not isinstance(data, pa.Table):
            if not (
                hasattr(data, "__arrow_c_array__")
                or hasattr(data, "__arrow_c_stream__")
            ):
                # explicitly test this, because otherwise we would accept
                # various other input types through the pa.table(..) call
                raise TypeError(
                    "Expected an Arrow-compatible tabular object (i.e. having an "
                    "'__arrow_c_array__' or '__arrow_c_stream__' method), got "
                    f"'{type(data).__name__}' instead."
                )
            data = pa.table(data)
```
Review thread on `data = pa.table(data)`:

Comment: Does this actually work for things that only expose `__arrow_c_array__`?

```python
In [28]: arr = pa.array([1, 2, 3])

In [29]: hasattr(arr, "__arrow_c_array__")
Out[29]: True

In [30]: pa.table(arr)
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
Cell In[30], line 1
----> 1 pa.table(arr)

File ~/mambaforge/envs/scratchpad/lib/python3.12/site-packages/pyarrow/table.pxi:6022, in pyarrow.lib.table()

File ~/mambaforge/envs/scratchpad/lib/python3.12/site-packages/pyarrow/table.pxi:5841, in pyarrow.lib.record_batch()

File ~/mambaforge/envs/scratchpad/lib/python3.12/site-packages/pyarrow/table.pxi:3886, in pyarrow.lib.RecordBatch._import_from_c_device_capsule()

File ~/mambaforge/envs/scratchpad/lib/python3.12/site-packages/pyarrow/error.pxi:155, in pyarrow.lib.pyarrow_internal_check_status()

File ~/mambaforge/envs/scratchpad/lib/python3.12/site-packages/pyarrow/error.pxi:92, in pyarrow.lib.check_status()

ArrowInvalid: Cannot import schema: ArrowSchema describes non-struct type int64
```

Comment: Exposing […]

Comment: And to be fair, RecordBatch has both `__arrow_c_array__` and `__arrow_c_stream__`.
Comment: Ah OK, that's good to know. So essentially it's up to the producer to determine if this makes sense, right? I think there is still a consistency problem with how we as a consumer then work. A RecordBatch can be read through both the array and stream interface, but a Table can only be read through the latter (unless it is forced to consolidate chunks and produce an Array). I'm sure PyArrow has that covered well, but unless the spec gets clarified about how the array interface is expected to work, that might push libraries into making the (assumedly poor) decision that their streams should also produce consolidated array data.

Comment: I'd say it's up to the consumer to decide if the input makes sense. The producer just says "here's my data". But I think the key added part is user intention. A struct array can represent either one array or a full RecordBatch, and we need a hint from the user for which is which. This is why I couldn't add PyCapsule Interface support to […]. I'm not sure I follow the rest of your comment @WillAyd. A stream never needs to concatenate data before starting the stream.
Comment: A theoretical example is a library that produces Arrow data thinking that they need to implement […].

Comment: Maybe the spec should be more explicit about when to implement which interface. I think it's implicit that a RecordBatch can implement both, because both are zero-copy, but a Table should only implement the stream interface, because only the stream interface is always zero-copy. I raised an issue a while ago to discuss consumer implications, if you haven't seen it: apache/arrow#40648

Comment: Ah OK great - thanks for sharing. I'll track that issue upstream.
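To make the RecordBatch/Table asymmetry discussed above concrete, here is a small editorial sketch (not part of the PR) that checks which capsule methods each pyarrow container exposes. It assumes pyarrow >= 14.0.0, where the PyCapsule methods were introduced; the exact set of methods per class may differ across pyarrow versions:

```python
import pyarrow as pa  # assumes pyarrow >= 14.0.0

batch = pa.record_batch([pa.array([1, 2, 3])], names=["a"])
table = pa.table({"a": [1, 2, 3]})
array = pa.array([1, 2, 3])

# Per the thread above: a RecordBatch can be consumed through both
# interfaces, a Table only through the stream interface (which is
# always zero-copy), and a plain Array only through the array interface.
for obj in (batch, table, array):
    print(
        f"{type(obj).__name__}: "
        f"array={hasattr(obj, '__arrow_c_array__')}, "
        f"stream={hasattr(obj, '__arrow_c_stream__')}"
    )
```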
```python
        df = data.to_pandas()
        return df

    @classmethod
    def from_dict(
        cls,
```
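For reference, a usage sketch of the method this diff adds (hypothetical until the PR is merged; it assumes a pandas build that includes `from_arrow` and pyarrow >= 14.0.0). `StreamOnly` is a made-up stand-in for a non-pyarrow producer that only implements `__arrow_c_stream__`:

```python
import pyarrow as pa
import pandas as pd  # assumes a pandas build that includes this PR

table = pa.table({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# A pyarrow.Table is converted directly via Table.to_pandas().
df = pd.DataFrame.from_arrow(table)

# Any non-pyarrow producer is accepted as long as it implements the
# PyCapsule protocol. StreamOnly is a hypothetical minimal producer
# that only exposes __arrow_c_stream__, delegating to a pyarrow Table.
class StreamOnly:
    def __init__(self, tbl):
        self._tbl = tbl

    def __arrow_c_stream__(self, requested_schema=None):
        return self._tbl.__arrow_c_stream__(requested_schema)

df2 = pd.DataFrame.from_arrow(StreamOnly(table))

# Inputs without either capsule method hit the explicit TypeError above.
try:
    pd.DataFrame.from_arrow({"a": [1, 2, 3]})
except TypeError as err:
    print(err)
```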
Comment: Nit: maybe this should link to the stream interface page instead? https://arrow.apache.org/docs/format/CStreamInterface.html