ENH: support the Arrow PyCapsule Interface for importing data #59631
xref #54057, where a user expected this to work already. I would be +1 for a dedicated `from_arrow` method.
Do you know why the PyCapsule interface chose not to specify anything around imports? I vaguely recall some upstream conversations about that, but I'm not sure where it landed. My concern about the Python API is overloading the specification with a bunch of pandas-specific functionality. Maybe that is by design, but having something like a `dtype_backend` keyword there seems strange.
polars has a module-level `from_arrow`. The main problem with a module-level `from_arrow` is that the return type depends on the input (a DataFrame versus a Series). My PR didn't touch the module-level `from_arrow`; it only added support to the `pl.DataFrame(..)` constructor.
It says "This is left up to individual libraries". For example, polars now accepts such objects in its `pl.DataFrame(..)` constructor; see the example below. (Now, while we speak about a public import method, it might certainly be a valid question whether there should be a protocol for import as well, so that you could roundtrip, but that's a different topic, I think.)
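For illustration, the two polars entry points side by side (a minimal sketch; assumes a recent polars where pola-rs/polars#17693 has landed):

```python
import polars as pl
import pyarrow as pa

tbl = pa.table({"a": [1, 2, 3]})

df1 = pl.from_arrow(tbl)   # pre-existing module-level helper, pyarrow-oriented
df2 = pl.DataFrame(tbl)    # constructor consumes the Arrow PyCapsule interface
```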
Why does that seem strange? We have such a keyword in other functions, so why not here? I would say that the point of a dedicated `from_arrow()` method is exactly that it can expose such keywords.
Interesting reference. Personally, I think that by default a method to consume Arrow data should also return default data types (and not `ArrowDtype`). We can give users control over that, though (like with the `dtype_backend` in other IO methods).
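As a sketch of what that control could look like (the function name and keyword are hypothetical, modeled on the `dtype_backend` keyword of existing IO readers; recent pyarrow is assumed as the conversion engine):

```python
import pandas as pd
import pyarrow as pa

def from_arrow(data, dtype_backend: str = "numpy"):
    # pyarrow.table() consumes any object exporting the Arrow PyCapsule
    # interface (__arrow_c_stream__ / __arrow_c_array__) in recent pyarrow.
    table = pa.table(data)
    if dtype_backend == "pyarrow":
        # Map every column to pandas' ArrowDtype instead of numpy dtypes.
        return table.to_pandas(types_mapper=pd.ArrowDtype)
    return table.to_pandas()  # default: plain numpy-backed dtypes
```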
I think because this blurs the line between the PyCapsule interface as an exchange mechanism and that same interface as an end-user API. I'm of the impression our target audience is other developers and their libraries, not necessarily an end user using this like it's an I/O method.
To give a real use case, I've had a need for this in a library I created called pantab. At least from the perspective of that library, I would ideally want the dataframe libraries to all have one consistent interface. That way, my third-party library could just say "OK, whatever dataframe library you are using, I'm just going to send this capsule through to X and you will get back the result you want." If each library overloads its import mechanisms and offers different features, then third-party producers of Arrow data aren't any better off than they are today.
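A minimal sketch of that consistent-interface idea from the third-party library's side (`write_frame` and the backend hand-off are hypothetical, not pantab's actual API):

```python
def write_frame(df) -> None:
    # Accept any dataframe object, regardless of which library produced it,
    # as long as it exports the Arrow C Stream PyCapsule.
    if not hasattr(df, "__arrow_c_stream__"):
        raise TypeError("expected an object exporting __arrow_c_stream__")
    capsule = df.__arrow_c_stream__()  # PyCapsule wrapping an ArrowArrayStream
    _send_to_backend(capsule)

def _send_to_backend(capsule) -> None:
    ...  # hypothetical: unwrap the ArrowArrayStream pointer natively and consume it
```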
The PyCapsule Interface is focused on use cases around importing some foreign data into your library. I think the right way forward is not to specify a specific import API, but rather to advocate for more libraries to look for and understand PyCapsule objects. In your case, you could have:

```python
import polars as pl
import pyarrow as pa
import pyarrow.parquet as pq
from arro3.compute import take

# Creates a polars object
df = pl.DataFrame({...})

# take() understands the polars object via the C Stream interface
# and returns an arro3 RecordBatchReader
filtered = take(df, [1, 4, 2, 5])

# pyarrow.table() understands the arro3 object via the C Stream interface
pq.write_table(pa.table(filtered), "filtered.parquet")
```
In particular, my argument is that an Arrow producer should not choose the user-facing API but rather just expose the data protocol. Then the user can choose how to import the data as they wish.
Absolutely. To be clear, that code was from 7 months ago, before any library (except for pyarrow) started supporting imports. I am definitely trying to solve that pattern, not promote its usage.
Is the PyCapsule available at runtime? I thought it was just for extension authors and not really even inspectable (i.e. can you even do an `isinstance` check against it?). I really like the code that you have there @kylebarron, but the arro3 `RecordBatchReader` is the piece that I think we are missing in pandas. Maybe we need something like that instead of just passing around raw capsules?
Sorry, by "pycapsule objects" I meant to say "instances of classes that have Arrow PyCapsule Interface dunder methods and can export PyCapsules".
Well, that's why I created arro3 😄. I wanted a lightweight (~7MB compared to pyarrow's >100MB) library that can manage Arrow data in a compliant way between libraries, but with nicer high-level APIs than nanoarrow. It has wheels for every platform, including Pyodide.
Well, I don't want to try and boil the ocean here, but if we don't require pyarrow, I wonder whether we should look at requiring arro3 as a fallback. I think there's good value in having another library provide a consistent object like a RecordBatchReader for data exchange like this, and we could just accept that in our Series / DataFrame constructors rather than building it ourselves.
Well, I'd say the point of arro3 is to handle cases like this. But at the same time, being stable enough to be a required pandas dependency is a pretty high bar... I'd say that in managing Arrow data, arro3 is relatively stable, but in managing interop with pandas and numpy it's less stable.
FWIW, a user on Slack today mentioned that calling the constructor directly with a pyarrow object didn't do what they expected. So it might be tough to avoid direct constructor support for the Series, if that is something we ended up being averse to.
Adding an MRE for the example above:
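Presumably something along these lines, assuming the Slack report was about handing a pyarrow `ChunkedArray` straight to the `Series` constructor (the exact snippet is a guess):

```python
import pandas as pd
import pyarrow as pa

ca = pa.chunked_array([[1, 2, 3]])

# Presumably the surprising call: there is no Arrow-aware path
# in the constructor today.
ser = pd.Series(ca)
```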
It looks like the python-oracledb library just implemented support for data exchange over the PyCapsule interface: https://python-oracledb.readthedocs.io/en/latest/user_guide/sql_execution.html#fetching-data-frames

Their documented way to get a dataframe from a capsule is to do something like:

```python
import pandas
import pyarrow

# Get an OracleDataFrame
# Adjust arraysize to tune the query fetch performance
sql = "select * from mytable where id = :1"
myid = 12345  # the bind variable value
odf = connection.fetch_df_all(statement=sql, parameters=[myid], arraysize=1000)

# Get a Pandas DataFrame from the data.
df = pyarrow.Table.from_arrays(
    odf.column_arrays(), names=odf.column_names()
).to_pandas()
```

The more I have thought about it, the more I've leaned towards thinking that accepting an object that exposes the Arrow PyCapsule dunder in a dataframe constructor would probably make for a cleaner API, which would change the above code to:

```python
import pandas as pd

# Get an OracleDataFrame
# Adjust arraysize to tune the query fetch performance
sql = "select * from mytable where id = :1"
myid = 12345  # the bind variable value
odf = connection.fetch_df_all(statement=sql, parameters=[myid], arraysize=1000)

# Get a Pandas DataFrame from the data.
df = pd.DataFrame(odf)
```
FWIW, that can be just `pyarrow.table(odf).to_pandas()`, as long as `OracleDataFrame` exposes the Arrow PyCapsule dunders.
Yea, true. There's definitely some nits to be picked in that documentation... but I think they generally are trying to be consistent about how you make a pandas dataframe versus a polars one.
We have #56587 and #59518 now for exporting pandas DataFrame and Series through the Arrow PyCapsule Interface (i.e. adding `__arrow_c_stream__` methods), but we don't yet have the import counterpart.

For importing, the specification doesn't provide any API guidelines on what this should look like, so we have a couple of options. The two main ones I can think of:

1. A `from_arrow()` method, which could be top level (`pd.from_arrow(..)`) or a class method (`pd.DataFrame.from_arrow(..)`)
2. Accepting such objects in the main constructors (`pd.DataFrame(..)`)

In pandas itself, we do have a couple of `from_..` class methods (`from_dict`/`from_records`), but often those are for objects we also allow in the main constructor (at least for the dict case). I think the main differentiator is that the specific class methods have more specialized keyword arguments (and therefore allow a larger variety of input).

So based on that pattern, we could also do both: add a `DataFrame.from_arrow()` class method, and then also accept such objects in `pd.DataFrame()`, passing them through to `from_arrow()` (which could have more custom options to control how exactly the conversion from Arrow to pandas is done); see the sketch at the bottom.

Looking at polars, it seems they also have both, but I am not entirely sure about the connection between the two. `pl.from_arrow` already existed but might be more specific to pyarrow? And then pola-rs/polars#17693 added it to the main `pl.DataFrame(..)` constructor (@kylebarron).

For geopandas, I added a `GeoDataFrame.from_arrow()` method.

(To be clear, everything said above also applies to `Series()`/`Series.from_arrow()` etc.)

cc @MarcoGorelli @WillAyd