POC of PDEP-9 (I/O plugins) #53005
@@ -0,0 +1,154 @@
""" | ||
Load I/O plugins from third-party libraries into the pandas namespace. | ||
|
||
Third-party libraries defining I/O plugins register an entrypoint in | ||
the `dataframe.io` group. For example: | ||
|
||
``` | ||
[project.entry-points."dataframe.io"] | ||
repr = "pandas_repr:ReprDataFrameIO" | ||
``` | ||
|
||
The class `ReprDataFrameIO` will implement readers and writers in at | ||
least one of different exchange formats supported by the protocol. | ||
For now a pandas DataFrame or a PyArrow table, in the future probably | ||
nanoarrow, or a Python wrapper to the Arrow Rust implementations. | ||
For example: | ||
|
||
```python | ||
class FancyFormatDataFrameIO: | ||
@staticmethod | ||
def pandas_reader(self, fname): | ||
with open(fname) as f: | ||
return eval(f.read()) | ||
|
||
def pandas_writer(self, fname, mode='w'): | ||
with open(fname, mode) as f: | ||
f.write(repr(self)) | ||
``` | ||
|
||
If the I/O plugin implements a reader or writer supported by pandas, | ||
pandas will create a wrapper function or method to call the reader or | ||
writer from the pandas standard I/O namespaces. For example, for the | ||
entrypoint above with name `repr` and methods `pandas_reader` and | ||
`pandas_writer` pandas will create the next functions and methods: | ||
|
||
- `pandas.read_repr(...)` | ||
- `pandas.Series.to_repr(...)` | ||
- `pandas.DataFrame.to_repr(...)` | ||
|
||
The reader wrappers validates that the returned object is a pandas | ||
DataFrame when the exchange format is `pandas`, and will convert the | ||
other supported objects (e.g. a PyArrow Table) to a pandas DataFrame, | ||
so the result of `pandas.read_repr` is a pandas DataFrame, as the | ||
user would expect. | ||
|
||
If more than one reader or writer with the same name is loaded, pandas | ||
raises an exception. | ||
""" | ||
import functools
from importlib.metadata import entry_points

import pandas as pd

supported_exchange_formats = ["pandas"]

try:
    import pyarrow as pa
except ImportError:
    pa = None
else:
    supported_exchange_formats.append("pyarrow")
def _create_reader_function(io_plugin, exchange_format):
    """
    Create and return a wrapper function for the original I/O reader.

    We can't directly call the original reader implemented in
    the connector, since we want to make sure that the returned value
    of `read_<whatever>` is a pandas DataFrame, so we need to validate
    it and possibly cast it.
    """
    original_reader = getattr(io_plugin, f"{exchange_format}_reader")

    # TODO: Create this function dynamically so the resulting signature
    # contains the original parameters and not `*args` and `**kwargs`
    @functools.wraps(original_reader)
    def reader_wrapper(*args, **kwargs):
        result = original_reader(*args, **kwargs)
        if exchange_format == "pyarrow":
            result = result.to_pandas()

        # validate output type
        if isinstance(result, list):
            assert all(isinstance(item, pd.DataFrame) for item in result)
        elif isinstance(result, dict):
            assert all(
                isinstance(k, str) and isinstance(v, pd.DataFrame)
                for k, v in result.items()
            )
        elif not isinstance(result, pd.DataFrame):
            raise AssertionError("Returned object is not a DataFrame")
        return result

    # TODO: `functools.wraps` changes the name of the wrapped function to
    # the original `pandas_reader`; change it to the function exposed in
    # pandas.
    return reader_wrapper

Review thread on the `result.to_pandas()` conversion:

- mroeschke: Would it be more generic to convert …
- Author: Yes, that's a very good point @mroeschke. Both PyArrow Table and pandas DataFrame support it, so using the interchange protocol on the pandas side of this connectors API seems to solve the problem in a much simpler way. I don't think there should be performance implications in using it, but it may be worth checking that data is not being copied with the current implementations of the interchange protocol. I updated this PR and the lance-reader repo to use what you suggest; please let me know if this is what you had in mind.
- mroeschke: Yes perfect! Thanks
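The wrapper pattern above can be exercised end to end with a tiny stand-in connector. `DummyPlugin` and `make_reader` below are hypothetical names for illustration; `make_reader` mirrors the shape of `_create_reader_function` for the `pandas` exchange format only.

```python
import functools

import pandas as pd


# Hypothetical stand-in for a third-party connector; its reader builds
# a DataFrame from a plain dict instead of reading a real file format.
class DummyPlugin:
    @staticmethod
    def pandas_reader(data):
        return pd.DataFrame(data)


def make_reader(io_plugin, exchange_format="pandas"):
    # Fetch the reader for the requested exchange format and wrap it
    # with the same output validation the module applies.
    original_reader = getattr(io_plugin, f"{exchange_format}_reader")

    @functools.wraps(original_reader)
    def reader_wrapper(*args, **kwargs):
        result = original_reader(*args, **kwargs)
        if not isinstance(result, pd.DataFrame):
            raise AssertionError("Returned object is not a DataFrame")
        return result

    return reader_wrapper


read_dummy = make_reader(DummyPlugin)
df = read_dummy({"x": [1, 2, 3]})
```

Callers of `read_dummy` always get a pandas DataFrame back, or an `AssertionError` if the connector misbehaves.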
def _create_series_writer_function(format_name):
    """
    When calling `Series.to_<whatever>` we call the DataFrame writer, so
    we need to convert the Series to a one-column DataFrame.
    """

    def series_writer_wrapper(self, *args, **kwargs):
        dataframe_writer = getattr(self.to_frame(), f"to_{format_name}")
        dataframe_writer(*args, **kwargs)

    return series_writer_wrapper
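The Series-to-DataFrame forwarding above can be sketched as follows. The `to_fakefmt` name and the dict-recording "writer" are illustrative assumptions, not part of the PR; the toy writer records the frame it receives so the delegation through `to_frame()` is observable.

```python
import pandas as pd

written = {}


def dataframe_writer(self, key):
    # Toy DataFrame "writer": records the frame instead of writing a file.
    written[key] = self


def make_series_writer(format_name):
    # Same shape as `_create_series_writer_function`: a Series method
    # that converts to a one-column DataFrame and calls its writer.
    def series_writer_wrapper(self, *args, **kwargs):
        writer = getattr(self.to_frame(), f"to_{format_name}")
        writer(*args, **kwargs)

    return series_writer_wrapper


# Monkeypatch the toy writer pair onto pandas, as load_io_plugins would.
pd.DataFrame.to_fakefmt = dataframe_writer
pd.Series.to_fakefmt = make_series_writer("fakefmt")

pd.Series([1, 2], name="col").to_fakefmt("out")
```

After the call, `written["out"]` holds a one-column DataFrame named `col`.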
def load_io_plugins():
    """
    Look for entrypoints in the `dataframe.io` group and create the
    corresponding pandas I/O methods.
    """
    for dataframe_io_entry_point in entry_points().get("dataframe.io", []):
        format_name = dataframe_io_entry_point.name
        io_plugin = dataframe_io_entry_point.load()

        for exchange_format in supported_exchange_formats:
            if hasattr(io_plugin, f"{exchange_format}_reader"):
                if hasattr(pd, f"read_{format_name}"):
                    raise RuntimeError(
                        "More than one installed library provides the "
                        f"`read_{format_name}` reader. Please uninstall one "
                        "of the I/O plugins providing connectors for this "
                        "format."
                    )
                setattr(
                    pd,
                    f"read_{format_name}",
                    _create_reader_function(io_plugin, exchange_format),
                )

            if hasattr(io_plugin, f"{exchange_format}_writer"):
                if hasattr(pd.DataFrame, f"to_{format_name}"):
                    raise RuntimeError(
                        "More than one installed library provides the "
                        f"`to_{format_name}` writer. Please uninstall one "
                        "of the I/O plugins providing connectors for this "
                        "format."
                    )
                setattr(
                    pd.DataFrame,
                    f"to_{format_name}",
                    getattr(io_plugin, f"{exchange_format}_writer"),
                )
                setattr(
                    pd.Series,
                    f"to_{format_name}",
                    _create_series_writer_function(format_name),
                )

Review thread on `format_name = dataframe_io_entry_point.name`:

- Reviewer: Where does this name get defined? Assuming from the name of the library itself? If so, maybe worth making this a property of the class so that there is some flexibility for package authors.
- Author: This is the name of the entrypoint. Package authors define it explicitly as the name pandas will use in …
- Reviewer: Is there any validity to one package providing multiple read/write implementations? An example might be Excel, where one package offers read_xls alongside read_xlsx, etc.
- Author: There shouldn't be any limitation about that.
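The registration loop in `load_io_plugins` can be simulated without installing a real package by faking the entrypoint object. `FakeEntryPoint` and the `fake` format name below are illustrative assumptions; the snippet mimics one iteration of the loop for the `pandas` exchange format.

```python
import pandas as pd


# Hypothetical connector exposing only a pandas-format reader.
class FakePluginIO:
    @staticmethod
    def pandas_reader(data):
        # Stand-in for reading a real format: build a DataFrame from a dict.
        return pd.DataFrame(data)


# Minimal fake of an importlib.metadata entrypoint: a name plus load().
class FakeEntryPoint:
    name = "fake"

    def load(self):
        return FakePluginIO


# One iteration of the registration loop, guarded against name clashes.
ep = FakeEntryPoint()
io_plugin = ep.load()
if hasattr(io_plugin, "pandas_reader") and not hasattr(pd, f"read_{ep.name}"):
    setattr(pd, f"read_{ep.name}", io_plugin.pandas_reader)

df = pd.read_fake({"a": [1, 2]})
print(type(df).__name__)  # DataFrame
```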
Review thread on path handling:

- twoertwein: Would it be worth for a third-party reader/writer to have a way of declaring whether we should "pre-process" a `path` argument? `get_handle` currently does a lot of heavy lifting to unify compression, text/binary, str/file-object, and local/remote files across most I/O functions.
- Author: Very good point @twoertwein, thanks for bringing this up. I remember there is a library that does all that in a similar way to what pandas implements. I couldn't find it now, but probably the simplest option is to let connectors use that library and keep this out of the connector API. What do you think?
- twoertwein: I have no strong opinion: for end-users it would be nice to know that any read/to function accepts compression/file-handles/pathlib/str, but it would also be nice to not make this PDEP more complicated :)