POC of PDEP-9 (I/O plugins) #53005
Changes from all commits
c0d0115
91da43a
67a69a9
2439ed9
2b0e13f
000ea21
59b0c3a
b511fe4
51f7588
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,176 @@ | ||
""" | ||
Load I/O plugins from third-party libraries into the pandas namespace. | ||
|
||
Third-party libraries defining I/O plugins register an entrypoint in | ||
the `dataframe.io` group. For example: | ||
|
||
``` | ||
[project.entry-points."dataframe.io"] | ||
repr = "pandas_repr:ReprDataFrameIO" | ||
``` | ||
|
||
The class `ReprDataFrameIO` will implement at least one of a reader | ||
and a writer that supports the dataframe interchange protocol: | ||
|
||
https://data-apis.org/dataframe-protocol/latest/API.html | ||
|
||
For example: | ||
|
||
```python | ||
class ReprDataFrameIO: | ||
@staticmethod | ||
def reader(fname): | ||
with open(fname) as f: | ||
# for simplicity this assumes eval will create a DataFrame object | ||
return eval(f.read()) | ||
|
||
def writer(self, fname, mode='w'): | ||
with open(fname, mode) as f: | ||
f.write(repr(self)) | ||
``` | ||
|
||
pandas will create wrapper functions or methods to call the reader or | ||
writer from the pandas standard I/O namespaces. For example, for the | ||
entrypoint above with name `repr` and both methods `reader` and | ||
`writer` implemented, pandas will create the following functions and methods: | ||
|
||
- `pandas.read_repr(...)` | ||
- `pandas.Series.to_repr(...)` | ||
- `pandas.DataFrame.to_repr(...)` | ||
|
||
The reader wrappers make sure that the returned object is a pandas | ||
DataFrame, since the user always expects the return of `read_*()` | ||
to be a pandas DataFrame, no matter what the connector returns. | ||
In a few cases, the return can be a list or dict of dataframes, which | ||
is supported. | ||
|
||
If more than one reader or writer with the same name is loaded, pandas | ||
raises an exception. For example, if two connectors use the name | ||
`arrow`, pandas will raise when `load_io_plugins()` is called, since | ||
only one `pandas.read_arrow` function can exist, and pandas should not | ||
make an arbitrary decision on which to use. | ||
""" | ||
import functools | ||
import warnings | ||
from importlib.metadata import entry_points | ||
import importlib_metadata | ||
|
||
import pandas as pd | ||
|
||
|
||
def _create_reader_function(io_plugin): | ||
""" | ||
Create and return a wrapper function for the original I/O reader. | ||
|
||
We can't directly call the original reader implemented in | ||
the connector, since the return of third-party connectors is not necessarily | ||
a pandas DataFrame but any object supporting the dataframe interchange | ||
protocol. We make sure here that `read_<whatever>` returns a pandas DataFrame. | ||
""" | ||
|
||
# TODO: Create this function dynamically so the resulting signature contains | ||
# the original parameters and not `*args` and `**kwargs` | ||
@functools.wraps(io_plugin.reader) | ||
def reader_wrapper(*args, **kwargs): | ||
result = io_plugin.reader(*args, **kwargs) | ||
|
||
if isinstance(result, list): | ||
result = [pd.api.interchange.from_dataframe(df) for df in result] | ||
elif isinstance(result, dict): | ||
result = { | ||
k: pd.api.interchange.from_dataframe(df) for k, df in result.items() | ||
} | ||
else: | ||
result = pd.api.interchange.from_dataframe(result) | ||
|
||
return result | ||
|
||
# TODO `functools.wraps` changes the name of the wrapper function to the | ||
# original `pandas_reader`, change it to the function exposed in pandas. | ||
return reader_wrapper | ||
|
||
|
||
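The second TODO in `_create_reader_function` notes that `functools.wraps` copies the plugin reader's name onto the wrapper. A minimal sketch of one way to address it, assuming it is acceptable to overwrite `__name__` and `__qualname__` after wrapping. `make_named_wrapper` and `plugin_reader` are hypothetical names used only for illustration, not part of this PR:

```python
import functools


def plugin_reader(fname, sep=","):
    """Hypothetical third-party reader, standing in for `io_plugin.reader`."""
    return f"read {fname} with sep={sep!r}"


def make_named_wrapper(func, exposed_name):
    """Wrap `func`, then rename the wrapper to the name exposed to users."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)

    # functools.wraps copied the name `plugin_reader` into these attributes;
    # replace them with the public name so help() and tracebacks show it.
    wrapper.__name__ = exposed_name
    wrapper.__qualname__ = exposed_name
    return wrapper


read_repr = make_named_wrapper(plugin_reader, "read_repr")
```

Because `functools.wraps` also sets `__wrapped__`, `inspect.signature(read_repr)` should still report the original parameters, which may partly cover the first TODO as well.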
def _create_series_writer_function(format_name): | ||
""" | ||
When calling `Series.to_<whatever>` we call the dataframe writer, so | ||
we need to convert the Series to a one column dataframe. | ||
""" | ||
|
||
def series_writer_wrapper(self, *args, **kwargs): | ||
dataframe_writer = getattr(self.to_frame(), f"to_{format_name}") | ||
dataframe_writer(*args, **kwargs) | ||
|
||
return series_writer_wrapper | ||
|
||
|
||
def _warn_conflict(func_name, format_name, loaded_plugins, module): | ||
package_to_load = importlib_metadata.packages_distributions()[module.__name__] | ||
if format_name in loaded_plugins: | ||
# conflict with a third-party connector | ||
loaded_module = loaded_plugins[format_name] | ||
loaded_package = importlib_metadata.packages_distributions()[ | ||
loaded_module.__name__ | ||
] | ||
msg = ( | ||
f"Unable to create `{func_name}`. " | ||
f"A conflict exists, because the packages `{loaded_package}` and " | ||
f"`{package_to_load}` both provide the connector for the '{format_name}' format. " | ||
"Please uninstall one of the packages and leave in the current " | ||
f"environment only the one you want to use for the '{format_name}' format." | ||
) | ||
else: | ||
# conflict with a pandas connector | ||
msg = ( | ||
f"The package `{package_to_load}` registers `{func_name}`, which is " | ||
"already provided by pandas. The plugin will be ignored." | ||
) | ||
|
||
warnings.warn(msg, UserWarning, stacklevel=1) | ||
|
||
|
||
def load_io_plugins(): | ||
""" | ||
Looks for entrypoints in the `dataframe.io` group and creates the | ||
corresponding pandas I/O methods. | ||
""" | ||
loaded_plugins = {} | ||
|
||
for dataframe_io_entry_point in entry_points().get("dataframe.io", []): | ||
format_name = dataframe_io_entry_point.name | ||
Review comment: Where does this name get defined? Assuming from the name of the library itself? If so maybe worth making this a property of the class so that there is some flexibility for package authors |
Reply: This is the name of the entrypoint. Package authors define it explicitly as the name pandas will use in |
Review comment: Is there any validity to one package providing multiple read/write implementations? An example might be excel where one package offers read_xls alongside read_xlsx, etc... |
Reply: There shouldn't be any limitation about that |
||
io_plugin = dataframe_io_entry_point.load() | ||
|
||
if hasattr(io_plugin, "reader"): | ||
func_name = f"read_{format_name}" | ||
if hasattr(pd, func_name): | ||
_warn_conflict( | ||
f"pandas.{func_name}", format_name, loaded_plugins, io_plugin | ||
) | ||
delattr(pd, func_name) | ||
Review comment: Is there any risk we'll remove anything unintentional? |
Reply: I don't think it should happen. I'll think if we can detect the conflicts before registering anything. I think it's tricky the way it's implemented now, but I think it's easier if we use separate entrypoints for readers and writers, which can be a good idea. If the general concept seems fine, happy to improve the implementation. |
||
else: | ||
setattr( | ||
pd, | ||
f"read_{format_name}", | ||
_create_reader_function(io_plugin), | ||
) | ||
|
||
if hasattr(io_plugin, "writer"): | ||
func_name = f"to_{format_name}" | ||
if hasattr(pd.DataFrame, func_name): | ||
_warn_conflict( | ||
f"DataFrame.{func_name}", format_name, loaded_plugins, io_plugin | ||
) | ||
delattr(pd.DataFrame, func_name) | ||
delattr(pd.Series, func_name) | ||
else: | ||
setattr( | ||
pd.DataFrame, | ||
func_name, | ||
getattr(io_plugin, "writer"), | ||
) | ||
setattr( | ||
pd.Series, | ||
func_name, | ||
_create_series_writer_function(format_name), | ||
) | ||
|
||
loaded_plugins[format_name] = io_plugin |
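The review thread on `delattr` raises the option of detecting conflicts before registering anything. A minimal sketch of that two-pass idea, using plain stand-ins rather than pandas or real entry points. `find_conflicts`, `register_readers`, `fake_pd`, and the sample package names are all hypothetical:

```python
from collections import defaultdict
from types import SimpleNamespace


def find_conflicts(candidates, namespace):
    """Return {func_name: [package, ...]} for names that cannot be registered.

    `candidates` is a list of (format_name, package_name) pairs. A name
    conflicts if two packages claim it or if `namespace` already defines it.
    """
    claimed = defaultdict(list)
    for format_name, package in candidates:
        claimed[f"read_{format_name}"].append(package)
    return {
        name: packages
        for name, packages in claimed.items()
        if len(packages) > 1 or hasattr(namespace, name)
    }


def register_readers(candidates, namespace):
    """First pass finds conflicts; second pass registers only safe names."""
    conflicts = find_conflicts(candidates, namespace)
    for format_name, package in candidates:
        name = f"read_{format_name}"
        if name not in conflicts:
            # Placeholder value; real code would set a wrapper function here.
            setattr(namespace, name, package)
    return conflicts


# Stand-in for the pandas namespace, with one built-in reader already present.
fake_pd = SimpleNamespace(read_csv="builtin")
plugins = [("arrow", "pkg_a"), ("arrow", "pkg_b"), ("csv", "pkg_c"), ("repr", "pkg_d")]
conflicts = register_readers(plugins, fake_pd)
```

With this ordering nothing is set until all conflicts are known, so no `delattr` call is needed and existing attributes such as `read_csv` are never touched.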
Review comment: Is this an external package? Then make it an (optional) dependency?
Reply: It's a standard library I think, I'll confirm, in case it was installed in the environment by some other package.
Review comment: Looks like the stdlib one is `importlib.metadata`, and `importlib_metadata` is the backport - does it work with the stdlib one? Seems it's new in py3.8: https://docs.python.org/3/library/importlib.metadata.html
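On the question of whether the stdlib module is enough: as far as I know, `importlib.metadata.entry_points()` accepts a `group=` keyword from Python 3.10 and returns a dict keyed by group on 3.8 and 3.9, and `packages_distributions()` is only in the stdlib from 3.10. A minimal sketch of a helper that reads a group with the stdlib alone, under those assumptions. `iter_entry_points` and the group name in the usage line are hypothetical:

```python
from importlib.metadata import entry_points


def iter_entry_points(group):
    """Return the entry points registered under `group` as a list."""
    try:
        # Python 3.10+: select the group directly.
        return list(entry_points(group=group))
    except TypeError:
        # Python 3.8/3.9: entry_points() takes no arguments and returns a
        # dict mapping group names to entry points.
        return list(entry_points().get(group, []))


# A group name assumed not to be registered by any installed package.
plugins = iter_entry_points("dataframe.io.hypothetical_missing_group")
```

A helper like this could replace the `entry_points().get("dataframe.io", [])` call in `load_io_plugins()`, which relies on the dict-style return value.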