POC of PDEP-9 (I/O plugins) #53005

Closed
wants to merge 9 commits into from
90 changes: 63 additions & 27 deletions pandas/io/_plugin_loader.py
@@ -51,7 +51,9 @@ def writer(self, fname, mode='w'):
 make an arbitrary decision on which to use.
 """
 import functools
+import warnings
 from importlib.metadata import entry_points
+import importlib_metadata
Contributor

Is this an external package? Then make it an (optional) dependency?

Member Author

I think it's part of the standard library. I'll confirm, in case it was only installed in my environment by some other package.

Member

Looks like the stdlib one is `importlib.metadata`, and `importlib_metadata` is the backport. Does it work with the stdlib one? It seems it's new in Python 3.8: https://docs.python.org/3/library/importlib.metadata.html
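A minimal sketch of how the entry point lookup could rely on the stdlib module only. The helper name `iter_group_entry_points` is hypothetical and not part of the PR. It assumes that `entry_points()` returns an object with a `select` method on Python 3.10+ and a plain dict keyed by group on 3.8 and 3.9. Note that `packages_distributions()`, which the PR also calls, is only in the stdlib from Python 3.10, so older versions would still need the backport for that call.

```python
from importlib.metadata import entry_points


def iter_group_entry_points(group="dataframe.io"):
    """Return the entry points registered under ``group`` as a list."""
    all_entry_points = entry_points()
    if hasattr(all_entry_points, "select"):
        # Python 3.10+: use the selection API.
        return list(all_entry_points.select(group=group))
    # Python 3.8 and 3.9: entry_points() returns a dict keyed by group name.
    return list(all_entry_points.get(group, []))
```

With a helper like this, `load_io_plugins` would not need `entry_points().get(...)`, which is the dict-style access that newer Python versions no longer support.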


import pandas as pd

@@ -75,8 +77,9 @@ def reader_wrapper(*args, **kwargs):
         if isinstance(result, list):
             result = [pd.api.interchange.from_dataframe(df) for df in result]
         elif isinstance(result, dict):
-            result = {k: pd.api.interchange.from_dataframe(df)
-                      for k, df in result.items()}
+            result = {
+                k: pd.api.interchange.from_dataframe(df) for k, df in result.items()
+            }
         else:
             result = pd.api.interchange.from_dataframe(result)

@@ -92,49 +95,82 @@ def _create_series_writer_function(format_name):
     When calling `Series.to_<whatever>` we call the dataframe writer, so
     we need to convert the Series to a one column dataframe.
     """
+
     def series_writer_wrapper(self, *args, **kwargs):
         dataframe_writer = getattr(self.to_frame(), f"to_{format_name}")
         dataframe_writer(*args, **kwargs)

     return series_writer_wrapper


+def _warn_conflict(func_name, format_name, loaded_plugins, module):
+    package_to_load = importlib_metadata.packages_distributions()[module.__name__]
+    if format_name in loaded_plugins:
+        # conflict with a third-party connector
+        loaded_module = loaded_plugins[format_name]
+        loaded_package = importlib_metadata.packages_distributions()[
+            loaded_module.__name__
+        ]
+        msg = (
+            f"Unable to create `{func_name}`. "
+            f"A conflict exists, because the packages `{loaded_package}` and "
+            f"`{package_to_load}` both provide the connector for the '{format_name}' format. "
+            "Please uninstall one of the packages and leave in the current "
+            "environment only the one you want to use for the '{format_name}' format."
+        )
+    else:
+        # conflict with a pandas connector
+        msg = (
+            f"The package `{package_to_load}` registers `{func_name}`, which is "
+            "already provided by pandas. The plugin will be ignored."
+        )
+
+    warnings.warn(msg, UserWarning, stacklevel=1)
+
+
 def load_io_plugins():
     """
     Looks for entrypoints in the `dataframe.io` group and creates the
     corresponding pandas I/O methods.
     """
+    loaded_plugins = {}
+
     for dataframe_io_entry_point in entry_points().get("dataframe.io", []):
         format_name = dataframe_io_entry_point.name
Member

Where does this name get defined? Assuming from the name of the library itself? If so maybe worth making this a property of the class so that there is some flexibility for package authors

Member Author

This is the name of the entrypoint. Package authors define it explicitly as the name pandas will use in `read_<name>`... It's not used for anything else. The only constraint is that Dask, Vaex, Polars and others will get the same name if they ever use this connector API. Personally I think that's good, but I'm not sure whether the same connector would ever want different names in different libraries.

Member

Is there any validity to one package providing multiple read/write implementations? An example might be Excel, where one package offers `read_xls` alongside `read_xlsx`, etc.

Member Author

There shouldn't be any limitation on that.
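To illustrate the two answers above, here is a small sketch of how one package could expose several formats, with each entry point name becoming the suffix of the generated functions. The module paths and connector class names in `value` are made up for illustration. The objects are built by hand with `importlib.metadata.EntryPoint` rather than read from an installed package.

```python
from importlib.metadata import EntryPoint

# Hypothetical: a single package declaring two entry points in the
# `dataframe.io` group, one per file format it supports.
entry_points_of_one_package = [
    EntryPoint(
        name="xls",
        value="excel_connector.xls:XlsConnector",
        group="dataframe.io",
    ),
    EntryPoint(
        name="xlsx",
        value="excel_connector.xlsx:XlsxConnector",
        group="dataframe.io",
    ),
]


def function_names(entry_point):
    """Names the loader in this PR would derive from an entry point."""
    return f"read_{entry_point.name}", f"to_{entry_point.name}"


names = [function_names(ep) for ep in entry_points_of_one_package]
```

Because the function names depend only on the entry point name, nothing ties a format name to the name of the package that ships it.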

         io_plugin = dataframe_io_entry_point.load()

         if hasattr(io_plugin, "reader"):
-            if hasattr(pd, f"read_{format_name}"):
-                raise RuntimeError(
-                    "More than one installed library provides the "
-                    "`read_{format_name}` reader. Please uninstall one of "
-                    "the I/O plugins providing connectors for this format."
+            func_name = f"read_{format_name}"
+            if hasattr(pd, func_name):
+                _warn_conflict(
+                    f"pandas.{func_name}", format_name, loaded_plugins, io_plugin
                 )
+                delattr(pd, func_name)
Contributor

Is there any risk we'll remove anything unintentional?

Member Author

I don't think it should happen. I'll think about whether we can detect the conflicts before registering anything. That's tricky with the current implementation, but I think it would be easier if we used separate entrypoints for readers and writers, which could be a good idea anyway.

If the general concept seems fine, happy to improve the implementation.
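One way the "detect conflicts before registering anything" idea could look is a two-pass approach: first collect every candidate reader, then register only the names with exactly one provider that do not already exist on the target. This is a sketch under assumptions, not the PR's implementation. `plan_readers`, `register_readers` and the `(package_name, format_name, plugin)` tuples are hypothetical, and `target` stands in for the `pandas` module so the example runs without touching pandas.

```python
import warnings
from collections import defaultdict


def plan_readers(target, plugins):
    """
    Split reader candidates into those safe to register and those in conflict.

    ``plugins`` is an iterable of ``(package_name, format_name, plugin)``.
    """
    providers = defaultdict(list)
    for package_name, format_name, plugin in plugins:
        if hasattr(plugin, "reader"):
            providers[f"read_{format_name}"].append((package_name, plugin.reader))

    to_register, conflicts = {}, {}
    for func_name, found in providers.items():
        packages = [package_name for package_name, _ in found]
        if hasattr(target, func_name):
            # The target (pandas in the PR) already provides this reader.
            conflicts[func_name] = ["<built-in>", *packages]
        elif len(found) > 1:
            # Two or more plugins claim the same format.
            conflicts[func_name] = packages
        else:
            to_register[func_name] = found[0][1]
    return to_register, conflicts


def register_readers(target, plugins):
    """Warn about conflicts, then register only the unambiguous readers."""
    to_register, conflicts = plan_readers(target, plugins)
    for func_name, packages in conflicts.items():
        warnings.warn(
            f"`{func_name}` is provided by {packages}; not registering it.",
            UserWarning,
            stacklevel=2,
        )
    for func_name, reader in to_register.items():
        setattr(target, func_name, reader)
    return conflicts
```

Since nothing is set on the target until all candidates are known, no `delattr` is needed, which sidesteps the concern about removing something unintentionally.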

+            else:
+                setattr(
+                    pd,
+                    f"read_{format_name}",
Contributor

f"read_{format_name}" -> func_name.

+                    _create_reader_function(io_plugin),
+                )
-            setattr(
-                pd,
-                f"read_{format_name}",
-                _create_reader_function(io_plugin),
-            )

         if hasattr(io_plugin, "writer"):
-            if hasattr(pd.DataFrame, f"to_{format_name}"):
-                raise RuntimeError(
-                    "More than one installed library provides the "
-                    "`to_{format_name}` reader. Please uninstall one of "
-                    "the I/O plugins providing connectors for this format."
+            func_name = f"to_{format_name}"
+            if hasattr(pd.DataFrame, func_name):
+                _warn_conflict(
+                    f"DataFrame.{func_name}", format_name, loaded_plugins, io_plugin
                 )
+                delattr(pd.DataFrame, func_name)
+                delattr(pd.Series, func_name)
+            else:
+                setattr(
+                    pd.DataFrame,
+                    func_name,
+                    getattr(io_plugin, "writer"),
+                )
-            setattr(
-                pd.DataFrame,
-                f"to_{format_name}",
-                getattr(io_plugin, "writer"),
-            )
-            setattr(
-                pd.Series,
-                f"to_{format_name}",
-                _create_series_writer_function(format_name),
-            )
+                setattr(
+                    pd.Series,
+                    func_name,
+                    _create_series_writer_function(format_name),
+                )
+
+        loaded_plugins[format_name] = io_plugin