POC of PDEP-9 (I/O plugins) #53005

datapythonista · 2023-04-29T16:47:15Z

Proof of concept of what's being discussed in #51799.

An example of a plugin is implemented in https://github.com/datapythonista/lance-reader

Both together allow this code:

import pandas

pandas.load_io_plugins()

df = pandas.read_lance('sample.lance')
df.to_lance('sample_copy.lance')

jbrockmendel · 2023-04-30T17:35:23Z

pandas/__init__.py

+    if hasattr(io_plugin, "read"):
+        globals()[f"read_{dataframe_io_entry_point.name}"] = io_plugin.read
+    if hasattr(io_plugin, "write"):
+        setattr(DataFrame, f"to_{dataframe_io_entry_point.name}", io_plugin.write)


check that dataframe_io_entry_point.name isn't overwriting something?

Yes, good point. Probably some other checks could be useful too. So far my goal was more to show what PDEP-9 could imply in terms of code, as I don't understand all the opposition for what in my opinion is a small change with huge benefits. So I guess a MVP implementation can help undertand understand what are the implications to the PDEP. But fully agree we should raise if two installed packages use the same entrypoint name, iirc it's mentioned in the PDEP.

nice!

to be clear, my only opposition to pdep9 was in renaming the hugely established pd.read_csv (which is also the most visited page in the docs, according to the analytics you sent me)

adding an optional plugin system like this which allows third-party authors to develop readers/writers sounds like a net positive

Thanks for clarifying your position @MarcoGorelli.

I think that was great feedback. While it'd be nice to have a better I/O API IMHO, probably not worth the change, and in any case, that can be discussed separately, as it's independent and adds noise to the discussion about plugins.

I was concerned of adding to much stuff to the pandas already huge namespaces, but in a second thought, if we eventually move connectors like SAS or BigQuery to third-party projects, most users will probably end up having less connectors than now, not more.

jbrockmendel · 2023-04-30T19:26:11Z

A couple other thoughts:

Do users have a way to tell which things are 3rd party? Many people don't use virtualenvs etc or keep track of what they have installed or why.
When users upgrade a pandas version, can/should we warn or something on existing connectors and tell them to upgrade those too? Or maybe have the connector define a max_(tested|supported)_pandas_version?

mroeschke · 2023-05-01T17:44:42Z

A couple other thoughts:

1. Do users have a way to tell which things are 3rd party?  Many people don't use virtualenvs etc or keep track of what they have installed or why.

2. When users upgrade a pandas version, can/should we warn or something on existing connectors and tell them to upgrade those too?  Or maybe have the connector define a `max_(tested|supported)_pandas_version`?

I guess we don't really do either of these for the matplotlib backends, nor do I think pandas necessarily needs to enforce these.

datapythonista · 2023-05-01T18:16:26Z

I think users knowing where things come from is a good point. Ideally we'd like to make things be as little as magic as possible even for users without a deep understanding of Python. Users can iterate entrypoints in the same way as done in this PR to see what connectors are available in their environtment. Afaik entrypoints can provide the name of the module, but not the name of the pip/conda package that added them. I'd personally start simple, but depending on the feedback could be reasonable to have something like pandas.list_connectors() or whatever to make users life easier.

For the versioning, I find it out of scope. Compatibility of packages should be at the package level (pip/conda dependencies). And if we want to get into the business of recommending users to have newer versions it'd probably make more sense to start by recommending updates to pandas, or to numpy, if users aren't using the latest versions, and not here, don't you think?

@mroeschke I'll remove the PDEP label, iirc our website will believe this is a PDEP, and publish it in the list of proposals under discussion (or fail because the title doesn't start by PDEP-x.

datapythonista · 2023-05-22T08:36:35Z

@pandas-dev/pandas-core I updated this POC to show in practice what PDEP-9 proposes regarding Arrow, which based on the feedback received I don't think it was clear.

The current approach is that pandas will expect connectors to implement pandas_reader and pandas_writer which communicate with pandas with a pandas DataFrame.

In case a connector doesn't implement functions that exchange a pandas dataframe, then pandas will fallback to see if the connector can exchange data with a PyArrow object. If PyArrow is installed (it will be, since it'll be a dependency of the connector), then pandas will read the PyArrow table from the connector, cast it to a dataframe, and return it for the pandas.read_whatever(...) call. This design opens the door to support nanoarrow, a Rust Arrow implementation and others in the future, if that is useful, with minimal changes required in the pandas side.

This PyArrow fallback options adds value in at least two different ways:

It makes the API between connectors and pandas simpler, as an Arrow structure (PyArrow Table) is more concise than a pandas dataframe, which could have custom extension arrays, different backend types...
It makes this I/O interface available to projects that don't want or can't depend on pandas. Possibly increasing its adoption, and having more projects/devs testing and helping maintain connectors

@jbrockmendel @simonjayhawkins @jreback does this seem reasonable to you? I think you were the main people who expressed concerns about using Arrow as an interface.

Finally, in this version I added a function pandas.load_io_plugins() that need to be called to load the I/O connectors. The alternative was to load them at import pandas which doesn't make it possible to disable it via options, which would be executed when connectors are already loaded. I think making this opt-in initially is a good idea, and if in a future version we want to make plugins load by default, the deprecation cycle should be trivial, warning when users call the function so they remove the call.

pandas/io/_plugin_loader.py

twoertwein · 2023-05-22T14:33:19Z

pandas/io/_plugin_loader.py

+    # the original parameters and not `*args` and `**kwargs`
+    @functools.wraps(original_reader)
+    def reader_wrapper(*args, **kwargs):
+        result = original_reader(*args, **kwargs)


Would it be worth for a third-party reader/writer to have a way of declaring whether we should "pre-process" a path argument? get_handle currently does a lot of heavy lifting to unify compression, text/binary, str/file-object, and local/remote files across most IO functions.

Very good point @twoertwein, thanks for bringing this up. I remember there is a library that does all that in a similar way of what pandas implements. I couldn't find it now, but probably the simplest option is to let connectors use that library and keep this out of the connector API. What do you think?

I have no strong opinion: for end-users, it would be nice to know that any read/to functions accepts compression/file-handles/pathlib/str but it would also be nice to not make this PDEP more complicated :)

mroeschke · 2023-05-22T19:19:43Z

pandas/io/_plugin_loader.py

+    @functools.wraps(original_reader)
+    def reader_wrapper(*args, **kwargs):
+        result = original_reader(*args, **kwargs)
+        if exchange_format == "pyarrow":


Would be be more generic to convert result if it implements the interchange protocol instead? Such that

If original_reader uses/returns Arrow objects, original_reader can also customize how the Arrow objects are converted to pd.DataFrame via *args, **kwargs

Maybe a little more future proof to Arrow as a format?

Yes, that's a very good point @mroeschke. Both PyArrow Table and pandas DataFrame support it, so using the interchange protocol in the pandas side of this connectors API seems to solve the problem in a much simpler way.

I don't think there shouldn't be performance implications on using it, but maybe worth to check if data is not being copied with the current implementations of the interchange protocol.

I updated this PR and the lance-reader repo to use what you suggest, please let me know if this is what you had in mind.

Yes perfect! Thanks

rhshadrach

The alternative was to load them at import pandas which doesn't make it possible to disable it via options, which would be executed when connectors are already loaded. I think making this opt-in initially is a good idea, and if in a future version we want to make plugins load by default...

I'm fine with the current implementation, but I do feel strongly against making this the default with the current behavior of raising when there is a conflict. To me, that is an entire new level of dependency hell where say pip installs just fine but I can't use package X because I have package Y and Z installed.

More generally, I think requiring users to modify their environment to use this feature should be avoided at all costs. Would it be possible to have an option specifying which ones to skip?

datapythonista · 2023-05-27T12:10:02Z

I'm fine with the current implementation, but I do feel strongly against making this the default with the current behavior of raising when there is a conflict. To me, that is an entire new level of dependency hell where say pip installs just fine but I can't use package X because I have package Y and Z installed.

More generally, I think requiring users to modify their environment to use this feature should be avoided at all costs. Would it be possible to have an option specifying which ones to skip?

I agree, but I don't see an easy solution to this. The current version doesn't load plugins by default, so if I understand you correctly, you're talking in a future version (e.g. 2.2 or later) if we remove pandas.load_io_plugins() and we call it automatically at import pandas.

I don't think pandas should know anything about third-party connectors (let me know if you disagree with that). If that's the case, we don't have any way in the pandas side to prevent two connectors using the same name. So, as I see things the optionsif two installed packages request to use the name pandas.read_X are:

We raise an exception. This hopefully will cause frustration of the users using the connector that duplicated the name, and the authors will probably change it.
We load one of them with an arbitrary criteria (the first or last to be returned by the Python entrypoints API)
We rename them to something like pandas.read_X_1 and pandas.read_X_2

While none of the options are perfect, 1 is the only reasonable option to me (regardless of whether we load plugins at import or explicitly, that it's probably a separate discussion). What do you think @rhshadrach?

rhshadrach · 2023-05-27T14:17:55Z

so if I understand you correctly, you're talking in a future version (e.g. 2.2 or later) if we remove pandas.load_io_plugins() and we call it automatically at import pandas.

Correct. I'm +1 as long we either require calling pandas.load_io_plugins() or finding a good way to handle conflicting packages. With requiring pandas.load_io_pluging(), can we also have an option to determine what one loads in case of a conflict?

We raise an exception. This hopefully will cause frustration of the users using the connector that duplicated the name, and the authors will probably change it.

As expressed in #53005 (review), I do not find this approach viable. Relying on inducing a bad user experience for our users to encourage change in other packages does not seem like a good strategy to me.

We load one of them with an arbitrary criteria (the first or last to be returned by the Python entrypoints API)

While not great, I think this would be okay if users had a way (e.g. a pandas option?) to change which one is loaded. Is this possible?

We rename them to something like pandas.read_X_1 and pandas.read_X_2

This could mean installing a package breaks functioning code; -1 here. Even if you leave one named pandas.read_X, I think this is still not viable. We do have somewhat similar behavior here: e.g. calling DataFrame.to_excel without specifying the engine argument also depends on what packages are installed (e.g. if you only have openpyxl vs only have xlxswriter installed). But at least I can specify engine="openpyxl" which will then guarantee the desired engine (or raise if not installed). I don't think that is possible here.

datapythonista · 2023-05-27T14:32:28Z

While not great, I think this would be okay if users had a way (e.g. a pandas option?) to change which one is loaded. Is this possible?

Sounds good in theory, but I fail to see how users could specify which option to use, either when loading the plugins, or as an option later. If we only have the name (e.g. two implementations for pandas.read_deltalake) how can I reference one of them individually? We could add an engine value to this API, but I think that creates more problems than solutions:

If we can't expect packages to do the right thing and not duplicate names, why we should expect engines to be unique? We are in the same problem (with a more complex solution)
The current approach assumes all engines of a single format to have the same signature, which I think it's far from optimal. For example, I'd like to have a CSV reader with a schema parameter, but this is not possible in pandas since read_csv(engine='*', ...) has the same signature for all CSV readers.

The only thing that comes to my mind would be to use PyPI names to enforce uniqueness. Not sure if the entrypoints API allows me to see the package name, or what to do with conda-forge, so probably a bad idea. But that's the only way I can think of.

topper-123 · 2023-05-27T18:43:17Z

A fourth option would be to warn when loading conflicting plugins, and only raise when calling the conflicting name (the name will will then not be a real one but a shim that raises an exception).

There will no doubt be situations when want to access some plugin, but not the conflicting one, so I can see @rhshadrach point that it will be annoying to get an error in those situations, but as warning first and the raise if you actually access it could be a way forward?

datapythonista · 2023-05-27T18:46:13Z

Good point. A warning sounds good too. I think it's simpler instead of an exception if the function/method is accessed, to simply not create it. What do you think?

topper-123 · 2023-05-27T19:20:22Z

I think that's also possible. The warning text should be clear about what the problem is.

One thing though I like about raising errors is that it gives a nice entry point into the debugger and let's you go back to where the error occurred. I'm sure that is also possible with warnings, though I could google it, my point is just that the work flow in debugging errors is more natural than debugging warnings. Or that's just me who knows.

I could accept both just a warning or a warning plus exception if the name is called.

datapythonista · 2023-05-28T09:32:30Z

@rhshadrach @topper-123 thanks a lot for the feedback, it was very useful. I updated the PR, and now it only warns if connectors conflict and are being ignored (not registered). I think this is surely better, and I could get the name of the pip/conda package, so the error message should be quite informative on what needs to be done.

I didn't try it yet, as it's not trivial, but it'd be great to know your feedback on this approach before moving forward.

topper-123

A few comments on the latest commit. I haven't read the whole PR, but will read it now.

topper-123 · 2023-05-28T10:54:52Z

pandas/io/_plugin_loader.py

+            else:
+                setattr(
+                    pd,
+                    f"read_{format_name}",


f"read_{format_name}" -> func_name.

topper-123 · 2023-05-28T10:55:44Z

pandas/io/_plugin_loader.py

+                _warn_conflict(
+                    f"pandas.{func_name}", format_name, loaded_plugins, io_plugin
+                )
+                delattr(pd, func_name)


Is there any risk we'll remove anything unintentional?

I don't think it should happen. I'll think if we can detect the conflicts before registering anything. I think it's tricky the way it's implemented now, but I think it's easier if we use separate entrypoints for readers and writers, which can be a good idea.

If the general concept seems fine, happy to improve the implementation.

topper-123 · 2023-05-28T10:58:14Z

pandas/io/_plugin_loader.py

 from importlib.metadata import entry_points
+import importlib_metadata


Is this an external package? Then make it an (optional) dependency?

It's a standard library I think, I'll confirm, in case it was installed in the environment by some other package.

looks like the stdlib one is importlib.metadata, and importlib_metadata` is the backport - does it work with the stdlib one? seems it's new in py3.8 https://docs.python.org/3/library/importlib.metadata.html

WillAyd · 2023-05-30T19:41:21Z

pandas/io/_plugin_loader.py

+    loaded_plugins = {}
+
+    for dataframe_io_entry_point in entry_points().get("dataframe.io", []):
+        format_name = dataframe_io_entry_point.name


Where does this name get defined? Assuming from the name of the library itself? If so maybe worth making this a property of the class so that there is some flexibility for package authors

This is the name of the entrypoint. Package authors define it explicitly as the name pandas will use in read_<name>... It's not use for anything else. The only constrain is that the name Dask, Vaex, Polars... Will receive if they ever use this connector API will be the same. Personally I think that's good, but not sure if for any case the same connector would want to use different names in different libraries.

Is there any validity to one package providing multiple read/write implementations? An example might be excel where one package offers read_xls alongside read_xlsx, etc...

There shouldn't be any limitation about that

datapythonista · 2023-06-13T12:27:32Z

Closing PDEP-9 as rejected, so not moving forward with this. See #51799.

POC of PDEP-9 (I/O plugins)

c0d0115

datapythonista mentioned this pull request Apr 29, 2023

PDEP-9: Allow third-party projects to register pandas connectors with a standard API #51799

Merged

MarcoGorelli self-requested a review April 30, 2023 07:58

jbrockmendel reviewed Apr 30, 2023

View reviewed changes

mroeschke added IO Data IO issues that don't fit into a more specific label PDEP pandas enhancement proposal labels May 1, 2023

datapythonista removed the PDEP pandas enhancement proposal label May 1, 2023

datapythonista added 5 commits May 22, 2023 10:58

Implementing the POC with a pyarrow fallback as the connector protocol

91da43a

Black+isort

67a69a9

Adding docstring and black

2439ed9

Minor fixes to __init__.py

2b0e13f

raising if to_ method exists

000ea21

twoertwein reviewed May 22, 2023

View reviewed changes

pandas/io/_plugin_loader.py Outdated Show resolved Hide resolved

twoertwein reviewed May 22, 2023

View reviewed changes

mroeschke reviewed May 22, 2023

View reviewed changes

rhshadrach reviewed May 22, 2023

View reviewed changes

Use the dataframe interchange protocol instead

59b0c3a

datapythonista added 2 commits May 28, 2023 12:43

Merge remote-tracking branch 'upstream/main' into pdep9_impl

b511fe4

Warning on conflict

51f7588

topper-123 reviewed May 28, 2023

View reviewed changes

WillAyd reviewed May 30, 2023

View reviewed changes

datapythonista closed this Jun 13, 2023

dimastbk mentioned this pull request Sep 4, 2023

ENH: add calamine excel reader (close #50395) #54998

Merged

5 tasks

		from importlib.metadata import entry_points
		import importlib_metadata

Uh oh!

POC of PDEP-9 (I/O plugins) #53005

POC of PDEP-9 (I/O plugins) #53005

Uh oh!

Conversation

datapythonista commented Apr 29, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jbrockmendel commented Apr 30, 2023

Uh oh!

mroeschke commented May 1, 2023

Uh oh!

datapythonista commented May 1, 2023

Uh oh!

datapythonista commented May 22, 2023

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rhshadrach left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

datapythonista commented May 27, 2023

Uh oh!

rhshadrach commented May 27, 2023

Uh oh!

datapythonista commented May 27, 2023

Uh oh!

topper-123 commented May 27, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

datapythonista commented May 27, 2023

Uh oh!

topper-123 commented May 27, 2023

Uh oh!

datapythonista commented May 28, 2023

Uh oh!

topper-123 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

datapythonista commented Jun 13, 2023

datapythonista commented Apr 29, 2023 •

edited

Loading

rhshadrach left a comment •

edited

Loading

topper-123 commented May 27, 2023 •

edited

Loading