POC of PDEP-9 (I/O plugins) #53005
Conversation
pandas/__init__.py
Outdated
```python
if hasattr(io_plugin, "read"):
    globals()[f"read_{dataframe_io_entry_point.name}"] = io_plugin.read
if hasattr(io_plugin, "write"):
    setattr(DataFrame, f"to_{dataframe_io_entry_point.name}", io_plugin.write)
```
check that dataframe_io_entry_point.name isn't overwriting something?
Yes, good point. Probably some other checks could be useful too. So far my goal was more to show what PDEP-9 would imply in terms of code, as I don't understand all the opposition to what in my opinion is a small change with huge benefits. So I guess an MVP implementation can help understand the implications of the PDEP. But I fully agree we should raise if two installed packages use the same entrypoint name; iirc it's mentioned in the PDEP.
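A minimal sketch of such a guard, assuming a registration helper along the lines of the POC (the helper name and the dict-as-namespace are illustration only, not the actual implementation):

```python
import warnings

def safe_register(namespace, attr_name, func):
    """Bind func under attr_name unless the name already exists;
    warn and keep the existing binding on a conflict."""
    if attr_name in namespace:
        warnings.warn(
            f"{attr_name!r} already exists and will not be overwritten"
        )
        return False
    namespace[attr_name] = func
    return True

# A plain dict stands in for the pandas module namespace:
ns = {"read_csv": lambda path: None}
safe_register(ns, "read_csv", lambda p: p)    # warns, keeps existing read_csv
safe_register(ns, "read_lance", lambda p: p)  # registers the new name
```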
nice!
To be clear, my only opposition to PDEP-9 was in renaming the hugely established `pd.read_csv` (which is also the most visited page in the docs, according to the analytics you sent me). Adding an optional plugin system like this, which allows third-party authors to develop readers/writers, sounds like a net positive.
Thanks for clarifying your position @MarcoGorelli.
I think that was great feedback. While it'd be nice to have a better I/O API IMHO, it's probably not worth the change, and in any case that can be discussed separately, as it's independent and only adds noise to the discussion about plugins.
I was concerned about adding too much to pandas' already huge namespaces, but on second thought, if we eventually move connectors like SAS or BigQuery to third-party projects, most users will probably end up having fewer connectors than now, not more.
A couple other thoughts:
I guess we don't really do either of these for the matplotlib backends, nor do I think pandas necessarily needs to enforce these.
I think users knowing where things come from is a good point. Ideally we'd like to make things as little magic as possible, even for users without a deep understanding of Python. Users can iterate entrypoints in the same way as done in this PR to see what connectors are available in their environment. Afaik entrypoints can provide the name of the module, but not the name of the pip/conda package that added them. I'd personally start simple, but depending on the feedback it could be reasonable to have something along those lines.

For the versioning, I find it out of scope. Compatibility of packages should be handled at the package level (pip/conda dependencies). And if we want to get into the business of recommending users to have newer versions, it'd probably make more sense to start by recommending updates to pandas, or to numpy, if users aren't using the latest versions, and not here, don't you think? @mroeschke

I'll remove the PDEP label; iirc our website will otherwise believe this is a PDEP and publish it in the list of proposals under discussion (or fail because the title doesn't start with PDEP-x).
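The introspection mentioned above could look something like this (the `dataframe.io` group name is the one this POC uses; the helper itself is a sketch, and the `TypeError` fallback accounts for the older `entry_points()` signature on Python < 3.10):

```python
from importlib.metadata import entry_points

def available_connectors(group="dataframe.io"):
    """Return the entrypoint names that installed connectors registered."""
    try:
        eps = entry_points(group=group)      # Python 3.10+ selectable API
    except TypeError:
        eps = entry_points().get(group, [])  # older dict-like interface
    return sorted(ep.name for ep in eps)

print(available_connectors())  # e.g. ['lance'] once a connector is installed
```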
@pandas-dev/pandas-core I updated this POC to show in practice what PDEP-9 proposes regarding Arrow, which based on the feedback received I don't think was clear.

The current approach is that pandas will expect connectors to implement functions that exchange a pandas dataframe. In case a connector doesn't implement those, pandas will fall back to see if the connector can exchange data with a PyArrow object. If PyArrow is installed (it will be, since it'll be a dependency of the connector), then pandas will read the PyArrow table from the connector, cast it to a dataframe, and return it.

This PyArrow fallback option adds value in at least two different ways:
@jbrockmendel @simonjayhawkins @jreback does this seem reasonable to you? I think you were the main people who expressed concerns about using Arrow as an interface. Finally, in this version I added a function to load the plugins explicitly.
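The fallback described above can be sketched as follows. The method names `read` and `read_pyarrow` are assumptions for illustration; `Table.to_pandas()` is the real pyarrow conversion, faked here with a stand-in object so the example runs without pyarrow installed:

```python
from types import SimpleNamespace

def wrap_reader(io_plugin):
    """Prefer a pandas-native reader; otherwise fall back to an
    Arrow-based one and convert the result to a pandas object."""
    if hasattr(io_plugin, "read"):
        return io_plugin.read
    if hasattr(io_plugin, "read_pyarrow"):
        def reader(*args, **kwargs):
            table = io_plugin.read_pyarrow(*args, **kwargs)
            return table.to_pandas()  # pyarrow.Table -> pandas.DataFrame
        return reader
    raise TypeError("connector implements neither interface")

# Faked Arrow-only connector (no pyarrow needed to run this):
fake_table = SimpleNamespace(to_pandas=lambda: "a DataFrame")
plugin = SimpleNamespace(read_pyarrow=lambda path: fake_table)
print(wrap_reader(plugin)("data.lance"))  # → a DataFrame
```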
pandas/io/_plugin_loader.py
Outdated
```python
# the original parameters and not `*args` and `**kwargs`
@functools.wraps(original_reader)
def reader_wrapper(*args, **kwargs):
    result = original_reader(*args, **kwargs)
```
Would it be worth for a third-party reader/writer to have a way of declaring whether we should "pre-process" a path argument? `get_handle` currently does a lot of heavy lifting to unify compression, text/binary, str/file-object, and local/remote files across most IO functions.
Very good point @twoertwein, thanks for bringing this up. I remember there is a library that does all that, in a similar way to what pandas implements. I couldn't find it now, but probably the simplest option is to let connectors use that library and keep this out of the connector API. What do you think?
I have no strong opinion: for end-users it would be nice to know that any `read_*`/`to_*` function accepts compression/file-handles/pathlib/str, but it would also be nice to not make this PDEP more complicated :)
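For a sense of what a connector would have to do on its own if pandas keeps this out of the API, here is a stdlib-only sketch (this is a hypothetical helper, not pandas' `get_handle`, and it only handles a small subset: file objects, str/PathLike, and gzip inferred from the suffix):

```python
import gzip
import os
from pathlib import Path

def normalize_handle(obj, mode="rb"):
    """Accept a file-like object, str, or os.PathLike and return an
    open file handle, inferring gzip compression from a .gz suffix."""
    if hasattr(obj, "read") or hasattr(obj, "write"):
        return obj  # already file-like; caller manages its lifetime
    path = Path(os.fspath(obj))
    if path.suffix == ".gz":
        return gzip.open(path, mode)
    return open(path, mode)
```

Remote URLs, text encodings, and other compression schemes are exactly the extra heavy lifting `get_handle` does that this sketch omits.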
pandas/io/_plugin_loader.py
Outdated
```python
@functools.wraps(original_reader)
def reader_wrapper(*args, **kwargs):
    result = original_reader(*args, **kwargs)
    if exchange_format == "pyarrow":
```
Would it be more generic to convert `result` if it implements the interchange protocol instead? Such that:

- If `original_reader` uses/returns Arrow objects, `original_reader` can also customize how the Arrow objects are converted to `pd.DataFrame` via `*args, **kwargs`
- Maybe a little more future-proof than Arrow as a format?
Yes, that's a very good point @mroeschke. Both PyArrow Table and pandas DataFrame support it, so using the interchange protocol on the pandas side of this connector API seems to solve the problem in a much simpler way.

I don't think there should be performance implications of using it, but maybe it's worth checking that data is not being copied with the current implementations of the interchange protocol.

I updated this PR and the lance-reader repo to use what you suggest, please let me know if this is what you had in mind.
Yes perfect! Thanks
The alternative was to load them at `import pandas`, which doesn't make it possible to disable it via options, since those would be executed when connectors are already loaded. I think making this opt-in initially is a good idea, and if in a future version we want to make plugins load by default...
I'm fine with the current implementation, but I do feel strongly against making this the default with the current behavior of raising when there is a conflict. To me, that is an entirely new level of dependency hell where, say, pip installs just fine but I can't use package X because I have packages Y and Z installed.
More generally, I think requiring users to modify their environment to use this feature should be avoided at all costs. Would it be possible to have an option specifying which ones to skip?
I agree, but I don't see an easy solution to this. The current version doesn't load plugins by default, so if I understand you correctly, you're talking about a future version (e.g. 2.2 or later) where plugins load automatically.

I don't think pandas should know anything about third-party connectors (let me know if you disagree with that). If that's the case, we don't have any way on the pandas side to prevent two connectors using the same name. So, as I see things, these are the options if two installed packages request to use the same name:
While none of the options are perfect, 1 is the only reasonable one to me (regardless of whether we load plugins at import or explicitly, which is probably a separate discussion). What do you think @rhshadrach?
Correct. I'm +1 as long as we either require calling
As expressed in #53005 (review), I do not find this approach viable. Relying on inducing a bad user experience for our users to encourage change in other packages does not seem like a good strategy to me.
While not great, I think this would be okay if users had a way (e.g. a pandas option?) to change which one is loaded. Is this possible?
This could mean installing a package breaks functioning code; -1 here. Even if you leave one named
Sounds good in theory, but I fail to see how users could specify which option to use, either when loading the plugins or as an option later. If we only have the name (e.g. two implementations for
The only thing that comes to mind would be to use PyPI names to enforce uniqueness. Not sure if the entrypoints API allows me to see the package name, or what to do with conda-forge, so probably a bad idea. But that's the only way I can think of.
A fourth option would be to warn when loading conflicting plugins, and only raise when calling the conflicting name (the name will then not be a real one but a shim that raises an exception). There will no doubt be situations where users want to access some plugin but not the conflicting one, so I can see @rhshadrach's point that it will be annoying to get an error in those situations, but warning first and then raising if you actually access it could be a way forward?
Good point. A warning sounds good too. I think it's simpler, instead of raising an exception if the function/method is accessed, to simply not create it. What do you think?
I think that's also possible. The warning text should be clear about what the problem is. One thing I like about raising errors, though, is that it gives a nice entry point into the debugger and lets you go back to where the error occurred. I'm sure that is also possible with warnings (I could google it); my point is just that the workflow in debugging errors is more natural than debugging warnings. Or maybe that's just me, who knows. I could accept either just a warning, or a warning plus an exception if the name is called.
@rhshadrach @topper-123 thanks a lot for the feedback, it was very useful. I updated the PR, and now it only warns if connectors conflict and are being ignored (not registered). I think this is surely better, and I could get the name of the pip/conda package, so the message should be quite informative about what needs to be done. I didn't try it yet, as it's not trivial, but it'd be great to know your feedback on this approach before moving forward.
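The warn-and-skip behavior could look roughly like this (entrypoint objects are faked with `SimpleNamespace`; in the real loader they would come from `importlib.metadata`, and the function name here is an assumption):

```python
import warnings
from types import SimpleNamespace

def register_readers(eps, namespace):
    """Register read_<name> for each entrypoint; on a name conflict,
    warn and make sure the conflicting name is not registered at all."""
    owners = {}
    for ep in eps:
        if ep.name in owners:
            warnings.warn(
                f"entrypoint {ep.name!r} provided by both "
                f"{owners[ep.name]} and {ep.value}; ignoring both"
            )
            namespace.pop(f"read_{ep.name}", None)  # undo first registration
            continue
        owners[ep.name] = ep.value
        namespace[f"read_{ep.name}"] = ep.load
    return namespace

ns = register_readers(
    [SimpleNamespace(name="lance", value="lance_reader", load=lambda: None),
     SimpleNamespace(name="lance", value="other_pkg", load=lambda: None)],
    {},
)
assert "read_lance" not in ns  # conflicting name was never created
```

A limitation of this simple version: a third entrypoint with the same name would re-register it, so a real implementation would need to remember names it has already rejected.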
A few comments on the latest commit. I haven't read the whole PR, but will read it now.
```python
else:
    setattr(
        pd,
        f"read_{format_name}",
```
`f"read_{format_name}"` -> `func_name`.
```python
_warn_conflict(
    f"pandas.{func_name}", format_name, loaded_plugins, io_plugin
)
delattr(pd, func_name)
```
Is there any risk we'll remove anything unintentional?
I don't think that should happen. I'll think about whether we can detect the conflicts before registering anything. It's tricky the way it's implemented now, but it'd be easier if we used separate entrypoints for readers and writers, which might be a good idea anyway.
If the general concept seems fine, happy to improve the implementation.
```python
from importlib.metadata import entry_points
import importlib_metadata
```
Is this an external package? Then make it an (optional) dependency?
It's in the standard library, I think; I'll confirm, in case it was installed in the environment by some other package.
Looks like the stdlib one is `importlib.metadata`, and `importlib_metadata` is the backport - does it work with the stdlib one? Seems it's new in py3.8: https://docs.python.org/3/library/importlib.metadata.html
```python
loaded_plugins = {}

for dataframe_io_entry_point in entry_points().get("dataframe.io", []):
    format_name = dataframe_io_entry_point.name
```
Where does this name get defined? Assuming from the name of the library itself? If so maybe worth making this a property of the class so that there is some flexibility for package authors
This is the name of the entrypoint. Package authors define it explicitly as the name pandas will use in `read_<name>`... It's not used for anything else. The only constraint is that the name Dask, Vaex, Polars... will receive, if they ever use this connector API, will be the same. Personally I think that's good, but I'm not sure whether in some case the same connector would want to use different names in different libraries.
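For illustration, assuming a package named `lance-reader` exposing a plugin object called `LancePlugin` (both names hypothetical; only the `dataframe.io` group name comes from this POC), the author's `pyproject.toml` declaration could look like:

```toml
[project.entry-points."dataframe.io"]
lance = "lance_reader:LancePlugin"
```

With this, `dataframe_io_entry_point.name` would be `"lance"`, so pandas would expose `read_lance` and `DataFrame.to_lance`.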
Is there any validity to one package providing multiple read/write implementations? An example might be excel where one package offers read_xls alongside read_xlsx, etc...
There shouldn't be any limitation on that.
Closing PDEP-9 as rejected, so not moving forward with this. See #51799.
Proof of concept of what's being discussed in #51799.
An example of a plugin is implemented in https://github.com/datapythonista/lance-reader
Both together allow this code:
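The original snippet is not included in this excerpt. To illustrate what the two pieces enable, here is a simulation of the end state using plain Python objects (the `read_lance` name follows from the lance-reader example; everything else is a stand-in, so this runs without pandas or lance installed):

```python
from types import SimpleNamespace

# Stand-ins: `pd` for the pandas module, the lambda for the connector's
# reader that the entrypoint loader would have bound onto pandas.
pd = SimpleNamespace()
setattr(pd, "read_lance", lambda path: f"DataFrame loaded from {path}")

print(pd.read_lance("example.lance"))  # → DataFrame loaded from example.lance
```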