ENH: Extending EAs #51471
Comments
Although a large undertaking, I am generally sensing too that an extension "thing that manages EAs" is necessary if we want to encourage other libraries that want to leverage the pandas API but have a different execution paradigm to use the pandas extension mechanism rather than reinvent the API. I frame it as a "thing that manages EAs", potentially a separate object, because I'm skeptical that an array abstraction alone is enough to satisfy the different execution paradigms those libraries are after.
What if we created a […] In the docs for […]
I'm not clear on what problem this is aimed at solving.
I'm not opposed to this, but want to squeeze as much mileage as we reasonably can out of EA before going down that path.
The idea here is that the cuDF maintainers could use this on GPUs, because we'd have a "limited function" EA that wouldn't support all of pandas, but "enough" of pandas for their library. Again, not sure of this, because I don't have familiarity with cuDF.
I think expanding EAs to handle these cases is the logical thing to do, otherwise we won't get as much out of it as we could and it won't be really useful to third party packages. |
Generally a good idea. I can't comment on whether a GPUEA or DistributedEA is feasible, but if they are, then this looks to me like the way to go. There's already an intrinsic motivation within pandas to improve support for EAs, so it should be a safe bet for other libraries to spend time implementing extensions if they want to leverage the pandas API.
This seems to me like a good formulation of a key goal that other dataframe libraries have. There's a significant issue with ExtensionArrays as far as I can see, though. They are too low-level, meaning you get this kind of code execution:

1. The user calls a public pandas API method.
2. The pandas internals run their algorithm.
3. The ExtensionArray's methods are called for the low-level operations.
Going through the pandas internals (step 2), which are optimized for performance on CPU and for pandas itself, may not work for other libraries: algorithms may be very slow, or may need different logic to run on a GPU, for example. Inside the pandas internals, you really don't want constraints that can affect the performance characteristics of another library.

We had similar conversations in NumPy around dispatching on the public API. Separating the API from the implementation was a key concept, exactly to avoid this type of thing (internals "leaking out"). I made some slides about that; [key diagram omitted].

ExtensionArray seems like a nice abstraction for extending pandas itself, in particular with new dtypes and domain-specific things like IP addresses (CyberPandas). But it seems to me like the wrong abstraction when one wants the API and semantics unchanged but different performance characteristics. And cuDF, Dask et al. are about the latter: they need to be in control of everything that happens right after the user calls the public API.
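To make the "separate the API from the implementation" idea concrete, here is a minimal pure-Python sketch of dispatch on a public function. All names here (`register_backend`, `groupby_labels`, `CpuArray`) are invented for illustration; this is not a real pandas or NumPy API, just the shape of the pattern:

```python
# Hypothetical sketch: the public function only dispatches; each backend
# registers its own implementation, so pandas internals never run for
# array types that opt out of them.

_BACKENDS = {}

def register_backend(array_type, impls):
    """Associate an array type with a dict of implementation functions."""
    _BACKENDS[array_type] = impls

def groupby_labels(values):
    """Public API entry point: route to whichever backend owns `values`."""
    impls = _BACKENDS.get(type(values))
    if impls is None:
        raise TypeError(f"no backend registered for {type(values).__name__}")
    return impls["groupby_labels"](values)

# A plain-Python "CPU" backend; a GPU library would register its own
# implementation that never leaves device memory.
def _cpu_groupby_labels(values):
    seen = {}
    return [seen.setdefault(v, len(seen)) for v in values]

class CpuArray(list):
    """Toy stand-in for a concrete array type."""

register_backend(CpuArray, {"groupby_labels": _cpu_groupby_labels})
```

NumPy's `__array_function__` protocol (NEP 18) follows this same idea at the library level: the public API stays fixed while the implementation is chosen by the array type.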
I think right now we are too tightly bound to NumPy to really allow extension array authors to provide their own implementation. I think we need to go through a full iteration of using Arrow as a complete backend, irrespective of NumPy, to learn how we can really make that generic and extensible to third parties.
On a call yesterday with some of the cuDF maintainers, the question came up of why they haven't implemented an ExtensionArray. They pointed to operations where we convert* to numpy (which is very expensive for their hypothetical EA), in particular groupby construction and merge.
* Not actually doing EA.to_numpy(), but having EA.factorize or EA.argsort return ndarrays in these cases means moving everything from a GPU to CPU. Potential modin or dask distributed EAs would have analogous pain points.
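The cost described above can be illustrated with a toy model. Everything here is hypothetical (`FakeDeviceArray` is a stand-in, not cuDF code): the point is that if the interface requires host-memory label arrays, every `factorize` call forces a device-to-host copy even when the rest of the pipeline could stay on-device:

```python
# Toy illustration of the pain point: an interface that demands
# host-memory results from factorize() forces a transfer per call.

class FakeDeviceArray:
    """Hypothetical stand-in for a GPU-resident ExtensionArray."""
    transfers = 0  # counts simulated device -> host copies

    def __init__(self, data):
        self._data = list(data)  # pretend this lives in device memory

    def to_host(self):
        # In a real library this would be an expensive GPU -> CPU copy.
        FakeDeviceArray.transfers += 1
        return list(self._data)

    def factorize(self):
        # The interface requires host labels, so we must transfer first,
        # even if the caller would immediately hand the labels back to
        # a device-side groupby kernel.
        host = self.to_host()
        seen = {}
        codes = [seen.setdefault(v, len(seen)) for v in host]
        return codes, list(seen)
```

An interface that let the EA return labels in its own memory space would avoid the transfer entirely.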
I said something to the effect of: "if you implemented EAs, the pandas team would be very-much on-board with helping make sure it worked". In retrospect I should have spoken only for myself, so want to ask: how do folks feel about extending the EA interface in order to make GPU/Distributed EAs viable? cc @pandas-dev/pandas-core
Some thoughts on what this might entail:
- groupby construction produces an `ndarray[intp]` of labels assigning each row (focusing only on axis=0) to a group. When you do `df.groupby(col)`, it is mostly `df[col].factorize(sort=False)`, available as `BaseGrouper.group_info[0]`. So any place that currently uses that would need to be adapted, e.g. the `ids` keyword in `_groupby_op` introduced in REF: let EAs override WrappedCythonOp groupby implementations #51166.
- merge code I haven't looked into as closely.
- IndexEngine - we have a non-performant EA engine and a performant MaskedEngine. In principle we could allow EAs to bring their own.
- Window - no idea what this would take.
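The label construction in the first bullet can be sketched in pure Python. The helper names below (`factorize_unsorted`, `group_rows`) are our own for illustration, not pandas functions, but the logic mirrors what `factorize(sort=False)` produces: codes in order of first appearance, which then partition the rows into groups:

```python
# Minimal sketch of groupby label construction: df.groupby(col) is
# mostly a factorize(sort=False) on the key column.

def factorize_unsorted(values):
    """Return (codes, uniques) with uniques in order of first appearance."""
    seen = {}
    codes = [seen.setdefault(v, len(seen)) for v in values]
    return codes, list(seen)

def group_rows(keys):
    """Map each unique key to the row positions belonging to its group."""
    codes, uniques = factorize_unsorted(keys)
    groups = {u: [] for u in uniques}
    for row, code in enumerate(codes):
        groups[uniques[code]].append(row)
    return codes, groups
```

Any place in pandas that consumes the `codes` array as an `ndarray[intp]` is a place a GPU or distributed EA would be forced off-device, which is why those consumers would need to be adapted.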
Some potential reasons not to do this:
- In the groupby case in particular, the data locality (either GPU or distributed) needs to be the same for your group labels and each of your columns if you want to be performant, i.e. your columns need to be all-GPU or all-distributed. Maybe EAs aren't the right abstraction for that?
- Do we draw the line somewhere? Plotting? I/O?
- Early on we wanted to keep the EA namespace limited. This could make it significantly larger.