ENH: ListDtype / ListArray #35176
IIRC numba has a TypedList; we could do something like that for the str.split-like ops.
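For context, a minimal sketch of the typed list that comment refers to (this is numba's API, not something pandas exposes, and assumes numba is installed):

```python
from numba.typed import List

# numba's typed list: a homogeneous container whose element type is
# inferred from the first append (here, unicode strings).
tokens = List()
for part in "a,b,c".split(","):
    tokens.append(part)
```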
One of the biggest challenges with a ListType is probably how to differentiate between scalars and list-likes in the pandas API. There are quite a few places where you can pass in both and the API behaviour will be slightly different. In the case of a ListDtype, the scalar is itself a list-like, which makes it harder to decide which code path should be taken. We could probably use the type information to decide whether we actually have a scalar or not, but at the moment this information is sadly not yet present in the dispatching interfaces.
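A small illustration of that ambiguity with today's object dtype (a sketch of the existing behaviour, not the proposed ListDtype API):

```python
import pandas as pd

# Each element of the Series is the list [1, 2].
ser = pd.Series([[1, 2], [1, 2]], dtype=object)

# Is [1, 2] one scalar to compare against every element, or one value per row?
# pandas treats it as an array-like of length 2, so this yields [False, False]
# even though every element equals [1, 2].
print(ser == [1, 2])
```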
Yes, this is indeed a general problem we need to solve in pandas. We have also been running into this with GeoPandas (e.g. #26333), and you already run into corner cases when using iterable elements in object dtype. Other related issues: #27911, #35131. We will probably need some mechanism to let the dtype decide whether some value can be a scalar or not.
For storing list-like data, I think that will be relatively straightforward (either just with pyarrow, or even with the raw Arrow memory layout, which is "just" two arrays: values and offsets). But right now there are not yet many operations or kernels included in Arrow to work on nested data, I think.
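A sketch of that raw layout with pyarrow (pyarrow is assumed to be installed):

```python
import pyarrow as pa

# The Arrow list layout is a flat values array plus an offsets array;
# offsets[i]:offsets[i+1] delimits the i-th list.
values = pa.array([1, 2, 3, 4, 5])
offsets = pa.array([0, 2, 2, 5], type=pa.int32())

arr = pa.ListArray.from_arrays(offsets, values)
print(arr)  # [[1, 2], [], [3, 4, 5]]
```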
Could I request to also consider a pandas extension type for n-dim numpy arrays? It probably strays from the pandas semantics of considering a Series as an array-like of scalars, but for a lot of data analysis work the features are generally aligned along an axis like time and are thus well suited to pandas. However, with >1D features, pandas coerces them to a numpy array of subarray objects, which causes memory usage to explode. A native type for numpy arrays of arbitrary dimensions would be very helpful (and easily compatible with Arrow), even if aggregation ops, etc. are not allowed. There is a ragtag implementation here, mostly copied from other available examples of extension arrays. The failing extension array tests generally have to do with:
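As a side note to the comment above, a minimal illustration of the coercion it describes (the shape here is made up for the example):

```python
import numpy as np
import pandas as pd

# A 2-D block of features: 1000 rows, each a 64-dimensional vector.
features = np.random.rand(1000, 64)

# pandas has no native >1-D element type, so each row becomes a separate
# ndarray object inside an object-dtype Series instead of one typed block.
ser = pd.Series(list(features))
print(ser.dtype)          # object
print(type(ser.iloc[0]))  # <class 'numpy.ndarray'>
```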
For reference: cuDF (a GPU implementation of pandas) now has support for ListDtype (link).
Looks great @JulianWgs, would be great to implement in pandas proper.
@mroeschke can we put this into the "use ArrowDtype" pile? |
Yes definitely |
Since this has functionality via ArrowDtype, additional functionality can be built upon that, so closing.
@mroeschke is that work planned, or is it only in "hypothetically possible to implement" status? |
This functionality is implemented using ArrowDtype.
Thanks for clarifying! For anyone else coming across this thread, it looks like `ArrowDtype` with a pyarrow list type is the way to get this functionality today.
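Concretely, a sketch of what that looks like (assuming pandas 2.x with pyarrow installed):

```python
import pandas as pd
import pyarrow as pa

# A Series backed by a pyarrow list type via ArrowDtype.
ser = pd.Series(
    [[1, 2], [], [3, 4, 5]],
    dtype=pd.ArrowDtype(pa.list_(pa.int64())),
)
print(ser.dtype)  # list<item: int64>[pyarrow]

# Newer pandas versions (2.2+) also expose a .list accessor for these Series,
# e.g. ser.list.len() and ser.list[0].
```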
This issue is for adding a ListDtype. This might be useful on its own, and will be useful for #35169 when we have string operations that return a List of values per scalar element.

I think the primary points to discuss are around

- whether the `value_type` of the List, the `T` in `List[T]`, should be specified by the user
- `list_` and `large_list` types (a sketch of the difference follows below)

xref rapidsai/cudf#5610, where cudf is implementing a ListDtype. Let's chime in over there if we have any thoughts.
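On the second point, the distinction in Arrow is the offset width (a small sketch with pyarrow):

```python
import pyarrow as pa

# list_ stores offsets as int32; large_list stores them as int64,
# allowing much longer child (values) arrays.
small = pa.list_(pa.int64())       # ListType
large = pa.large_list(pa.int64())  # LargeListType
print(small)  # list<item: int64>
print(large)  # large_list<item: int64>
```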