ENH: ListDtype / ListArray #35176
IIRC numba has a TypedList; we could do something like that for the str.split-like ops.
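For context, a minimal sketch of the typed list that comment refers to (this is numba's API, not something pandas exposes, and assumes numba is installed):

```python
from numba.typed import List

# numba's typed list: a homogeneous container whose element type is
# inferred from the first append (here, unicode strings).
tokens = List()
for part in "a,b,c".split(","):
    tokens.append(part)
```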
One of the biggest challenges with a ListType is probably how to differentiate between scalars and list-likes in the pandas API. There are quite a few places where you can pass in both and the API behaviour will be slightly different. In the case of a ListDtype, the scalar is itself a list-like, which makes it harder to decide which code path should be taken. We could probably use the type information to decide whether we actually have a scalar or not, but at the moment this information is sadly not yet present in the dispatching interfaces.
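A small illustration of that ambiguity with today's object dtype (a sketch of the existing behaviour, not the proposed ListDtype API):

```python
import pandas as pd

# Each element of the Series is the list [1, 2].
ser = pd.Series([[1, 2], [1, 2]], dtype=object)

# Is [1, 2] one scalar to compare against every element, or one value per row?
# pandas treats it as an array-like of length 2, so this yields [False, False]
# even though every element equals [1, 2].
print(ser == [1, 2])
```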
Yes, this is indeed a general problem we need to solve in pandas. We have also been running into this with GeoPandas (e.g. #26333), and you already run into corner cases when using iterable elements in object dtype. Other related issues: #27911, #35131. We will probably need some mechanism to let the dtype decide whether some value can be a scalar or not.
For storing list-like data, I think that will be relatively straightforward (either just with pyarrow, or even with the raw Arrow memory layout, which is "just" two arrays: values and offsets). But right now there are not yet many operations or kernels included in Arrow to work on nested data, I think.
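A sketch of that raw layout with pyarrow (pyarrow is assumed to be installed):

```python
import pyarrow as pa

# The Arrow list layout is a flat values array plus an offsets array;
# offsets[i]:offsets[i+1] delimits the i-th list.
values = pa.array([1, 2, 3, 4, 5])
offsets = pa.array([0, 2, 2, 5], type=pa.int32())

arr = pa.ListArray.from_arrays(offsets, values)
print(arr)  # [[1, 2], [], [3, 4, 5]]
```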
Could I request to also consider a pandas extension type for n-dim numpy arrays? It probably strays from the pandas semantics of considering a Series as an array-like of scalars, but for a lot of data analysis work the features are generally aligned along an axis like time and are thus well suited to pandas. However, with >1D features, pandas coerces them to a numpy array of subarray objects, which causes memory usage to explode. A native type for numpy arrays of arbitrary dimensions would be very helpful (and easily compatible with Arrow), even if aggregation ops, etc. are not allowed. There is a ragtag implementation here, mostly copied from other available examples of extension arrays. The failing extension array tests generally have to do with:
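As a side note to the comment above, a minimal illustration of the coercion it describes (the shape here is made up for the example):

```python
import numpy as np
import pandas as pd

# A 2-D block of features: 1000 rows, each a 64-dimensional vector.
features = np.random.rand(1000, 64)

# pandas has no native >1-D element type, so each row becomes a separate
# ndarray object inside an object-dtype Series instead of one typed block.
ser = pd.Series(list(features))
print(ser.dtype)          # object
print(type(ser.iloc[0]))  # <class 'numpy.ndarray'>
```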
For reference: cuDF (a GPU implementation of pandas) now has support for ListDtype (link).
Looks great @JulianWgs, would be great to implement in pandas proper.
@mroeschke can we put this into the "use ArrowDtype" pile? |
Yes definitely |
Since this has functionality via ArrowDtype, additional functionality can be built upon that, so closing.
@mroeschke is that work planned, or is it only in "hypothetically possible to implement" status? |
This functionality is implemented using ArrowDtype.
Thanks for clarifying! For anyone else coming across this thread, it looks like `ArrowDtype` with a pyarrow list type is the way to get this functionality today.
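Concretely, a sketch of what that looks like (assuming pandas 2.x with pyarrow installed):

```python
import pandas as pd
import pyarrow as pa

# A Series backed by a pyarrow list type via ArrowDtype.
ser = pd.Series(
    [[1, 2], [], [3, 4, 5]],
    dtype=pd.ArrowDtype(pa.list_(pa.int64())),
)
print(ser.dtype)  # list<item: int64>[pyarrow]

# Newer pandas versions (2.2+) also expose a .list accessor for these Series,
# e.g. ser.list.len() and ser.list[0].
```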
This issue is for adding a ListDtype. This might be useful on its own, and will be useful for #35169 when we have string operations that return a List of values per scalar element.

I think the primary points to discuss are around

- whether the `value_type` of the List, the `T` in `List[T]`, should be specified by the user
- `list_` and `large_list` types (a sketch of the difference follows below)

xref rapidsai/cudf#5610, where cudf is implementing a ListDtype. Let's chime in over there if we have any thoughts.
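On the second point, the distinction in Arrow is the offset width (a small sketch with pyarrow):

```python
import pyarrow as pa

# list_ stores offsets as int32; large_list stores them as int64,
# allowing much longer child (values) arrays.
small = pa.list_(pa.int64())       # ListType
large = pa.large_list(pa.int64())  # LargeListType
print(small)  # list<item: int64>
print(large)  # large_list<item: int64>
```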