ENH: Support explode for ArrowDtype #53373

csubhodeep · 2023-05-24T16:49:32Z

Feature Type

Adding new functionality to pandas
Changing existing functionality in pandas
Removing existing functionality in pandas

Problem Description

I would like an exception to be raised for functions that are not yet supported on pyarrow-backed dtypes. For instance, in the snippet below I call the explode method on two DataFrames, one with pyarrow-backed dtypes and the other with numpy-backed dtypes.

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

pydict = {'x': [[1,2], [3]], 'y': ['a', 'b']}
table = pa.Table.from_pydict(pydict)
print(table.schema)

x: list<item: int64>
  child 0, item: int64
y: string

pq.write_table(table, 'test.parquet')
df_pa = pd.read_parquet('test.parquet', dtype_backend='pyarrow')
df = pd.read_parquet('test.parquet')

I create the DataFrames by reading from the parquet file because, when building them in memory, the convert_dtypes method casts column x to object.

If I check the schema:

df_pa.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype                     
---  ------  --------------  -----                     
 0   x       2 non-null      list<item: int64>[pyarrow]
 1   y       2 non-null      string[pyarrow]           
dtypes: list<item: int64>[pyarrow](1), string[pyarrow](1)
memory usage: 170.0 bytes

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   x       2 non-null      object
 1   y       2 non-null      object
dtypes: object(2)
memory usage: 160.0+ bytes

If I try to extract one element from each

df_pa.iloc[0].values

array([list([1, 2]), 'a'], dtype=object)

df.iloc[0].values

array([array([1, 2]), 'a'], dtype=object)

I do note how the dtype is different in both.

Finally, if I try to call the explode method on:

The numpy backed dtype

df.explode('x')

Output

The pyarrow backed dtype

df_pa.explode('x')

Output (without any error)

       x  y
0  [1 2]  a
1    [3]  b

So basically, this was not the result I was expecting.

P.S. I was not sure if this would rather qualify as a bug.

Platform details

Platform:    macOS-13.3.1-arm64-arm-64bit
Python:      3.9.16 (main, May 16 2023, 20:00:19) 
[Clang 14.0.3 (clang-1403.0.22.14.1)]
numpy:       1.24.3
pandas:      2.0.1
pyarrow:     10.0.1

Feature Description

I don't know enough about the internals of pandas to propose anything here.

Alternative Solutions

For the functions that are not natively supported for pyarrow-backed DataFrames, could we internally convert them to the numpy-backed equivalent, perform the operation, and convert them back? In this case a warning message would make sense.

But again, since I don't know whether such a makeshift solution meets the standards for performance, API design, etc., I cannot make a concrete suggestion here.

Additional Context

No response


csubhodeep · 2023-05-31T08:37:17Z

Also please confirm if the operations listed here are the only ones that are supported for the pyarrow backed dtypes. Thanks ! :-)

phofl · 2023-06-01T03:04:46Z

-1 on raising, falling back to numpy is a suitable alternative in a bunch of scenarios for now

csubhodeep · 2023-06-01T08:44:42Z

-1 on raising, falling back to numpy is a suitable alternative in a bunch of scenarios for now

@phofl If I understood you correctly you mean that we should rather refrain from using pyarrow backed dtypes in most (if not all) operations?

phofl · 2023-06-01T14:22:50Z

No, we are using numpy internally in a couple of methods right now because there isn't an Arrow implementation available yet, and that is fine since the conversion is cheap there

lithomas1 · 2023-06-02T22:02:28Z

Is it just explode where functionality is missing?

I think most of the methods should have Arrow implementations, so it's probably easier to just fix the ones that don't that can be fixed.

csubhodeep · 2023-06-03T19:21:19Z

@lithomas1

Is it just explode where functionality is missing?

Sorry, but I haven't checked all of them.

mroeschke · 2023-06-13T22:18:47Z

Going to change this issue to be explode specific. I don't think it's really feasible to warn on every method that doesn't support Arrow right now, and the time is better spent supporting the method instead

lukemanley · 2023-06-30T01:47:16Z

Going to change this issue to be explode specific.

@mroeschke - With the scope reduced to explode, would you say #53602 closes it?

mroeschke · 2023-06-30T17:44:53Z

Ah right I believe #53602 closes this

csubhodeep added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels May 24, 2023

lithomas1 added Arrow pyarrow functionality and removed Needs Triage Issue that has not been reviewed by a pandas team member Enhancement labels Jun 2, 2023

csubhodeep changed the title ~~ENH: Would be nice to have a warning message when using functions that are not support on pyarrow dtypes~~ ENH: Would be nice to have a warning message when using functions that are not supported on pyarrow dtypes Jun 4, 2023

mroeschke changed the title ~~ENH: Would be nice to have a warning message when using functions that are not supported on pyarrow dtypes~~ ENH: Support explode for ArrowDtype Jun 13, 2023

mroeschke added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Jun 13, 2023

mroeschke closed this as completed Jun 30, 2023
