ENH: Support explode for ArrowDtype #53373

Closed
1 of 3 tasks
csubhodeep opened this issue May 24, 2023 · 9 comments
Labels
Arrow pyarrow functionality Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Comments

@csubhodeep

csubhodeep commented May 24, 2023

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

I would like an exception to be raised for functions that are not yet supported. For instance, in the snippet below I call the explode method on two DataFrames, one with pyarrow-backed data types and the other with numpy-backed ones.

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

pydict = {'x': [[1, 2], [3]], 'y': ['a', 'b']}
table = pa.Table.from_pydict(pydict)
print(table.schema)
# x: list<item: int64>
#   child 0, item: int64
# y: string
pq.write_table(table, 'test.parquet')
df_pa = pd.read_parquet('test.parquet', dtype_backend='pyarrow')  # pyarrow-backed dtypes
df = pd.read_parquet('test.parquet')  # default numpy-backed dtypes

I create the DataFrames by reading from a parquet file because, when I build them in memory and call the convert_dtypes method, column x is cast to object.

If I check the schema of each:

df_pa.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype                     
---  ------  --------------  -----                     
 0   x       2 non-null      list<item: int64>[pyarrow]
 1   y       2 non-null      string[pyarrow]           
dtypes: list<item: int64>[pyarrow](1), string[pyarrow](1)
memory usage: 170.0 bytes
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   x       2 non-null      object
 1   y       2 non-null      object
dtypes: object(2)
memory usage: 160.0+ bytes

If I extract the first row from each:

df_pa.iloc[0].values
array([list([1, 2]), 'a'], dtype=object)
df.iloc[0].values
array([array([1, 2]), 'a'], dtype=object)

Note that the type of the x element differs: a Python list in the pyarrow-backed frame and a numpy array in the numpy-backed one.

Finally, if I call the explode method on:

  1. The numpy-backed dtype
df.explode('x')

Output

   x  y
0  1  a
0  2  a
1  3  b
  2. The pyarrow-backed dtype
df_pa.explode('x')

Output (without any error)

       x  y
0  [1 2]  a
1    [3]  b

This is not the result I expected: the pyarrow-backed column comes back unexploded, and no error or warning is raised.

P.S. I was not sure if this would rather qualify as a bug.

Platform details

Platform:    macOS-13.3.1-arm64-arm-64bit
Python:      3.9.16 (main, May 16 2023, 20:00:19) 
[Clang 14.0.3 (clang-1403.0.22.14.1)]
numpy:       1.24.3
pandas:      2.0.1
pyarrow:     10.0.1

Feature Description

I don't know enough about the internals of pandas to propose anything here.

Alternative Solutions

For functions that are not natively supported on pyarrow-backed DataFrames, could pandas internally convert the data to the numpy-backed equivalent, perform the operation, and convert the result back? In that case, a warning message would make sense.

But again, since I don't know whether such a makeshift solution meets the standards for performance, API design, etc., I cannot make a concrete suggestion here.

Additional Context

No response

@csubhodeep csubhodeep added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels May 24, 2023
@csubhodeep
Author

Also please confirm if the operations listed here are the only ones that are supported for the pyarrow backed dtypes. Thanks ! :-)

@phofl
Member

phofl commented Jun 1, 2023

-1 on raising, falling back to numpy is a suitable alternative in a bunch of scenarios for now

@csubhodeep
Author

csubhodeep commented Jun 1, 2023

> -1 on raising, falling back to numpy is a suitable alternative in a bunch of scenarios for now

@phofl If I understood you correctly, you mean that we should refrain from using pyarrow-backed dtypes in most (if not all) operations?

@phofl
Member

phofl commented Jun 1, 2023

No, we are using numpy internally in a couple of methods right now because there isn't an Arrow implementation available yet, and that is fine since the conversion is cheap there.

@lithomas1 lithomas1 added Arrow pyarrow functionality and removed Needs Triage Issue that has not been reviewed by a pandas team member Enhancement labels Jun 2, 2023
@lithomas1
Member

Is it just explode where functionality is missing?

I think most of the methods should have Arrow implementations, so it's probably easier to just fix the ones that don't that can be fixed.

@csubhodeep
Author

@lithomas1

> Is it just explode where functionality is missing?

Sorry, but I haven't checked all of them.

@csubhodeep csubhodeep changed the title ENH: Would be nice to have a warning message when using functions that are not support on pyarrow dtypes ENH: Would be nice to have a warning message when using functions that are not supported on pyarrow dtypes Jun 4, 2023
@mroeschke
Member

Going to change this issue to be explode specific. I don't think it's really feasible to warn on every method that doesn't support Arrow right now, and the time is better spent supporting the method instead.

@mroeschke mroeschke changed the title ENH: Would be nice to have a warning message when using functions that are not supported on pyarrow dtypes ENH: Support explode for ArrowDtype Jun 13, 2023
@mroeschke mroeschke added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Jun 13, 2023
@lukemanley
Member

> Going to change this issue to be explode specific.

@mroeschke - With the scope reduced to explode, would you say #53602 closes it?

@mroeschke
Member

Ah right I believe #53602 closes this


5 participants