Skip to content

ENH: Serialize view of ArrowStringArray #42600

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
mrocklin opened this issue Jul 19, 2021 · 4 comments · Fixed by #49078
Closed

ENH: Serialize view of ArrowStringArray #42600

mrocklin opened this issue Jul 19, 2021 · 4 comments · Fixed by #49078
Labels
Arrow pyarrow functionality Enhancement IO Parquet parquet, feather IO Pickle read_pickle, to_pickle Strings String extension data type and string data

Comments

@mrocklin
Copy link
Contributor

Currently Pandas serializes views of ArrowStringArrays by serailizing the whole thing, rather than a subset. Here is an example:

In [1]: import pandas as pd

In [2]: s = pd.Series([c * 1000 for c in "abcdefghijklmnopqrstuvwxyz"])

In [3]: s
Out[3]: 
0     aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa...
1     bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb...
2     cccccccccccccccccccccccccccccccccccccccccccccc...
3     dddddddddddddddddddddddddddddddddddddddddddddd...
4     eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee...
5     ffffffffffffffffffffffffffffffffffffffffffffff...
6     gggggggggggggggggggggggggggggggggggggggggggggg...
7     hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh...
8     iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii...
9     jjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjj...
10    kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk...
11    llllllllllllllllllllllllllllllllllllllllllllll...
12    mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm...
13    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn...
14    oooooooooooooooooooooooooooooooooooooooooooooo...
15    pppppppppppppppppppppppppppppppppppppppppppppp...
16    qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq...
17    rrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr...
18    ssssssssssssssssssssssssssssssssssssssssssssss...
19    tttttttttttttttttttttttttttttttttttttttttttttt...
20    uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuu...
21    vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv...
22    wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww...
23    xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...
24    yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy...
25    zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz...
dtype: object

In [4]: import pickle

In [5]: len(pickle.dumps(s))
Out[5]: 26758

In [6]: len(pickle.dumps(s.astype("string[pyarrow]")))
Out[6]: 26891

In [7]: len(pickle.dumps(s.head(5)))
Out[7]: 5632

In [8]: len(pickle.dumps(s.astype("string[pyarrow]").head(5)))
Out[8]: 26891

This negatively affects dask dataframe operations that cut up pandas dataframes into small pieces, moves them around to different computers, and then pieces them back together again.

@mrocklin mrocklin added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 19, 2021
@TomAugspurger
Copy link
Contributor

This is also an upstream pyarrow issue:

In [1]: import pyarrow as pa

In [2]: import pickle

In [3]: a = pa.array([c * 1000 for c in "abcdefghijklmnopqrstuvwxyz"])

In [4]: b = a[:5]

In [5]: len(pickle.dumps(a))
Out[5]: 26243

In [6]: len(pickle.dumps(a[:5]))
Out[6]: 26243

Seems that @maartenbreddels already noticed this and reported it upstream: https://issues.apache.org/jira/browse/ARROW-10739. It's nice sharing the same underlying technologies :)

I'm going to close this, since I don't think pandas could (easily) work around it and any fix would be better on the pyarrow side (cc @jorisvandenbossche, a bit more motivation if you have a chance to work on that issue).

@TomAugspurger
Copy link
Contributor

Oh, I see that the workaround in Python isn't too bad. Something like

pa.concat_arrays([self._data])

in the __reduce__ for any arrow-backed array. We can consider implementing that.

@TomAugspurger TomAugspurger reopened this Jul 20, 2021
@maartenbreddels
Copy link

Yes, that's the fastest version/workaround I've found.

@maartenbreddels
Copy link

I am actually using a different version right now: https://github.com/vaexio/vaex/blob/caed2cf106007c6a0141a02a5dfdf823fa38799e/packages/vaex-core/vaex/arrow/convert.py#L260

@simonjayhawkins simonjayhawkins added IO Parquet parquet, feather Strings String extension data type and string data IO Pickle read_pickle, to_pickle and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 22, 2021
@mroeschke mroeschke added the Arrow pyarrow functionality label Oct 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Arrow pyarrow functionality Enhancement IO Parquet parquet, feather IO Pickle read_pickle, to_pickle Strings String extension data type and string data
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants