ENH: Serialize view of ArrowStringArray #42600

mrocklin · 2021-07-19T00:32:53Z

Currently pandas serializes a view of an ArrowStringArray by serializing the entire underlying array rather than just the viewed subset. Here is an example:

In [1]: import pandas as pd

In [2]: s = pd.Series([c * 1000 for c in "abcdefghijklmnopqrstuvwxyz"])

In [3]: s
Out[3]: 
0     aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa...
1     bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb...
2     cccccccccccccccccccccccccccccccccccccccccccccc...
3     dddddddddddddddddddddddddddddddddddddddddddddd...
4     eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee...
5     ffffffffffffffffffffffffffffffffffffffffffffff...
6     gggggggggggggggggggggggggggggggggggggggggggggg...
7     hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh...
8     iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii...
9     jjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjj...
10    kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk...
11    llllllllllllllllllllllllllllllllllllllllllllll...
12    mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm...
13    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn...
14    oooooooooooooooooooooooooooooooooooooooooooooo...
15    pppppppppppppppppppppppppppppppppppppppppppppp...
16    qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq...
17    rrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr...
18    ssssssssssssssssssssssssssssssssssssssssssssss...
19    tttttttttttttttttttttttttttttttttttttttttttttt...
20    uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuu...
21    vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv...
22    wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww...
23    xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...
24    yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy...
25    zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz...
dtype: object

In [4]: import pickle

In [5]: len(pickle.dumps(s))
Out[5]: 26758

In [6]: len(pickle.dumps(s.astype("string[pyarrow]")))
Out[6]: 26891

In [7]: len(pickle.dumps(s.head(5)))
Out[7]: 5632

In [8]: len(pickle.dumps(s.astype("string[pyarrow]").head(5)))
Out[8]: 26891

This negatively affects dask dataframe operations, which cut pandas dataframes into small pieces, move them between machines, and then piece them back together again.

TomAugspurger · 2021-07-20T12:28:34Z

This is also an upstream pyarrow issue:

In [1]: import pyarrow as pa

In [2]: import pickle

In [3]: a = pa.array([c * 1000 for c in "abcdefghijklmnopqrstuvwxyz"])

In [4]: b = a[:5]

In [5]: len(pickle.dumps(a))
Out[5]: 26243

In [6]: len(pickle.dumps(a[:5]))
Out[6]: 26243

Seems that @maartenbreddels already noticed this and reported it upstream: https://issues.apache.org/jira/browse/ARROW-10739. It's nice sharing the same underlying technologies :)

I'm going to close this, since I don't think pandas could (easily) work around it and any fix would be better on the pyarrow side (cc @jorisvandenbossche, a bit more motivation if you have a chance to work on that issue).

TomAugspurger · 2021-07-20T12:31:34Z

Oh, I see that the workaround in Python isn't too bad. Something like

pa.concat_arrays([self._data])

in the __reduce__ for any arrow-backed array. We can consider implementing that.

maartenbreddels · 2021-07-20T12:34:27Z

Yes, that's the fastest version/workaround I've found.

maartenbreddels · 2021-07-20T12:48:18Z

I am actually using a different version right now: https://github.com/vaexio/vaex/blob/caed2cf106007c6a0141a02a5dfdf823fa38799e/packages/vaex-core/vaex/arrow/convert.py#L260

mrocklin added the Enhancement and Needs Triage labels Jul 19, 2021

TomAugspurger closed this as completed Jul 20, 2021

TomAugspurger reopened this Jul 20, 2021

simonjayhawkins added the IO Parquet, Strings, and IO Pickle labels and removed the Needs Triage label Jul 22, 2021

mroeschke added the Arrow label Oct 13, 2022

mroeschke mentioned this issue Oct 13, 2022

BUG: pickling subset of Arrow-backed data would serialize the entire data #49078

Merged

5 tasks

mroeschke closed this as completed in #49078 Oct 20, 2022
