PERF: concat of pyarrow string array does not rechunk #42357

Open
jreback opened this issue Jul 3, 2021 · 4 comments
Labels
Arrow (pyarrow functionality) · Enhancement · Needs Discussion (Requires discussion from core team before further action) · Performance (Memory or execution speed performance) · Strings (String extension data type and string data)

Comments

jreback (Contributor) commented Jul 3, 2021

In [2]: import pandas as pd

In [3]: pd.__version__
Out[3]: '1.4.0.dev0+143.g5675cd8ab2'

In [4]: s = pd.concat([pd.Series(list('abc'))] * 100_000)

In [5]: %timeit s.str.upper()
48.6 ms ± 420 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [6]: s1 = pd.concat([pd.Series(list('abc'), dtype='string[pyarrow]')] * 100_000)

In [7]: %timeit s1.str.upper()
308 ms ± 12.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [10]: s2 = s.astype('string[pyarrow]')

In [11]: %timeit s2.str.upper()
1.38 ms ± 21.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [12]: s1._mgr.blocks[0].values._data.num_chunks
Out[12]: 100000

In [13]: s2._mgr.blocks[0].values._data.num_chunks
Out[13]: 1

If you naively concat Series, the pyarrow storage is not rechunked, which can lead to dramatic performance issues: e.g. you get 100,000 chunks above, vs. 1 from the astype operation.

@jreback jreback added the Performance and Strings labels Jul 3, 2021
@jreback jreback added this to the 1.3.1 milestone Jul 3, 2021
jreback (Contributor, Author) commented Jul 3, 2021

@jorisvandenbossche

jorisvandenbossche (Member) commented
I think in general it is good that concat can be done without a copy of the data, but we should probably have some "rechunking" facilities and some logic to decide when to rechunk and when not.

Based on a minimum target number of elements for a single chunk, we could check after the concat whether there is any chunk below that number (or more cheaply if the average is below that number, i.e. (_data.length() / _data.num_chunks) < MIN_ELEMENTS).

Pyarrow has a ChunkedArray.combine_chunks, but that will flatten into a single non-chunked array, which is not necessarily what we want for large data. So we will have to write something custom (although we could maybe also upstream it to pyarrow).

@simonjayhawkins simonjayhawkins modified the milestones: 1.3.1, 1.3.2 Jul 24, 2021
@simonjayhawkins simonjayhawkins modified the milestones: 1.3.2, 1.3.3 Aug 15, 2021
@simonjayhawkins simonjayhawkins modified the milestones: 1.3.3, 1.3.4 Sep 10, 2021
simonjayhawkins (Member) commented

changing milestone to 1.3.5

@simonjayhawkins simonjayhawkins modified the milestones: 1.3.4, 1.3.5 Oct 16, 2021
simonjayhawkins (Member) commented

> I think in general it is good that concat can be done without a copy of the data, but we should probably have some "rechunking" facilities and some logic to decide when to rechunk and when not.

adding enhancement and needs discussion labels and moving off 1.3.5 milestone.

@simonjayhawkins simonjayhawkins added the Enhancement and Needs Discussion labels Nov 27, 2021
@simonjayhawkins simonjayhawkins removed this from the 1.3.5 milestone Nov 27, 2021
@jbrockmendel jbrockmendel added the Arrow pyarrow functionality label Mar 2, 2023
4 participants