
PERF: Bypass chunking/validation logic in StringDtype__from_arrow__ #47781


Merged: 10 commits merged into pandas-dev:main on Feb 24, 2023

Conversation

@timlod (Contributor) commented on Jul 18, 2022

Instead of converting each chunk to a StringArray after casting it to an array and then concatenating, use pyarrow to concatenate the chunks and convert the result to numpy.

Finally, bypass the validation logic (unneeded, since the data was already validated on parquet write) by initializing NDArrayBacked instead of StringArray.

This removes most of the performance overhead seen in #47345. There is still a slight overhead when comparing to object string arrays because of None -> NA conversion. I found that leaving that out still results in NA types in the example I gave (and would actually improve performance over the object case), but this is not consistent and thus conversion is left in.
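
Roughly, the approach can be sketched as follows. This is a minimal, illustrative sketch rather than the exact diff: the helper name string_array_from_arrow is made up here, while NDArrayBacked, StringArray and StringDtype are the pandas classes referred to above.

```python
import pandas as pd
import pyarrow as pa
from pandas._libs.arrays import NDArrayBacked  # pandas internal base class
from pandas.arrays import StringArray


def string_array_from_arrow(array):
    """Illustrative sketch: pyarrow (Chunked)Array of strings -> StringArray."""
    # Let pyarrow combine the chunks instead of converting each chunk to a
    # StringArray on the pandas side and concatenating afterwards.
    if isinstance(array, pa.ChunkedArray):
        array = array.combine_chunks()
    # String data cannot be zero-copied into numpy; nulls come back as None.
    values = array.to_numpy(zero_copy_only=False)
    # None -> pd.NA conversion (the remaining overhead mentioned above).
    values[array.is_null().to_numpy(zero_copy_only=False)] = pd.NA
    # Bypass StringArray's element-wise validation by initializing through
    # NDArrayBacked directly; the data was already validated on parquet write.
    result = StringArray.__new__(StringArray)
    NDArrayBacked.__init__(result, values, pd.StringDtype())
    return result
```

The discussion below also mentions pyarrow.concat_arrays as a way to combine chunks; both that and to_numpy(zero_copy_only=False) require pyarrow 3.0 or newer.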

timlod added 3 commits July 18, 2022 18:38
Instead of converting each chunk to a StringArray after casting to
array and then concatenating, instead use pyarrow to concatenate chunks
and convert to numpy.

Finally, we bypass the validation logic by initializing
NDArrayBacked instead of StringArray.
@timlod changed the title from "Bypass chunking/validation logic in StringDtype__from_arrow__" to "PERF: Bypass chunking/validation logic in StringDtype__from_arrow__" on Jul 19, 2022
@phofl (Member) left a comment

Is this compatible with the minimum pyarrow version we are supporting?

@timlod (Contributor, Author) commented on Jul 20, 2022

Good point, I hadn't considered this. No - as far as I can tell, this code requires pyarrow 3.0 (pyarrow.concat_arrays as well as array.to_numpy(zero_copy_only=False) were both introduced in 3.0), whereas https://pandas.pydata.org/pandas-docs/stable/getting_started/install.html states pyarrow 1.0.1 as the minimum version.

I understand this performance issue alone may not be sufficient reason to bump a version, but, in general, what would be the requirements for that? There's just 5 months between the two releases.
Edit: Looking at how pyarrow has been bumped in accordance with new pandas versions in the past, it feels like moving to pyarrow 3 for pandas 1.5 could be reasonable (current pyarrow is version 8).

@mroeschke added the Performance (memory or execution speed), Strings (string extension data type and string data), and Arrow (pyarrow functionality) labels on Jul 22, 2022
@phofl (Member) commented on Aug 5, 2022

Could you open an issue about bumping pyarrow? Then we can discuss it there and move forward from that.

@phofl (Member) commented on Aug 26, 2022

Is there any way of doing this without requiring 3.0? Otherwise this would have to wait for a bit.

@timlod (Contributor, Author) commented on Aug 29, 2022

I think it's possible to implement something that's already a little better than what's on 1.4 without requiring pyarrow 3.
However, it's probably wise to switch to how it's done in this PR once pandas does require pa3.
I could make another PR later this week, if that's not too late for this release - and this one could be kept open for 1.5.1.

@phofl (Member) commented on Aug 29, 2022

That depends on the nature of the change; we don't backport anything big to a release candidate.

This one would have to wait for 1.6; we avoid performance changes on 1.5.x.

@timlod (Contributor, Author) commented on Sep 3, 2022

In that case, I think it's fine to just wait for 1.6 and make this change directly. One can work around the performance impact by using object strings until then.
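
For reference, a hypothetical illustration of that workaround (the file name is made up; use_nullable_dtypes defaulted to False in read_parquet at the time of this conversation):

```python
import pandas as pd

# Keeps string columns as plain object-dtype columns, avoiding the
# StringDtype conversion overhead this PR addresses.
df = pd.read_parquet("data.parquet")

# The slower path from the linked issue:
# df = pd.read_parquet("data.parquet", use_nullable_dtypes=True)
```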

github-actions bot commented on Oct 4, 2022

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

github-actions bot added the Stale label on Oct 4, 2022
@mroeschke mentioned this pull request on Oct 14, 2022
@phofl (Member) commented on Oct 17, 2022

We just increased the minimum version to 6.0, so we could finish this

@timlod (Contributor, Author) commented on Oct 18, 2022

Excellent, I'll revisit this soon!

Edit: I recently found that pyarrow's to_pandas() method can be the bottleneck when loading large parquet files that are read as large chunked arrays. I think implementing similar logic (using pyarrow's own methods rather than concatenating lists of numpy arrays) for other datatypes might drastically improve read performance. Would it make sense to open a larger PR containing all those changes (if I can show improvements), or add those here?

@mroeschke (Member) commented

> Would it make sense to open a larger PR containing all those changes (if I can show improvements), or add those here?

Smaller, single-topic scoped PRs would be preferred.

@lithomas1 removed the Stale label on Oct 23, 2022
@timlod (Contributor, Author) commented on Oct 23, 2022

I think this is ready then - I just changed the whatsnew entry; the code change stays the same.

I also briefly checked whether the same idea might improve performance for the other dtypes, but it didn't. There may be some places where one could switch to pyarrow concatenation, but those that I checked (integer/numerical) didn't yield performance improvements (and may introduce some memory overhead).

@phofl (Member) commented on Jan 19, 2023

Can you merge main?

@simonjayhawkins (Member) commented
@timlod there is a merge conflict here, but since the rc is now cut, this would probably need the release note moved to 2.1.

@mroeschke added this to the 2.1 milestone on Feb 24, 2023
@mroeschke merged commit 129108f into pandas-dev:main on Feb 24, 2023
@mroeschke (Member) commented
Thanks for sticking with this @timlod

Labels
Arrow (pyarrow functionality), Performance (memory or execution speed), Strings (string extension data type and string data)
Development

Successfully merging this pull request may close these issues.

PERF: using use_nullable_dtypes=True in read_parquet slows performance on large dataframes
6 participants