PERF: Bypass chunking/validation logic in StringDtype__from_arrow__ #47781
Conversation
Instead of converting each chunk to a StringArray after casting it to a numpy array and then concatenating, use pyarrow to concatenate the chunks and convert the result to numpy. Finally, bypass the validation logic by initializing NDArrayBacked instead of StringArray.
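For illustration, a minimal sketch of that approach (not the merged pandas source). The function name from_arrow_sketch is hypothetical, and the pandas internals it relies on (lib.ensure_string_array, NDArrayBacked, StringArray, StringDtype(storage="python")) are private APIs whose locations and signatures may differ between versions:

```python
import pyarrow as pa

from pandas._libs import lib, missing as libmissing
from pandas._libs.arrays import NDArrayBacked
from pandas.core.arrays.string_ import StringArray, StringDtype


def from_arrow_sketch(array):
    """Convert a pyarrow Array/ChunkedArray of strings to a StringArray."""
    # Concatenate on the pyarrow side instead of building one StringArray
    # per chunk and concatenating those afterwards.
    if isinstance(array, pa.ChunkedArray):
        if array.num_chunks == 0:
            array = pa.array([], type=pa.string())
        else:
            array = pa.concat_arrays(array.chunks)

    # Single conversion to an object ndarray; None values become pd.NA here.
    result = array.to_numpy(zero_copy_only=False)
    result = lib.ensure_string_array(result, na_value=libmissing.NA)

    # Bypass StringArray's validation: initialize the NDArrayBacked base
    # directly, since the values are already known to be valid strings / NA.
    new_string_array = StringArray.__new__(StringArray)
    NDArrayBacked.__init__(new_string_array, result, StringDtype(storage="python"))
    return new_string_array
```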
Is this compatible with the minimum pyarrow version we are supporting?
Good point, I hadn't considered this. No - afaict this code requires pyarrow 3.0. I understand this performance issue alone may not be sufficient reason to bump a version, but, in general, what would be the requirements for that? There are just 5 months between the two releases.
Could you open an issue about bumping pyarrow? Then we can discuss there and move forward from that.
Is there any way of doing this without requiring 3.0? Otherwise this would have to wait for a bit.
I think it's possible to implement something that's already a little better than what's on 1.4 without requiring pyarrow 3.
Depends on the nature of the change; we don't backport anything big to a release candidate. This one would have to wait for 1.6, as we avoid performance changes on 1.5.x.
In that case, I think it's fine to just wait for 1.6 and make this change directly. One can work around the performance impact by using
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.
We just increased the minimum version to 6.0, so we could finish this.
Excellent, I'll revisit this soon! Edit: I recently found that pyarrow's
Smaller, single-topic PRs would be preferred.
I think this is ready then - I just changed the whatsnew entry; the code change stays the same. I also briefly checked an approach I thought might improve performance for the other dtypes, but it didn't. There may be some places where one could switch to pyarrow concatenation, but those I checked (integer/numerical) didn't yield performance improvements (and may incur some memory overhead).
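For illustration, a rough version of that kind of comparison could look like the sketch below (purely illustrative; the chunk sizes and counts are made up and not taken from this PR):

```python
import numpy as np
import pyarrow as pa
from timeit import timeit

# An integer column split into several chunks, as it might come out of a parquet read.
chunked = pa.chunked_array([pa.array(np.arange(1_000_000)) for _ in range(10)])

# Concatenate on the pyarrow side, then convert to numpy once.
t_arrow = timeit(
    lambda: pa.concat_arrays(chunked.chunks).to_numpy(zero_copy_only=False),
    number=20,
)

# Convert each chunk to numpy, then concatenate on the numpy side.
t_numpy = timeit(
    lambda: np.concatenate([c.to_numpy(zero_copy_only=False) for c in chunked.chunks]),
    number=20,
)

print(t_arrow, t_numpy)
```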
Can you merge main?
@timlod there is a merge conflict here, but since the RC is now cut, this would probably need the release note moved to 2.1.
Thanks for sticking with this @timlod
Instead of converting each chunk to a StringArray after casting it to a numpy array and then concatenating, use pyarrow to concatenate the chunks and convert the result to numpy.
Finally, bypass the validation logic (unneeded, as the data is validated on parquet write) by initializing NDArrayBacked instead of StringArray.
This removes most of the performance overhead seen in #47345. There is still a slight overhead compared to object string arrays because of the None -> NA conversion. I found that leaving that conversion out still results in NA values in the example I gave (and would actually improve performance over the object case), but this is not consistent, so the conversion is left in.
Related issue: use_nullable_dtypes=True in read_parquet slows performance on large dataframes (#47345)
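For context, a rough way to reproduce the kind of overhead comparison discussed in #47345 might look like this (the file name and data sizes are illustrative and not taken from the issue):

```python
import numpy as np
import pandas as pd
from timeit import timeit

# Build a parquet file with a string column containing missing values.
df = pd.DataFrame({"s": np.random.choice(["foo", "bar", None], size=1_000_000)})
df.to_parquet("strings.parquet")

# Default object-dtype strings vs. nullable StringDtype via use_nullable_dtypes=True;
# before this change the nullable path was noticeably slower.
print(timeit(lambda: pd.read_parquet("strings.parquet"), number=10))
print(timeit(lambda: pd.read_parquet("strings.parquet", use_nullable_dtypes=True), number=10))
```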