PERF: allow to skip validation/sanitization in DataFrame._from_arrays #32858
Very nice, thanks @jorisvandenbossche! It's good that it applies to extension arrays in general, not just sparse frames. Though I guess the use case of a very large number of columns (>10k) is less common outside of sparse.
Actually my comment was more about #32856; I got confused by your multiple performance improvement PRs :) This is nice too!
pandas/core/frame.py
Outdated
Optional dtype to enforce for all arrays.
verify_integrity : bool, default True
Validate and homogenize all input. If set to False, it is assumed
that all elements of `arrays` are actual arrays to be stored in
By "actual arrays" do you mean numpy ndarray or pandas arrays? Might be worth specifying.
Either one. Basically the array as it is stored in a block. I will mention that it needs to be one of the two.
I think your comment applies to both ;)
Nice docstring. 😄
Doesn't matter too much, can be added either here or there.
SGTM. I think this can go in since CI is passing :)
For cases where you know the data is valid (e.g. you just created the arrays yourself, or they have already been validated), it can be useful to skip the validation checks when creating a DataFrame from arrays.
One example use case is #32825.
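To illustrate the idea, here is a minimal sketch of a constructor with an opt-out validation flag. The `Frame` class and its `from_arrays` method are hypothetical stand-ins written only for this example; they are not pandas' actual `DataFrame._from_arrays` implementation, and only the `verify_integrity` keyword name is taken from the docstring above.

```python
from array import array


class Frame:
    """Toy column store; a hypothetical stand-in, not pandas' DataFrame."""

    def __init__(self, columns, index, data):
        self.columns = columns
        self.index = index
        self._data = data  # maps column name -> array.array

    @classmethod
    def from_arrays(cls, arrays, columns, index, verify_integrity=True):
        if verify_integrity:
            # Checked path: coerce every input and validate the shapes.
            if len(arrays) != len(columns):
                raise ValueError("need exactly one array per column")
            arrays = [a if isinstance(a, array) else array("d", a) for a in arrays]
            for a in arrays:
                if len(a) != len(index):
                    raise ValueError("array length must match index length")
        # With verify_integrity=False the caller vouches for the inputs,
        # so they are stored as-is without copying or checking.
        return cls(list(columns), list(index), dict(zip(columns, arrays)))


index = range(3)
trusted = [array("d", [1.0, 2.0, 3.0]), array("d", [4.0, 5.0, 6.0])]

# Arbitrary list input goes through the checked path.
checked = Frame.from_arrays([[1, 2, 3], [4, 5, 6]], ["a", "b"], index)
# Arrays that are already in storage format can skip the checks.
fast = Frame.from_arrays(trusted, ["a", "b"], index, verify_integrity=False)
```

The trade-off in this sketch is that the unchecked path stores whatever it is given, so a wrong length or type would only surface later. That is why such a flag is only suited to internal callers that construct the arrays themselves.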
From investigating #32196 (comment)
@rth this gives another 20% improvement on the dataframe creation part. Together with #32856, it gives a bit more than a 2x improvement on the dataframe creation part (once the sparse arrays are created).