PERF: allow to skip validation/sanitization in DataFrame._from_arrays #32858

jorisvandenbossche · 2020-03-20T09:48:39Z

For cases where you know the data are already valid (e.g. you just created the arrays yourself, or they have already been validated), it can be useful to skip the validation checks when creating a DataFrame from arrays.

One use case is #32825.

This came out of investigating #32196 (comment).

@rth this gives another 20% improvement on the DataFrame creation part. Together with #32856, it gives a bit more than a 2x improvement on the DataFrame creation part (once the sparse arrays are created).

jorisvandenbossche · 2020-03-20T09:49:44Z

In [1]: arrays = [pd.arrays.SparseArray(np.random.randint(0, 2, 1000), dtype="float64") for _ in range(10000)] 
   ...: index = pd.Index(range(len(arrays[0])))   
   ...: columns = pd.Index(range(len(arrays)))

In [2]: %timeit pd.DataFrame._from_arrays(arrays, index=index, columns=columns)   
119 ms ± 3.52 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [3]: %timeit pd.DataFrame._from_arrays(arrays, index=index, columns=columns, verify_integrity=False)    
98.1 ms ± 713 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

rth · 2020-03-20T09:59:55Z

Very nice, thanks @jorisvandenbossche! It's good that it applies to extension arrays in general, not just sparse frames. Though I guess the use case of a very large number of columns (>10k) is less common outside of sparse.

rth · 2020-03-20T10:01:49Z

Actually my comment was more about #32856; I got confused by your multiple performance improvement PRs :) This is nice too!

rth · 2020-03-20T10:05:13Z

pandas/core/frame.py

+            Optional dtype to enforce for all arrays.
+        verify_integrity : bool, default True
+            Validate and homogenize all input. If set to False, it is assumed
+            that all elements of `arrays` are actual arrays to be stored in


By "actual arrays" do you mean numpy ndarray or pandas arrays? Might be worth specifying.

jorisvandenbossche · Mar 20, 2020

Either one. Basically the array as it is stored in a block. I will mention that it needs to be one of the two.
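To make the contract in that docstring concrete, here is a minimal pure-Python sketch of a constructor with a `verify_integrity` flag. It is not the pandas implementation: `frame_from_arrays` is a hypothetical helper, a plain dict stands in for the block storage, and `array.array` stands in for "actual arrays". It only illustrates the trade-off discussed here: the default path coerces and checks every input, while `verify_integrity=False` trusts the caller and stores the inputs unchanged.

```python
from array import array


def frame_from_arrays(arrays, columns, index, verify_integrity=True):
    """Toy column-store constructor (hypothetical, not pandas code)."""
    if verify_integrity:
        # Slow path: coerce every input to a typed array and check shapes.
        arrays = [a if isinstance(a, array) else array("d", a) for a in arrays]
        if len(arrays) != len(columns):
            raise ValueError("number of arrays must match number of columns")
        for a in arrays:
            if len(a) != len(index):
                raise ValueError("each array must have the same length as the index")
    # Fast path: the caller promises the inputs are already valid arrays,
    # so they are stored as-is without any per-column work.
    return dict(zip(columns, arrays))


index = range(3)
valid = [array("d", [1.0, 2.0, 3.0]), array("d", [4.0, 5.0, 6.0])]

# Unvalidated input (plain lists) goes through the checking path.
checked = frame_from_arrays([[1, 2, 3], [4, 5, 6]], ["a", "b"], index)

# Already-valid arrays can skip the checks entirely.
trusted = frame_from_arrays(valid, ["a", "b"], index, verify_integrity=False)
```

With the flag off, invalid input is not detected at construction time, which is why the docstring has to state exactly what the caller must pass.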

jorisvandenbossche · 2020-03-20T10:05:38Z

I think your comment applies to both ;)

simonjayhawkins · 2020-03-20T10:14:00Z

Nice docstring. 😄

TomAugspurger

What order do we want to do this and #32825 in?

Seems like we merge this first and then update #32825 to pass verify_integrity=False?

jorisvandenbossche · 2020-03-20T13:32:34Z

Doesn't matter too much, can be added either here or there.
And I would also like to add verify_integrity to the benchmark that I am adding in #32856.

TomAugspurger · 2020-03-20T13:35:06Z

SGTM. I think this can go in since CI is passing :)

PERF: allow to skip validation/sanitization in DataFrame._from_arrays

4502cdb

jorisvandenbossche added the Performance Memory or execution speed performance label Mar 20, 2020

jorisvandenbossche added this to the 1.1 milestone Mar 20, 2020

jorisvandenbossche mentioned this pull request Mar 20, 2020

PERF: optimize DataFrame.sparse.from_spmatrix performance #32825

Merged

rth reviewed Mar 20, 2020

clarify array type

a0d1c27

TomAugspurger approved these changes Mar 20, 2020

jorisvandenbossche merged commit 3b406a3 into pandas-dev:master Mar 20, 2020

jorisvandenbossche deleted the perf-arrays-skip-sanitize branch March 20, 2020 20:06

SeeminSyed pushed a commit to CSCD01-team01/pandas that referenced this pull request Mar 22, 2020

PERF: allow to skip validation/sanitization in DataFrame._from_arrays (…

065accb

…pandas-dev#32858)

jbrockmendel pushed a commit to jbrockmendel/pandas that referenced this pull request Mar 23, 2020

PERF: allow to skip validation/sanitization in DataFrame._from_arrays (…

72822bd

…pandas-dev#32858)
