
PERF: skip non-consolidatable blocks when checking consolidation #32826


Merged
merged 2 commits on Mar 19, 2020

Conversation

jorisvandenbossche (Member)

This skips non-consolidatable blocks when determining whether the blocks are consolidated. That means that if you have multiple EA columns with the same dtype, we no longer do an unnecessary consolidation.

From investigating #32196 (comment)

@rth this should give another speed-up to your benchmark
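
For context, a minimal sketch of the check this enables (a hypothetical standalone helper, not the exact diff in this PR; it assumes pandas' internal Block._can_consolidate flag):

def is_consolidated(blocks) -> bool:
    # Non-consolidatable blocks (e.g. extension blocks) can never be merged,
    # so skip them: duplicate dtypes among them should not force a
    # consolidation pass.
    dtypes = [blk.dtype for blk in blocks if blk._can_consolidate]
    # Consolidated means at most one consolidatable block per dtype.
    return len(dtypes) == len(set(dtypes))

With a check like this, a frame holding thousands of same-dtype EA columns is recognized as already consolidated instead of triggering a no-op consolidation pass.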

jorisvandenbossche added the Performance label (Memory or execution speed performance) on Mar 19, 2020
jorisvandenbossche (Member, Author)

With:

import numpy as np
import pandas as pd

arrays = [pd.arrays.SparseArray(np.random.randint(0, 2, 1000), dtype="float64") for _ in range(10000)]
index = pd.Index(range(len(arrays[0])))
columns = pd.Index(range(len(arrays)))

it gives

In [4]: %timeit pd.DataFrame._from_arrays(arrays, index=index, columns=columns)  
122 ms ± 2.92 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)   <-- PR
249 ms ± 8.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)   <-- master

rth (Contributor) commented on Mar 19, 2020

Very nice @jorisvandenbossche , thanks!

I can confirm that, when cherry-picked on top of #32825, this gives another 2x speed-up:

[100.00%] ··· sparse.SparseDataFrameConstructor.time_from_scipy                                                               115±0.5ms
       before           after         ratio
     [dbd7a5d3]       [c2476380]
     <master>         <tmp-sparse>
-       115±0.5ms       14.1±0.2ms     0.12  sparse.SparseDataFrameConstructor.time_from_scipy

So that's around a 14x speed-up overall from these 3 PRs.

simonjayhawkins added this to the 1.1 milestone on Mar 19, 2020
jbrockmendel (Member)

Looks like a nice speedup. Couple of questions, non-blockers:

  • are the improvements specific to SparseArray?
  • if we exclude non-consolidatable blocks, can we check dtype instead of ftype?
  • does caching ftype make a difference?

jorisvandenbossche (Member, Author)

Not specific to SparseArray; it's for all non-consolidatable blocks (which is now just ExtensionBlock, I think?). It's just that creating a DataFrame from a sparse matrix is a case where you only have extension blocks, and thus where this is relatively more costly.

I suppose we can get rid of ftype entirely: indeed, it was only used in the past to differentiate dense and sparse dtypes, but sparse dtypes are now ExtensionDtypes rather than numpy dtypes, so we don't need it anymore.
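
As a small illustration of that point (assuming current pandas behavior; the printed values are indicative):

import numpy as np
import pandas as pd

dense = pd.Series(np.arange(3, dtype="float64"))
sparse = pd.Series(pd.arrays.SparseArray([0.0, 1.0, 0.0]))

print(dense.dtype)   # float64, a plain numpy dtype
print(sparse.dtype)  # Sparse[float64, nan], an ExtensionDtype

Since a sparse column's dtype is already a distinct ExtensionDtype, comparing dtypes is enough; the old ftype string, which tagged a dtype as dense or sparse, no longer adds information.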

jorisvandenbossche (Member, Author)

@jbrockmendel updated to remove ftype, thanks for the questions!

WillAyd (Member) left a comment:

lgtm - nice cleanup

jreback merged commit dbd24ad into pandas-dev:master on Mar 19, 2020
jreback (Contributor) commented on Mar 19, 2020

thanks @jorisvandenbossche

yeah, we had removed ftypes in 1.0, so this was a leftover, I think.
