
BUG: Serializing sparse dataframe 15x slower than converting to dense version + serializing the dense version #41023


Open
melkonyan opened this issue Apr 18, 2021 · 4 comments
Labels
IO CSV · Performance · Sparse

Comments

@melkonyan

Code to reproduce:

import timeit

import numpy as np
import pandas as pd
from scipy import sparse as sc

np.random.seed(42)

# Build a matrix that is ~70% zeros and wrap it in a sparse DataFrame.
vals = np.random.randint(0, 10, size=(1000, 1000))
keep = vals > 3
vals[keep] = 0
sparse_mtx = sc.coo_matrix(vals)
sparse_pd = pd.DataFrame.sparse.from_spmatrix(sparse_mtx)

num_tries = 30

# Time writing the sparse DataFrame directly vs. densifying it first.
t1 = timeit.timeit(lambda: sparse_pd.to_csv('sparse_pd.csv'), number=num_tries)
t2 = timeit.timeit(lambda: sparse_pd.sparse.to_dense().to_csv('sparse_pd.csv'), number=num_tries)

overhead = t1 / t2
print(t1, t2, overhead)

Output:

56.591012510471046 3.7841985523700714 14.954556883657089

Versions:

  • python == 3.9.2
  • pandas == 1.2.4
@melkonyan melkonyan added the Bug and Needs Triage labels Apr 18, 2021
@fangchenli fangchenli added the Performance and Sparse labels and removed the Bug and Needs Triage labels Apr 18, 2021
@fangchenli
Member

About a 10x slowdown on an M1 MacBook Pro.

@lithomas1 lithomas1 added the IO CSV label Apr 18, 2021
@jorisvandenbossche jorisvandenbossche added this to the Contributions Welcome milestone Apr 20, 2021
@jorisvandenbossche
Member

Thanks for the report!

In this case, pandas should probably just convert to dense itself before writing to CSV.
But a quick profile of to_csv with a sparse DataFrame also highlights some other aspects that are slow right now and might be worth optimizing, regardless of to_csv:

  • Slicing rows of a DataFrame with sparse columns (this takes ca. 40% of sparse_pd.to_csv(..), because the DataFrame is chunked before writing). This currently falls back to a take operation, while special-casing slicing could probably be done more efficiently.
  • to_native_types for SparseArray accounts for another ~45% of the overall time, and a large part of that comes from the SparseArray.astype implementation.
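
For reference, a profile like the one described above can be reproduced with cProfile on the snippet from the original report (a sketch; the exact breakdown depends on the pandas version):

# Sketch: profile to_csv on the sparse DataFrame built in the reproduction
# snippet above and list the functions with the highest cumulative time.
import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
sparse_pd.to_csv('sparse_pd.csv')
profiler.disable()

pstats.Stats(profiler).sort_stats('cumulative').print_stats(20)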

@melkonyan
Author

Hi, thanks for the reply.

I've run some more tests, and the slowdown seems to grow as the DataFrame gets bigger (I measured ~15x for 1M elements, ~26x for 4M, and ~60x for 16M). That's very surprising. It's also surprising that the slowdown doesn't seem to be affected by density (the ratio of non-zero elements), at least in my tests. Something must be wrong in the serialization path.
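
A sketch of how such a size/density sweep could be run (the sizes and densities here are illustrative, not the exact benchmark behind the numbers above):

# Sketch: vary matrix size and density and compare to_csv on the sparse
# DataFrame against densifying first, to see how the overhead scales.
import timeit

import pandas as pd
from scipy import sparse as sc

def csv_overhead(n, density, number=3):
    mtx = sc.random(n, n, density=density, format='coo', random_state=0)
    df = pd.DataFrame.sparse.from_spmatrix(mtx)
    t_sparse = timeit.timeit(lambda: df.to_csv('tmp.csv'), number=number)
    t_dense = timeit.timeit(lambda: df.sparse.to_dense().to_csv('tmp.csv'), number=number)
    return t_sparse / t_dense

for n in (1000, 2000, 4000):  # 1M, 4M, 16M elements
    for density in (0.1, 0.3):
        print(n, density, csv_overhead(n, density))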

Regarding converting to a dense matrix before serializing: that would add another problem. The data I'm working with right now is too large to fit into memory in dense form (a 350k x 20k matrix). So ideally, we'd either speed up serialization without conversion or convert smaller chunks of the DataFrame at a time.
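
One possible workaround along those lines (a sketch, not an existing pandas feature) is to densify and write the frame one slab of rows at a time, so the full dense frame never has to exist in memory at once:

# Sketch of a chunked workaround: slice a block of rows, densify only that
# block, and append it to the CSV file.
chunk_size = 10_000
with open('sparse_pd.csv', 'w') as f:
    for start in range(0, len(sparse_pd), chunk_size):
        chunk = sparse_pd.iloc[start:start + chunk_size].sparse.to_dense()
        # Write the header only once, for the first chunk.
        chunk.to_csv(f, header=(start == 0))

Note that the row slicing itself is one of the slow paths mentioned above, so this trades speed for memory.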

@taytzehao
Contributor

Following up on @jorisvandenbossche's first point above: the take operation is slow because it needs to create a new SparseArray instance every time it is called.
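
That cost can be seen in isolation by timing row slices on the sparse frame against the same slices on its dense counterpart (a sketch using the DataFrame from the reproduction snippet; timings are illustrative):

# Sketch: isolate the row-slicing cost by comparing iloc slices on the
# sparse DataFrame with the same slices on its dense equivalent.
import timeit

dense_pd = sparse_pd.sparse.to_dense()

t_sparse_slice = timeit.timeit(lambda: sparse_pd.iloc[0:100], number=100)
t_dense_slice = timeit.timeit(lambda: dense_pd.iloc[0:100], number=100)

print(t_sparse_slice, t_dense_slice, t_sparse_slice / t_dense_slice)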

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@rtlee9 rtlee9 mentioned this issue Oct 13, 2022