
BUG: Serializing sparse dataframe 15x slower than converting to dense version + serializing the dense version #41023


Open
melkonyan opened this issue Apr 18, 2021 · 4 comments
Labels
IO CSV · Performance · Sparse

Comments

@melkonyan

Code to reproduce:

import timeit

import numpy as np
import pandas as pd
from scipy import sparse as sc

np.random.seed(42)

# Build a matrix that is ~70% zeros and wrap it in a sparse DataFrame.
vals = np.random.randint(0, 10, size=(1000, 1000))
keep = vals > 3
vals[keep] = 0
sparse_mtx = sc.coo_matrix(vals)
sparse_pd = pd.DataFrame.sparse.from_spmatrix(sparse_mtx)

num_tries = 30

# Time writing the sparse DataFrame directly vs. densifying it first.
t1 = timeit.timeit(lambda: sparse_pd.to_csv('sparse_pd.csv'), number=num_tries)
t2 = timeit.timeit(lambda: sparse_pd.sparse.to_dense().to_csv('sparse_pd.csv'), number=num_tries)

overhead = t1 / t2
print(t1, t2, overhead)

Output:

56.591012510471046 3.7841985523700714 14.954556883657089

Versions:

  • python == 3.9.2
  • pandas == 1.2.4
@melkonyan melkonyan added the Bug and Needs Triage labels Apr 18, 2021
@fangchenli fangchenli added the Performance and Sparse labels and removed the Bug and Needs Triage labels Apr 18, 2021
@fangchenli
Member

About a 10x slowdown on an M1 MacBook Pro.

@lithomas1 lithomas1 added the IO CSV label Apr 18, 2021
@jorisvandenbossche jorisvandenbossche added this to the Contributions Welcome milestone Apr 20, 2021
@jorisvandenbossche
Member

Thanks for the report!

In this case, pandas should probably just convert to dense itself before writing to CSV.
But a quick profile of to_csv with a sparse DataFrame also highlights some other aspects that are slow right now and might be worth optimizing, regardless of to_csv:

  • Slicing rows of a DataFrame with sparse columns (this takes ca. 40% of sparse_pd.to_csv(..), because the DataFrame is chunked before writing). This currently falls back to a take operation, while special-casing slicing could probably be done more efficiently.
  • to_native_types for SparseArray accounts for another ~45% of the overall time, and a large part of that comes from the SparseArray.astype implementation.
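
For reference, a profile like the one described above can be reproduced with cProfile on the snippet from the original report (a sketch; the exact breakdown depends on the pandas version):

# Sketch: profile to_csv on the sparse DataFrame built in the reproduction
# snippet above and list the functions with the highest cumulative time.
import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
sparse_pd.to_csv('sparse_pd.csv')
profiler.disable()

pstats.Stats(profiler).sort_stats('cumulative').print_stats(20)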

@melkonyan
Author

Hi, thanks for the reply.

I've run some more tests, and the slowdown seems to grow as the DataFrame gets bigger (I measured ~15x for 1M elements, ~26x for 4M, and ~60x for 16M). That's very surprising. It's also surprising that the slowdown doesn't seem to be affected by density (the ratio of non-zero elements), at least in my tests. Something must be wrong in the serialization path.
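
A sketch of how such a size/density sweep could be run (the sizes and densities here are illustrative, not the exact benchmark behind the numbers above):

# Sketch: vary matrix size and density and compare to_csv on the sparse
# DataFrame against densifying first, to see how the overhead scales.
import timeit

import pandas as pd
from scipy import sparse as sc

def csv_overhead(n, density, number=3):
    mtx = sc.random(n, n, density=density, format='coo', random_state=0)
    df = pd.DataFrame.sparse.from_spmatrix(mtx)
    t_sparse = timeit.timeit(lambda: df.to_csv('tmp.csv'), number=number)
    t_dense = timeit.timeit(lambda: df.sparse.to_dense().to_csv('tmp.csv'), number=number)
    return t_sparse / t_dense

for n in (1000, 2000, 4000):  # 1M, 4M, 16M elements
    for density in (0.1, 0.3):
        print(n, density, csv_overhead(n, density))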

Regarding converting to a dense matrix before serializing: that would add another problem. The data I'm working with right now is too large to fit into memory in dense form (a 350k x 20k matrix). So ideally, we'd either speed up serialization without conversion or convert smaller chunks of the DataFrame at a time.
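
One possible workaround along those lines (a sketch, not an existing pandas feature) is to densify and write the frame one slab of rows at a time, so the full dense frame never has to exist in memory at once:

# Sketch of a chunked workaround: slice a block of rows, densify only that
# block, and append it to the CSV file.
chunk_size = 10_000
with open('sparse_pd.csv', 'w') as f:
    for start in range(0, len(sparse_pd), chunk_size):
        chunk = sparse_pd.iloc[start:start + chunk_size].sparse.to_dense()
        # Write the header only once, for the first chunk.
        chunk.to_csv(f, header=(start == 0))

Note that the row slicing itself is one of the slow paths mentioned above, so this trades speed for memory.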

@taytzehao
Contributor

Following up on @jorisvandenbossche's first point above: the take operation is slow because it needs to create a new SparseArray instance every time it is called.
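
That cost can be seen in isolation by timing row slices on the sparse frame against the same slices on its dense counterpart (a sketch using the DataFrame from the reproduction snippet; timings are illustrative):

# Sketch: isolate the row-slicing cost by comparing iloc slices on the
# sparse DataFrame with the same slices on its dense equivalent.
import timeit

dense_pd = sparse_pd.sparse.to_dense()

t_sparse_slice = timeit.timeit(lambda: sparse_pd.iloc[0:100], number=100)
t_dense_slice = timeit.timeit(lambda: dense_pd.iloc[0:100], number=100)

print(t_sparse_slice, t_dense_slice, t_sparse_slice / t_dense_slice)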

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@rtlee9 rtlee9 mentioned this issue Oct 13, 2022