BUG: Serializing sparse dataframe 15x slower than converting to dense version + serializing the dense version #41023
About 10x on an M1 MacBook Pro.
Thanks for the report! In this case, pandas should probably just convert to dense before writing to CSV itself.
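As a user-side workaround in the meantime, densifying explicitly before writing already takes the fast path. A minimal sketch, assuming an all-sparse integer frame (the shape and density are illustrative, not from the issue):

```python
import numpy as np
import pandas as pd

# Build an illustrative sparse frame (size and density are assumptions).
rng = np.random.default_rng(0)
sparse_df = pd.DataFrame(rng.binomial(1, 0.01, size=(100_000, 10))).astype(
    pd.SparseDtype("int64", 0)
)

# Densifying first sidesteps the slow sparse path in to_csv; this is what
# the suggestion above would have pandas do internally.
sparse_df.sparse.to_dense().to_csv("out.csv", index=False)
```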
Hi, thanks for the reply. I've run more tests, and the slowdown seems to grow with the size of the dataframe (~15x for 1M elements, ~26x for 4M, and ~60x for 16M). That's very surprising. It's also surprising that the slowdown doesn't seem to be affected by density (the ratio of non-zero elements), at least in my tests; something must be wrong in the serialization algorithm. As for converting to a dense matrix before serializing, that introduces another problem: the data I'm working with right now is too large to fit into memory in dense form (a 350k x 20k matrix). So ideally, we'd either speed up serialization without conversion or convert smaller chunks of the dataframe (see the sketch below).
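A rough sketch of that chunked idea, assuming all columns are sparse. The function name and `chunk_rows` default are hypothetical; row-slicing a sparse frame still pays some of the take cost, but peak memory stays bounded by the chunk size:

```python
import pandas as pd

def sparse_to_csv_chunked(df: pd.DataFrame, path: str, chunk_rows: int = 10_000) -> None:
    """Densify and write df in row slices so the full dense matrix
    never has to exist in memory at once."""
    for start in range(0, len(df), chunk_rows):
        # Densify only the current slice of rows.
        chunk = df.iloc[start:start + chunk_rows].sparse.to_dense()
        # Write the header once, then append subsequent chunks.
        chunk.to_csv(
            path,
            mode="w" if start == 0 else "a",
            header=(start == 0),
            index=False,
        )
```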
Based on @jorisvandenbossche's first point above, the take operation is slow because it needs to create a new instance of SparseArray every time it is called.
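A quick illustration of why that hurts (a hypothetical snippet, not from the issue): each `take()` on a `SparseArray` allocates and returns a freshly constructed array with a rebuilt sparse index, so per-row access during serialization pays that construction cost over and over:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
arr = pd.arrays.SparseArray(rng.binomial(1, 0.01, 1_000_000))

# Every take() returns a brand-new SparseArray; none of the work
# of building its sparse index is shared across calls.
row = arr.take([0])
print(type(row).__name__, row is arr)  # SparseArray False
```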
Code to reproduce:
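The original snippet did not survive extraction; what follows is an illustrative reconstruction of the benchmark the title describes (shapes, density, and file names are assumptions), not the reporter's exact code:

```python
import time

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
dense_df = pd.DataFrame(rng.binomial(1, 0.05, size=(100_000, 10)))  # ~1M elements
sparse_df = dense_df.astype(pd.SparseDtype("int64", 0))

t0 = time.perf_counter()
sparse_df.to_csv("sparse.csv")
print(f"sparse to_csv:       {time.perf_counter() - t0:.2f}s")

t0 = time.perf_counter()
sparse_df.sparse.to_dense().to_csv("dense.csv")
print(f"densify then to_csv: {time.perf_counter() - t0:.2f}s")
```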
Output:
Versions: