Build empty SparseDataFrame by columns very long compared to by index. #16197

Closed
cfrancois7 opened this issue May 2, 2017 · 2 comments
Labels
Performance Memory or execution speed performance Sparse Sparse Data Type

Comments

@cfrancois7

Code Sample, a copy-pastable example if possible

I want to create a sparse matrix with a 4-level MultiIndex and about 340,000 x 340,000 cells.
It is not possible to build it as a dense matrix and then convert it to sparse.
So I tried to build it directly as a SparseDataFrame.

n = len(index)
print(n)  # 338275
%timeit df0 = pd.SparseDataFrame(index=index)
1000 loops, best of 3: 419 µs per loop

But if I try to construct:

df1 = pd.SparseDataFrame(columns=index)

or

df2 = pd.SparseDataFrame(index=index, columns=index)

A whole night wasn't enough to build the empty SparseDataFrame.
I don't understand how to build this empty SparseDataFrame in a reasonable time (less than 20 minutes with 8 GB of RAM).
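
For scale, the claim that the dense version cannot be built can be checked with a quick estimate. This is a rough sketch that assumes 8-byte float64 cells; the issue does not state the dtype.

```python
# Back-of-the-envelope memory estimate for the dense matrix.
# Assumption: float64 cells (8 bytes each); the dtype is not given in the issue.
n = 338275                     # len(index) as reported in the issue
bytes_per_cell = 8
cells = n * n
dense_bytes = cells * bytes_per_cell
dense_gb = dense_bytes / 1e9   # decimal gigabytes

print(f"{cells:,} cells, about {dense_gb:.0f} GB if stored dense")
```

Under that assumption the dense matrix would need on the order of 900 GB, far beyond the 8 GB available.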

Output of pd.show_versions()

```
commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8
LOCALE: fr_FR.UTF-8

pandas: 0.19.2
nose: 1.3.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.11.3
scipy: 0.19.0
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.5.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.0
tables: 3.3.0
numexpr: 2.6.1
matplotlib: 2.0.0
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.2
bs4: 4.5.3
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.5
pymysql: None
psycopg2: None
jinja2: 2.9.4
boto: 2.45.0
```

@jreback
Contributor

jreback commented May 2, 2017

You can try this (available in 0.20.0, releasing shortly): http://pandas-docs.github.io/pandas-docs-travis/sparse.html#sparsedataframe

though I suspect that may not work. pandas sparse is row based, NOT column based, so your size would blow up memory.
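
The linked section appears to describe building a SparseDataFrame from a `scipy.sparse` matrix. If that conversion is still too costly at this size, one alternative is to stay in `scipy.sparse` and keep the labels in a separate lookup. The following is only a minimal sketch of that idea, not something proposed in this thread: the labels, the `pos` dictionary and the cell values are hypothetical stand-ins for the real 4-level MultiIndex and data.

```python
import numpy as np
from scipy import sparse

# Hypothetical labels standing in for the 4-level MultiIndex tuples.
labels = [("A", "x", 1, "u"), ("A", "y", 1, "u"), ("B", "x", 2, "v")]
pos = {label: i for i, label in enumerate(labels)}  # label -> integer position

# Only the non-zero cells are listed: (row label, column label, value).
nonzero_cells = [
    (labels[0], labels[2], 1.5),
    (labels[1], labels[1], 2.0),
]
rows = np.array([pos[r] for r, _, _ in nonzero_cells])
cols = np.array([pos[c] for _, c, _ in nonzero_cells])
vals = np.array([v for _, _, v in nonzero_cells])

# Build in COO format, then convert to CSR for indexing and arithmetic.
n = len(labels)
m = sparse.coo_matrix((vals, (rows, cols)), shape=(n, n)).tocsr()

# Label-based lookup goes through the position dictionary.
value = m[pos[labels[0]], pos[labels[2]]]
print(m.shape, m.nnz, value)
```

Memory then scales with the number of stored cells plus one dictionary entry per label, rather than with the number of columns.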

@jreback jreback closed this as completed May 2, 2017
@jreback jreback added the Sparse Sparse Data Type label May 2, 2017
@jreback jreback added this to the won't fix milestone May 2, 2017
@jreback jreback added the Performance Memory or execution speed performance label May 2, 2017
@cfrancois7
Author

cfrancois7 commented May 2, 2017

Thank you.
Yes, I have already tried importing a COO sparse matrix, without success.
Indeed, it makes sense that my approach didn't work, because pandas sparse is row-based.
I'm changing my approach to use SparseSeries.
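
The thread does not show what the SparseSeries version looked like. As a hedged illustration of the general idea (one long Series holding only the non-zero cells, indexed by row and column labels together), here is a minimal sketch. It uses a plain `pd.Series` rather than `SparseSeries`, since `SparseSeries` has since been removed from pandas; the level names, labels, values and the `cell` helper are all hypothetical.

```python
import pandas as pd

# Hypothetical data: two label levels per axis here, four in the issue.
records = [
    # (row level 1, row level 2, col level 1, col level 2, value)
    ("A", "x", "B", "x", 1.5),
    ("A", "y", "A", "y", 2.0),
    ("B", "x", "A", "x", 0.5),
]
names = ["row_l1", "row_l2", "col_l1", "col_l2"]
frame = pd.DataFrame(records, columns=names + ["value"])

# One entry per stored cell; row and column labels share a single MultiIndex.
s = frame.set_index(names)["value"].sort_index()

def cell(series, key, default=0.0):
    """Return the stored value for a full (row..., col...) key, else the default."""
    return series.get(key, default)

print(cell(s, ("A", "x", "B", "x")))  # stored cell
print(cell(s, ("B", "y", "B", "y")))  # absent cell, treated as zero
print(s.loc[("A", "x")])              # all stored cells in row ("A", "x")
```

In this layout nothing is allocated per column, so the cost depends on the number of stored cells only.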

TomAugspurger pushed a commit to TomAugspurger/pandas that referenced this issue May 29, 2017
TomAugspurger pushed a commit that referenced this issue May 30, 2017
@TomAugspurger TomAugspurger modified the milestones: won't fix, No action Jul 6, 2018