Skip to content

Pandas wrongly read csr_matrix with explicit 0 #28992

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
PrimozGodec opened this issue Oct 15, 2019 · 7 comments
Closed

Pandas wrongly read csr_matrix with explicit 0 #28992

PrimozGodec opened this issue Oct 15, 2019 · 7 comments
Assignees
Labels
good first issue Needs Tests Unit test(s) needed to prevent regressions

Comments

@PrimozGodec
Copy link

PrimozGodec commented Oct 15, 2019

Code Sample, a copy-pastable example if possible

>>> a = csr_matrix([[1, 0, 0], [2, 2, 0], [0, 2, 2]])
>>> a[0, 2] = 0
>>> a.todense()
matrix([[1, 0, 0],
        [2, 2, 0],
        [0, 2, 2]], dtype=int64)

>>> pd.DataFrame.sparse.from_spmatrix(a)
   0  1  2
0  1  0  0
1  2  2  0
2  0  2  0

Observe the difference in the bottom left value in the second matrix.

Problem description

When the sparse matrix (in our case the matrix with the explicit zero) is imported in pandas with from_spmatrix function at least one value that should not be zero becomes a zero. See the example below. Both matrices should be the same.

Expected Output

When converting sparse matrix to Pandas DataFrame values of the sparse array should stay the same.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
pandas : 0.25.1
numpy : 1.16.5
pytz : 2019.2
dateutil : 2.8.0
pip : 19.2.3
setuptools : 40.8.0
Cython : None
pytest : 5.2.0
hypothesis : None
sphinx : 2.2.0
blosc : None
feather : None
xlsxwriter : 1.2.1
lxml.etree : 4.4.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.1
IPython : 7.8.0
pandas_datareader: None
bs4 : 4.8.0
bottleneck : 1.2.1
fastparquet : None
gcsfs : None
lxml.etree : 4.4.1
matplotlib : 3.1.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.3.1
sqlalchemy : None
tables : None
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : 1.2.1

@andreasbuhr
Copy link
Contributor

The example code fails at:
a[0, 3] = 0
With an IndexError. Is this a typo?

@PrimozGodec
Copy link
Author

PrimozGodec commented Oct 16, 2019

@andreasbuhr Yep, it is a typo. It should be:
a[0, 2] = 0
I corrected the example above.

@andreasbuhr
Copy link
Contributor

thanks, now I was able to reproduce your problem with current master

@mroeschke
Copy link
Member

This looks to work on master now. Could use a test

In [30]: import scipy as sp

In [31]: >>> a = sp.sparse.csr_matrix([[1, 0, 0], [2, 2, 0], [0, 2, 2]])
    ...: >>> a[0, 2] = 0
/anaconda3/envs/pandas-dev/lib/python3.7/site-packages/scipy/sparse/_index.py:84: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
  self._set_intXint(row, col, x.flat[0])

In [32]: pd.DataFrame.sparse.from_spmatrix(a)
Out[32]:
   0  1  2
0  1  0  0
1  2  2  0
2  0  2  2

In [33]: a.todense()
Out[33]:
matrix([[1, 0, 0],
        [2, 2, 0],
        [0, 2, 2]], dtype=int64)

In [34]: pd.__version__
Out[34]: '1.1.0.dev0+1361.g77a0f19c5'

@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Sparse Sparse Data Type labels Apr 25, 2020
@bakitybacon
Copy link

take

bakitybacon added a commit to bakitybacon/pandas that referenced this issue May 13, 2020
Test based on issue pandas-dev#28992, very similar to a previous test but with exact methods called to ensure specific issue has been addressed
@sebimarkgraf
Copy link

@bakitybacon Are you still working on this?
To me it seems like the test case from #31991 should already cover this scenario.

This issue could be closed as solved then.
If we decide to still add an additonal test, I would propose to add the position of the 0 as parameter to the other test case instead of introducing another test case.

@mroeschke
Copy link
Member

Yes it appears that this issue is covered by unit tests in #31991, closing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Needs Tests Unit test(s) needed to prevent regressions
Projects
None yet
Development

No branches or pull requests

6 participants