Skip to content

BUG: Fix wrong reading sparse matrix #31991

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Apr 16, 2020
Merged

BUG: Fix wrong reading sparse matrix #31991

merged 7 commits into from
Apr 16, 2020

Conversation

m7142yosuke
Copy link
Contributor

@m7142yosuke m7142yosuke commented Feb 15, 2020

@m7142yosuke m7142yosuke changed the title Fix wrong reading sparse matrix BUG: Fix wrong reading sparse matrix Feb 15, 2020
@dwhswenson
Copy link

Previously the comment in #29814 had suggested using data.indices instead of data.nonzero(). However, this PR instead strips any explicit zeros in the sparse input.

One of my use cases involves taking the difference of two sparse matrices, and then passing the result to scipy.sparse and pandas (ideally filling missing with NaN, instead of 0). There's a meaningful difference between "this element has no information from either matrix" (NaN) and "this element had the exact same value in both matrices" (0). It looks to me like this PR would obscure that distinction.

This would also make it hard to round-trip from scipy.sparse, which seems to handle explicit zeros.

Is there a solution that doesn't involve losing the information about explicit zeros?

(This wouldn't be a show-stopper for us; right now export to pandas is basically a convenience for users who want it. We'd just have to add a warning on that function.)

@WillAyd
Copy link
Member

WillAyd commented Mar 14, 2020

@jorisvandenbossche or @TomAugspurger any thoughts here?

@WillAyd WillAyd added the Sparse Sparse Data Type label Mar 14, 2020
@TomAugspurger
Copy link
Contributor

What’s the performance cost of your PR be using indicies? It seems like stripping explicit zeros will require scanning the full data?

@m7142yosuke
Copy link
Contributor Author

@TomAugspurger
Eliminating explicit zeros will require scanning the full data in my PR.
I didn't come up with other way(not scanning the full data). Is there another good way?

@jorisvandenbossche
Copy link
Member

@m7142yosuke as I also mentioned on the issue, there has been a PR that refactored the from_spmatrix dataframe accessor method: #32825

This seems to have fixed the DataFrame bug (it has its own logic now, and does not use SparseArray.from_spmatrix under the hood), but so the SparseArray is still buggy (what you are fixing here).
But so, it's probably worth to take a look at how that PR changed the implementation, and do something similar here (or try to see if some of the code can be shared).

Also adding a test case for the original case from #29814 would be welcome

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you update to the comments on the PR, e.g. @jorisvandenbossche mentioned an explicit test & add a whatsnew note in 1.1., sparse bug fix section. pls merge master and ping on green.

@m7142yosuke m7142yosuke requested a review from jreback April 11, 2020 03:35
@m7142yosuke
Copy link
Contributor Author

@jorisvandenbossche @jreback
Please review this PR

Done.

@TomAugspurger TomAugspurger added this to the 1.1 milestone Apr 16, 2020
@jreback jreback merged commit 91dcc3a into pandas-dev:master Apr 16, 2020
@jreback
Copy link
Contributor

jreback commented Apr 16, 2020

thanks @m7142yosuke

@m7142yosuke m7142yosuke deleted the fix-wrong-reading-sparse-matrix branch April 16, 2020 22:16
CloseChoice pushed a commit to CloseChoice/pandas that referenced this pull request Apr 20, 2020
rhshadrach pushed a commit to rhshadrach/pandas that referenced this pull request May 10, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Sparse Sparse Data Type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pandas wrongly read Scipy sparse matrix
6 participants