-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: DataFrame.sparse.from_spmatrix
hard codes an invalid fill_value
for certain subtypes
#59064
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: DataFrame.sparse.from_spmatrix
hard codes an invalid fill_value
for certain subtypes
#59064
Conversation
@@ -313,7 +331,7 @@ def from_spmatrix(cls, data, index=None, columns=None) -> DataFrame: | |||
indices = data.indices | |||
indptr = data.indptr | |||
array_data = data.data | |||
dtype = SparseDtype(array_data.dtype, 0) | |||
dtype = SparseDtype(array_data.dtype, fill_value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you use na_value_for_dtype
instead of introducing a new argument?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Matthew. Thank you for your speedy response and taking the time to review my PR.
na_value_for_dtype
is indirectly called in this implementation with the default argument of fill_value
as None
when passed to the SparseDtype
constructor.
pandas/pandas/core/dtypes/dtypes.py
Lines 1746 to 1747 in a5e812d
if fill_value is None: | |
fill_value = na_value_for_dtype(dtype) |
A fill_value
parameter is typically only needed when constructing a sparse format object from a dense format object, as we need to identify non-zero elements in the data to set attributes like the data, index, and index pointer (for sparse array formats like BSR, CSR, and CSC) in the sparse format object correctly. In this case, we are simply accessing these attributes from a CSC matrix, so you are right in that a fill_value
parameter is not strictly required to solve the bug.
However, in addition to fixing the bug and not adding overhead, it is handy to have a fill_value
parameter to give the user flexibility in certain use cases, for example, converting a SparseArray
to a np.ndarray
using np.asarray
will use fill_value
to populate the missing elements. This is along the lines of how, albeit not a sparse implementation, np.ma.core.MaskedArray
has a filled
method that performs the same functionality of using a custom fill_value
to convert to a np.ndarray
.
Would you be OK with me keeping it?
The current test failures also seem unrelated to the PR and have been around since #59027.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The core maintainers have been cautious about expanding the APIs surface areas and signatures unless necessary or for consistency issues. I would prefer if this bug was solved without adding a new keyword argument first, and then you could open a new issue about adding a new keyword here that has opt-in from more core team memebers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand and yes, the cautious approach makes total sense given the size of pandas.
The issues that I see arising as a result of removing the fill_value
parameter and implementing your change are:
DataFrame.sparse.from_spmatrix().sparse.to_coo
, shown as an example in the user guide, will break forfloat
andcomplex
subtypes as it will raise aValueError
. It seems like the original thought of adding this was because a customfill_value
will be lost upon converting to a COO matrix asscipy.sparse._coo.coo_matrix
and other sparse formats do not have an analogous attribute and instead useFalse
,0
,0.
, and0. + 0.j
when returning a dense representation of them as anp.ndarray
ornp.matrix
, which was considered unexpected behaviour at the time. 🤷 We can remove this check as the constructor called into_coo
is using the ijv format and not directly instantiating using a single 2-Dnp.ndarray
, so the COO matrix returned will be correct regardless of thefill_value
.pandas/doc/source/user_guide/sparse.rst
Lines 191 to 200 in b1e5f06
sdf = pd.DataFrame.sparse.from_spmatrix(sp_arr) sdf.head() sdf.dtypes All sparse formats are supported, but matrices that are not in :mod:`COOrdinate <scipy.sparse>` format will be converted, copying data as needed. To convert back to sparse SciPy matrix in COO format, you can use the :meth:`DataFrame.sparse.to_coo` method: .. ipython:: python sdf.sparse.to_coo()
pandas/pandas/core/arrays/sparse/accessor.py
Lines 396 to 397 in b1e5f06
if sp_arr.fill_value != 0: raise ValueError("fill value must be 0 when converting to COO matrix") - All of the tests that rely on a
DataFrame.sparse.from_spmatrix
invocation, apart from the one that I added to test the changes,test_from_spmatrix_fill_value
, assume afill_value
of0
. The expectations will all have to be changed and will subsequently have a dependency onna_value_for_dtype
for thefill_value
to use, and the docstring example will have to change.
Would you like me to make the changes above, close the PR, or keep the fill_value
parameter as it is currently implemented in my branch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the changes above would be appropriate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent. It should be rather straightforward and quick in that case! I will make the changes now.
Co-authored-by: Matthew Roeschke <[email protected]>
Thanks @christopher-titchen |
This seems to be responsible for a breaking change in a workflow of mine. We consume the output of a sklearn OneHotEncoder, which is sparse with float type, and instantiate a sparse pandas frame from it. That used to produce values of 1.0 and 0.0, and now produces instead 1.0 and np.nan. It doesn't look like the sparse instantiation allows the fill_value; is there another easy way we can adjust to the new behavior? (Casting to integers would be fine for this particular case, although our code is more generic than just OneHotEncoder results, so I'm not positive that's generalizable.) |
@christopher-titchen thanks. Filling with zero (after) is a reasonable alternative to casting to integers (before), since apparently pandas is clever enough to change the fill_value. But the broader-setting problem (that my last sentence before hinted at) persists; I should've made that the scope to start with. I'll make a new issue with that in mind. |
DataFrame.sparse.from_spmatrix
hard codes an invalidfill_value
for certain subtypes #59063.doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.