BUG: `DataFrame.sparse.from_spmatrix` hard codes an invalid `fill_value` for certain subtypes #59064

christophertitchen · 2024-06-21T10:41:01Z

Closes BUG: DataFrame.sparse.from_spmatrix hard codes an invalid fill_value for certain subtypes #59063.
Tests added and passed if fixing a bug or adding a new feature.
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

mroeschke · 2024-06-21T17:48:29Z

pandas/core/arrays/sparse/accessor.py

@@ -313,7 +331,7 @@ def from_spmatrix(cls, data, index=None, columns=None) -> DataFrame:
        indices = data.indices
        indptr = data.indptr
        array_data = data.data
-        dtype = SparseDtype(array_data.dtype, 0)
+        dtype = SparseDtype(array_data.dtype, fill_value)


Can you use na_value_for_dtype instead of introducing a new argument?

Hi Matthew. Thank you for your speedy response and taking the time to review my PR.

na_value_for_dtype is indirectly called in this implementation with the default argument of fill_value as None when passed to the SparseDtype constructor.

pandas/pandas/core/dtypes/dtypes.py

Lines 1746 to 1747 in a5e812d

if fill_value is None:

fill_value = na_value_for_dtype(dtype)

A fill_value parameter is typically only needed when constructing a sparse format object from a dense format object, as we need to identify non-zero elements in the data to set attributes like the data, index, and index pointer (for sparse array formats like BSR, CSR, and CSC) in the sparse format object correctly. In this case, we are simply accessing these attributes from a CSC matrix, so you are right in that a fill_value parameter is not strictly required to solve the bug.

However, in addition to fixing the bug and not adding overhead, it is handy to have a fill_value parameter to give the user flexibility in certain use cases, for example, converting a SparseArray to a np.ndarray using np.asarray will use fill_value to populate the missing elements. This is along the lines of how, albeit not a sparse implementation, np.ma.core.MaskedArray has a filled method that performs the same functionality of using a custom fill_value to convert to a np.ndarray.

Would you be OK with me keeping it?

The current test failures also seem unrelated to the PR and have been around since #59027.

The core maintainers have been cautious about expanding the APIs surface areas and signatures unless necessary or for consistency issues. I would prefer if this bug was solved without adding a new keyword argument first, and then you could open a new issue about adding a new keyword here that has opt-in from more core team memebers

I understand and yes, the cautious approach makes total sense given the size of pandas.

The issues that I see arising as a result of removing the fill_value parameter and implementing your change are:

DataFrame.sparse.from_spmatrix().sparse.to_coo, shown as an example in the user guide, will break for float and complex subtypes as it will raise a ValueError. It seems like the original thought of adding this was because a custom fill_value will be lost upon converting to a COO matrix as scipy.sparse._coo.coo_matrix and other sparse formats do not have an analogous attribute and instead use False, 0, 0., and 0. + 0.j when returning a dense representation of them as a np.ndarray or np.matrix, which was considered unexpected behaviour at the time. 🤷 We can remove this check as the constructor called in to_coo is using the ijv format and not directly instantiating using a single 2-D np.ndarray, so the COO matrix returned will be correct regardless of the fill_value.

pandas/doc/source/user_guide/sparse.rst

Lines 191 to 200 in b1e5f06

sdf = pd.DataFrame.sparse.from_spmatrix(sp_arr)

sdf.head()

sdf.dtypes

All sparse formats are supported, but matrices that are not in :mod:`COOrdinate <scipy.sparse>` format will be converted, copying data as needed.

To convert back to sparse SciPy matrix in COO format, you can use the :meth:`DataFrame.sparse.to_coo` method:

.. ipython:: python

sdf.sparse.to_coo()

pandas/pandas/core/arrays/sparse/accessor.py

Lines 396 to 397 in b1e5f06

if sp_arr.fill_value != 0:

raise ValueError("fill value must be 0 when converting to COO matrix")

All of the tests that rely on a DataFrame.sparse.from_spmatrix invocation, apart from the one that I added to test the changes, test_from_spmatrix_fill_value, assume a fill_value of 0. The expectations will all have to be changed and will subsequently have a dependency on na_value_for_dtype for the fill_value to use, and the docstring example will have to change.

Would you like me to make the changes above, close the PR, or keep the fill_value parameter as it is currently implemented in my branch?

I think the changes above would be appropriate

Excellent. It should be rather straightforward and quick in that case! I will make the changes now.

pandas/core/dtypes/dtypes.py

Co-authored-by: Matthew Roeschke <[email protected]>

mroeschke · 2024-06-27T15:13:54Z

Thanks @christopher-titchen

bmreiniger · 2024-07-02T18:54:58Z

This seems to be responsible for a breaking change in a workflow of mine. We consume the output of a sklearn OneHotEncoder, which is sparse with float type, and instantiate a sparse pandas frame from it. That used to produce values of 1.0 and 0.0, and now produces instead 1.0 and np.nan.

It doesn't look like the sparse instantiation allows the fill_value; is there another easy way we can adjust to the new behavior? (Casting to integers would be fine for this particular case, although our code is more generic than just OneHotEncoder results, so I'm not positive that's generalizable.)

christophertitchen · 2024-07-04T10:30:22Z

This seems to be responsible for a breaking change in a workflow of mine. We consume the output of a sklearn OneHotEncoder, which is sparse with float type, and instantiate a sparse pandas frame from it. That used to produce values of 1.0 and 0.0, and now produces instead 1.0 and np.nan.

It doesn't look like the sparse instantiation allows the fill_value; is there another easy way we can adjust to the new behavior? (Casting to integers would be fine for this particular case, although our code is more generic than just OneHotEncoder results, so I'm not positive that's generalizable.)

Hi Ben. I have also used a similar workflow in the past, so I totally understand where you are coming from. It can be convenient to store the sparse output from a OneHotEncoder in a DataFrame if you want row and column labels with reasonably fast slicing and do not need to perform arithmetic operations and matrix vector products. I originally added a fill_value parameter to from_spmatrix in this PR for use cases like this, but I had to revert the changes, so we will have to raise a separate issue to potentially change the method signature going forward.

However, after a quick look, it seems pretty straightforward to fix this issue in your workflow for now by using DataFrame.fillna(0.) to fill np.nan with 0. and correctly change the fill_value attribute for each SparseDtype column without modifying the index. I added an example below with an identity matrix. I hope this helps! 😄

bmreiniger · 2024-07-04T14:25:11Z

@christopher-titchen thanks. Filling with zero (after) is a reasonable alternative to casting to integers (before), since apparently pandas is clever enough to change the fill_value. But the broader-setting problem (that my last sentence before hinted at) persists; I should've made that the scope to start with. I'll make a new issue with that in mind.

christophertitchen added 9 commits June 20, 2024 19:13

BUG: 🐛 ✨ Add fill_value param to from_spmatrix method.

cd922b6

ENH: ✨ Set explicit fill_value of NaN for complex floats.

9323e43

TST: ✅ Fix failing tests.

212a664

TST: ✅ Add tests for from_spmatrix method.

81b33f5

DOC: 📝 Add what's new entry.

f5f1479

TST: ✅ Fix failing tests for sparse getitem.

57367aa

TST: ✅ Remove test for 256-bit complex float.

e90963a

DOC: 📝 Update example in docstring for from_spmatrix method.

eb222a6

DOC: 📝 Update some docstrings and sparse user guide.

221c4aa

mroeschke reviewed Jun 21, 2024

View reviewed changes

pandas/core/dtypes/dtypes.py Outdated Show resolved Hide resolved

mroeschke added the Sparse Sparse Data Type label Jun 21, 2024

christophertitchen and others added 6 commits June 22, 2024 13:43

DOC: ✏️ Update dtype docstring.

0d07c30

Co-authored-by: Matthew Roeschke <[email protected]>

BUG: ⏪ 🐛 Revert fill_value change and fix to_coo method.

ccba29e

TST: ⏪ ✅ Fix and add sparse accessor tests.

d09171e

TST: ⏪ ✅ Fix and add sparse getitem tests.

b813453

DOC: ⏪ 📝 Revert fill_value change to sparse user guide.

499db2f

CLN: ✏️ Fix instantiation of np.ma.array in test.

9eb3dac

mroeschke approved these changes Jun 27, 2024

View reviewed changes

mroeschke added this to the 3.0 milestone Jun 27, 2024

mroeschke merged commit 195401f into pandas-dev:main Jun 27, 2024
47 checks passed

bmreiniger mentioned this pull request Jul 8, 2024

BUG: regression in master for DataFrame.sparse.from_spmatrix #59212

Open

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

BUG: `DataFrame.sparse.from_spmatrix` hard codes an invalid `fill_value` for certain subtypes #59064

BUG: `DataFrame.sparse.from_spmatrix` hard codes an invalid `fill_value` for certain subtypes #59064

Uh oh!

christophertitchen commented Jun 21, 2024 •

edited

Loading

Uh oh!

mroeschke Jun 21, 2024

Uh oh!

christophertitchen Jun 22, 2024

Uh oh!

mroeschke Jun 24, 2024

Uh oh!

christophertitchen Jun 24, 2024

Uh oh!

mroeschke Jun 25, 2024

Uh oh!

christophertitchen Jun 27, 2024

Uh oh!

Uh oh!

Uh oh!

mroeschke commented Jun 27, 2024

Uh oh!

bmreiniger commented Jul 2, 2024

Uh oh!

christophertitchen commented Jul 4, 2024

Uh oh!

bmreiniger commented Jul 4, 2024

Uh oh!

Uh oh!

	if fill_value is None:
	fill_value = na_value_for_dtype(dtype)

	sdf = pd.DataFrame.sparse.from_spmatrix(sp_arr)
	sdf.head()
	sdf.dtypes

	All sparse formats are supported, but matrices that are not in :mod:`COOrdinate <scipy.sparse>` format will be converted, copying data as needed.
	To convert back to sparse SciPy matrix in COO format, you can use the :meth:`DataFrame.sparse.to_coo` method:

	.. ipython:: python

	sdf.sparse.to_coo()

	if sp_arr.fill_value != 0:
	raise ValueError("fill value must be 0 when converting to COO matrix")

Uh oh!

BUG: DataFrame.sparse.from_spmatrix hard codes an invalid fill_value for certain subtypes #59064

BUG: DataFrame.sparse.from_spmatrix hard codes an invalid fill_value for certain subtypes #59064

Uh oh!

Conversation

christophertitchen commented Jun 21, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mroeschke Jun 21, 2024

Choose a reason for hiding this comment

Uh oh!

christophertitchen Jun 22, 2024

Choose a reason for hiding this comment

Uh oh!

mroeschke Jun 24, 2024

Choose a reason for hiding this comment

Uh oh!

christophertitchen Jun 24, 2024

Choose a reason for hiding this comment

Uh oh!

mroeschke Jun 25, 2024

Choose a reason for hiding this comment

Uh oh!

christophertitchen Jun 27, 2024

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

mroeschke commented Jun 27, 2024

Uh oh!

bmreiniger commented Jul 2, 2024

Uh oh!

christophertitchen commented Jul 4, 2024

Uh oh!

bmreiniger commented Jul 4, 2024

Uh oh!

Uh oh!

BUG: `DataFrame.sparse.from_spmatrix` hard codes an invalid `fill_value` for certain subtypes #59064

BUG: `DataFrame.sparse.from_spmatrix` hard codes an invalid `fill_value` for certain subtypes #59064

christophertitchen commented Jun 21, 2024 •

edited

Loading