Skip to content

BUG: DataFrame.sparse.from_spmatrix hard codes an invalid fill_value for certain subtypes #59064

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

christophertitchen
Copy link
Contributor

@christophertitchen christophertitchen commented Jun 21, 2024

@@ -313,7 +331,7 @@ def from_spmatrix(cls, data, index=None, columns=None) -> DataFrame:
indices = data.indices
indptr = data.indptr
array_data = data.data
dtype = SparseDtype(array_data.dtype, 0)
dtype = SparseDtype(array_data.dtype, fill_value)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you use na_value_for_dtype instead of introducing a new argument?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Matthew. Thank you for your speedy response and taking the time to review my PR.

na_value_for_dtype is indirectly called in this implementation with the default argument of fill_value as None when passed to the SparseDtype constructor.

if fill_value is None:
fill_value = na_value_for_dtype(dtype)

A fill_value parameter is typically only needed when constructing a sparse format object from a dense format object, as we need to identify non-zero elements in the data to set attributes like the data, index, and index pointer (for sparse array formats like BSR, CSR, and CSC) in the sparse format object correctly. In this case, we are simply accessing these attributes from a CSC matrix, so you are right in that a fill_value parameter is not strictly required to solve the bug.

However, in addition to fixing the bug and not adding overhead, it is handy to have a fill_value parameter to give the user flexibility in certain use cases, for example, converting a SparseArray to a np.ndarray using np.asarray will use fill_value to populate the missing elements. This is along the lines of how, albeit not a sparse implementation, np.ma.core.MaskedArray has a filled method that performs the same functionality of using a custom fill_value to convert to a np.ndarray.

Would you be OK with me keeping it?

The current test failures also seem unrelated to the PR and have been around since #59027.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The core maintainers have been cautious about expanding the APIs surface areas and signatures unless necessary or for consistency issues. I would prefer if this bug was solved without adding a new keyword argument first, and then you could open a new issue about adding a new keyword here that has opt-in from more core team memebers

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand and yes, the cautious approach makes total sense given the size of pandas.

The issues that I see arising as a result of removing the fill_value parameter and implementing your change are:

  • DataFrame.sparse.from_spmatrix().sparse.to_coo, shown as an example in the user guide, will break for float and complex subtypes as it will raise a ValueError. It seems like the original thought of adding this was because a custom fill_value will be lost upon converting to a COO matrix as scipy.sparse._coo.coo_matrix and other sparse formats do not have an analogous attribute and instead use False, 0, 0., and 0. + 0.j when returning a dense representation of them as a np.ndarray or np.matrix, which was considered unexpected behaviour at the time. 🤷 We can remove this check as the constructor called in to_coo is using the ijv format and not directly instantiating using a single 2-D np.ndarray, so the COO matrix returned will be correct regardless of the fill_value.
    sdf = pd.DataFrame.sparse.from_spmatrix(sp_arr)
    sdf.head()
    sdf.dtypes
    All sparse formats are supported, but matrices that are not in :mod:`COOrdinate <scipy.sparse>` format will be converted, copying data as needed.
    To convert back to sparse SciPy matrix in COO format, you can use the :meth:`DataFrame.sparse.to_coo` method:
    .. ipython:: python
    sdf.sparse.to_coo()

    if sp_arr.fill_value != 0:
    raise ValueError("fill value must be 0 when converting to COO matrix")
  • All of the tests that rely on a DataFrame.sparse.from_spmatrix invocation, apart from the one that I added to test the changes, test_from_spmatrix_fill_value, assume a fill_value of 0. The expectations will all have to be changed and will subsequently have a dependency on na_value_for_dtype for the fill_value to use, and the docstring example will have to change.

Would you like me to make the changes above, close the PR, or keep the fill_value parameter as it is currently implemented in my branch?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the changes above would be appropriate

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent. It should be rather straightforward and quick in that case! I will make the changes now.

@mroeschke mroeschke added the Sparse Sparse Data Type label Jun 21, 2024
@mroeschke mroeschke added this to the 3.0 milestone Jun 27, 2024
@mroeschke mroeschke merged commit 195401f into pandas-dev:main Jun 27, 2024
47 checks passed
@mroeschke
Copy link
Member

Thanks @christopher-titchen

@bmreiniger
Copy link

This seems to be responsible for a breaking change in a workflow of mine. We consume the output of a sklearn OneHotEncoder, which is sparse with float type, and instantiate a sparse pandas frame from it. That used to produce values of 1.0 and 0.0, and now produces instead 1.0 and np.nan.

It doesn't look like the sparse instantiation allows the fill_value; is there another easy way we can adjust to the new behavior? (Casting to integers would be fine for this particular case, although our code is more generic than just OneHotEncoder results, so I'm not positive that's generalizable.)

@christophertitchen
Copy link
Contributor Author

This seems to be responsible for a breaking change in a workflow of mine. We consume the output of a sklearn OneHotEncoder, which is sparse with float type, and instantiate a sparse pandas frame from it. That used to produce values of 1.0 and 0.0, and now produces instead 1.0 and np.nan.

It doesn't look like the sparse instantiation allows the fill_value; is there another easy way we can adjust to the new behavior? (Casting to integers would be fine for this particular case, although our code is more generic than just OneHotEncoder results, so I'm not positive that's generalizable.)

Hi Ben. I have also used a similar workflow in the past, so I totally understand where you are coming from. It can be convenient to store the sparse output from a OneHotEncoder in a DataFrame if you want row and column labels with reasonably fast slicing and do not need to perform arithmetic operations and matrix vector products. I originally added a fill_value parameter to from_spmatrix in this PR for use cases like this, but I had to revert the changes, so we will have to raise a separate issue to potentially change the method signature going forward.

However, after a quick look, it seems pretty straightforward to fix this issue in your workflow for now by using DataFrame.fillna(0.) to fill np.nan with 0. and correctly change the fill_value attribute for each SparseDtype column without modifying the index. I added an example below with an identity matrix. I hope this helps! 😄

image

@bmreiniger
Copy link

@christopher-titchen thanks. Filling with zero (after) is a reasonable alternative to casting to integers (before), since apparently pandas is clever enough to change the fill_value. But the broader-setting problem (that my last sentence before hinted at) persists; I should've made that the scope to start with. I'll make a new issue with that in mind.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Sparse Sparse Data Type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: DataFrame.sparse.from_spmatrix hard codes an invalid fill_value for certain subtypes
3 participants