Skip to content

REGR: use dtype.fill_value in ExtensionBlock.fill_value where available #34158

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jun 1, 2020

Conversation

scottgigante
Copy link
Contributor

@scottgigante scottgigante commented May 13, 2020

@scottgigante scottgigante marked this pull request as ready for review May 13, 2020 15:56
@scottgigante
Copy link
Contributor Author

I don't have authority to request reviews, so @TomAugspurger @mroeschke @jorisvandenbossche

@jreback jreback added the Sparse Sparse Data Type label May 14, 2020
@scottgigante scottgigante force-pushed the bugfix/sparse-fill-value branch from a4744ea to 3c96b60 Compare May 14, 2020 15:15
@jreback jreback added the Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate label May 14, 2020
@jreback jreback added this to the 1.1 milestone May 14, 2020
@jorisvandenbossche
Copy link
Member

I am not sure this fix is entirely correct. While it certainly fixed the original buggy case, like:

In [7]: df1 = pd.DataFrame({"A": pd.SparseArray([0, 0, 0]), 'B': [1,2,3]}) 

In [8]: df1.loc[df1['B'] != 2]       
Out[8]: 
   A  B
0  0  1
2  0  3

(before, this gave NaNs in the first column).

But it also seems to introduce a new bug, eg when doing a reindex:

In [9]: df1.reindex([0,1,3]) 
Out[9]: 
   A    B
0  0  1.0
1  0  2.0
3  0  NaN

This should give a NaN and not 0 at row 3, I think?

@scottgigante
Copy link
Contributor Author

So here's the problem as far as I understand: in https://github.com/pandas-dev/pandas/blob/master/pandas/core/internals/blocks.py#L1219 , blk.fill_value is used to mean the value which is used when you take an index that doesn't exist. In sparse, dtype.fill_value is used to mean the value which is used for filling sparse entries. I don't see an easy way of disentangling these.

@jorisvandenbossche
Copy link
Member

The naming is potentially a bit confusing here (fill_value in take is generally pointing to what is called "na_value" elsewhere).
But so the question is: for SparseArray, should we be using the "SparseDtype.fill_value" when doing a reindex, or rather the "SparseDtype.na_value". It's not directly clear to me which of the two is correct / desired.

Using "na_value" feels more correct (that's what reindexing does for all other array types), but using "fill_value" can avoid densifying the sparse array in a reindexing operation.

Looking at SparseArray.take itself:

In [15]: a = pd.arrays.SparseArray([1.0, 0.0, np.nan, 0.0], fill_value=0.0)    

In [16]: a       
Out[16]: 
[1.0, 0.0, nan, 0.0]
Fill: 0.0
IntIndex
Indices: array([0, 2], dtype=int32)

In [18]: a.take([0, 1, -1, -1], allow_fill=True)                                                                                                                                                                   
Out[18]: 
[1.0, 0.0, nan, nan]
Fill: 0.0
IntIndex
Indices: array([0, 2, 3], dtype=int32)

Here it is clearly "na_value" that is used by default.

@scottgigante scottgigante force-pushed the bugfix/sparse-fill-value branch from 3c96b60 to 0e8c204 Compare May 17, 2020 21:39
@scottgigante
Copy link
Contributor Author

I'm fundamentally unsure how to make this happen, if this is the desired functionality:

>>> df1 = pd.DataFrame({"A": pd.SparseArray([0, 0, 0]), 'B': [1,2,3]}) 
>>> df1.reindex([0,1,3]) 
   A    B
0  0  1.0
1  0  2.0
3  NaN  NaN

@scottgigante scottgigante force-pushed the bugfix/sparse-fill-value branch 2 times, most recently from e8d7648 to c161ae9 Compare May 18, 2020 01:24
@scottgigante scottgigante force-pushed the bugfix/sparse-fill-value branch from c161ae9 to 5188e03 Compare May 20, 2020 14:21
@jreback
Copy link
Contributor

jreback commented May 25, 2020

this is fine, can merge on green.

@jreback jreback merged commit a94b13a into pandas-dev:master Jun 1, 2020
@jreback
Copy link
Contributor

jreback commented Jun 1, 2020

thanks @scottgigante

@jorisvandenbossche
Copy link
Member

@jreback as shown by the discussion above (and @scottgigante's "I'm fundamentally unsure how to make this happen"), this PR was clearly not ready to be merged.

@scottgigante sorry for the slow follow-up. It's however a complicated topic with not directly a clear easy solution (as you experienced yourself), so it requires some time to delve in, which I didn't find yet

@jorisvandenbossche
Copy link
Member

@jreback can you react here? Otherwise I am planning to revert this change. It's not clear to me that this is the right fix, as it possibly causes another regression

@jreback
Copy link
Contributor

jreback commented Jun 3, 2020

@jorisvandenbossche I am unclear what you think is the problem. This closed a few issues w/o causing others. If you want to revert ok by me, but the soln looked fine.

@scottgigante
Copy link
Contributor Author

scottgigante commented Jun 3, 2020

@jreback @jorisvandenbossche it's not clear to me that the (unintended) consequence of this PR is a regression but it's worth discussing: what should be the outcome of this action?

df1 = pd.DataFrame({"A": pd.arrays.SparseArray([1, 2, 0]), 'B': [1,2,3]})
df1.reindex([0,1,3])

Current behaviour:

>>> df1.reindex([0,1,3])
    A    B
0 1.0  1.0
1 2.0  2.0
3 NaN  NaN
>>> df1.reindex([0, 1, 3])['A'].sparse.density
1.0

New behaviour:

>>> df1.reindex([0,1,3])
    A    B
0 1.0  1.0
1 2.0  2.0
3 0.0  NaN
>>> df1.reindex([0, 1, 3])['A'].sparse.density
0.6666666666666666

I don't know that filling a sparse array with non-sparse values when reindexing by a nonexistent index is the right thing to do, but if that is desirable then this is a regression (and I don't know how to fix it).

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Jun 4, 2020

Sparse aside: in principle, reindexing always introduces missing values when new labels are present in the indexer.
(as the non sparse column B does in the example above)

So the question for sparse is: do we follow that rule, or do we deviate by using the "fill value" instead of a "missing value" (na_value).

And as I showed in #34158 (comment), SparseArray.take uses na_value and not fill_value. If we think this behaviour is correct, then reindex should also do that IMO

@TomAugspurger
Copy link
Contributor

Reindex should always introduce missing values. This behavior on master looks incorrect:

In [6]: pd.Series([0, 1, 2], dtype=pd.SparseDtype(int)).reindex([0, 1, 3])
Out[6]:
0    0.0
1    1.0
3    NaN
dtype: Sparse[float64, 0]

In [7]: pd.DataFrame({"A": pd.SparseArray([0, 1, 2], dtype=pd.SparseDtype(int))}).reindex([0, 1, 3])
/Users/taugspurger/.virtualenvs/pandas-dev/bin/ipython:1: FutureWarning: The pandas.SparseArray class is deprecated and will be removed from pandas in a future version. Use pandas.arrays.SparseArray instead.
  #!/Users/taugspurger/Envs/pandas-dev/bin/python
Out[7]:
   A
0  0
1  1
3  0

In [8]: _.dtypes
Out[8]:
A    Sparse[int64, 0]
dtype: object

Series is correct, but DataFrame is filling with fill_value rather than the na value.

TomAugspurger added a commit to TomAugspurger/pandas that referenced this pull request Jul 15, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Sparse Sparse Data Type
Projects
None yet
4 participants