REGR: use dtype.fill_value in ExtensionBlock.fill_value where available #34158

scottgigante · 2020-05-13T15:50:08Z

closes Filtering dataframe with sparse column leads to NAs in sparse column #27781, closes Sparse DataFrame with all-zeros column returns NA for fancy index #29563
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

scottgigante · 2020-05-14T13:37:30Z

I don't have authority to request reviews, so @TomAugspurger @mroeschke @jorisvandenbossche

doc/source/whatsnew/v1.1.0.rst

pandas/core/internals/blocks.py

pandas/tests/extension/test_sparse.py

jorisvandenbossche · 2020-05-14T18:05:47Z

I am not sure this fix is entirely correct. While it certainly fixed the original buggy case, like:

In [7]: df1 = pd.DataFrame({"A": pd.SparseArray([0, 0, 0]), 'B': [1,2,3]}) 

In [8]: df1.loc[df1['B'] != 2]       
Out[8]: 
   A  B
0  0  1
2  0  3

(before, this gave NaNs in the first column).

But it also seems to introduce a new bug, eg when doing a reindex:

In [9]: df1.reindex([0,1,3]) 
Out[9]: 
   A    B
0  0  1.0
1  0  2.0
3  0  NaN

This should give a NaN and not 0 at row 3, I think?

scottgigante · 2020-05-14T18:13:44Z

So here's the problem as far as I understand: in https://github.com/pandas-dev/pandas/blob/master/pandas/core/internals/blocks.py#L1219 , blk.fill_value is used to mean the value which is used when you take an index that doesn't exist. In sparse, dtype.fill_value is used to mean the value which is used for filling sparse entries. I don't see an easy way of disentangling these.

jorisvandenbossche · 2020-05-15T20:32:48Z

The naming is potentially a bit confusing here (fill_value in take is generally pointing to what is called "na_value" elsewhere).
But so the question is: for SparseArray, should we be using the "SparseDtype.fill_value" when doing a reindex, or rather the "SparseDtype.na_value". It's not directly clear to me which of the two is correct / desired.

Using "na_value" feels more correct (that's what reindexing does for all other array types), but using "fill_value" can avoid densifying the sparse array in a reindexing operation.

Looking at SparseArray.take itself:

In [15]: a = pd.arrays.SparseArray([1.0, 0.0, np.nan, 0.0], fill_value=0.0)    

In [16]: a       
Out[16]: 
[1.0, 0.0, nan, 0.0]
Fill: 0.0
IntIndex
Indices: array([0, 2], dtype=int32)

In [18]: a.take([0, 1, -1, -1], allow_fill=True)                                                                                                                                                                   
Out[18]: 
[1.0, 0.0, nan, nan]
Fill: 0.0
IntIndex
Indices: array([0, 2, 3], dtype=int32)

Here it is clearly "na_value" that is used by default.

pandas/core/internals/blocks.py

doc/source/whatsnew/v1.1.0.rst

pandas/tests/extension/base/getitem.py

scottgigante · 2020-05-17T21:42:09Z

I'm fundamentally unsure how to make this happen, if this is the desired functionality:

>>> df1 = pd.DataFrame({"A": pd.SparseArray([0, 0, 0]), 'B': [1,2,3]}) 
>>> df1.reindex([0,1,3]) 
   A    B
0  0  1.0
1  0  2.0
3  NaN  NaN

…as-dev#29563

jreback · 2020-05-25T22:06:32Z

this is fine, can merge on green.

jreback · 2020-06-01T01:46:42Z

thanks @scottgigante

jorisvandenbossche · 2020-06-01T06:37:22Z

@jreback as shown by the discussion above (and @scottgigante's "I'm fundamentally unsure how to make this happen"), this PR was clearly not ready to be merged.

@scottgigante sorry for the slow follow-up. It's however a complicated topic with not directly a clear easy solution (as you experienced yourself), so it requires some time to delve in, which I didn't find yet

jorisvandenbossche · 2020-06-03T20:51:41Z

@jreback can you react here? Otherwise I am planning to revert this change. It's not clear to me that this is the right fix, as it possibly causes another regression

jreback · 2020-06-03T21:27:54Z

@jorisvandenbossche I am unclear what you think is the problem. This closed a few issues w/o causing others. If you want to revert ok by me, but the soln looked fine.

scottgigante · 2020-06-03T22:15:27Z

@jreback @jorisvandenbossche it's not clear to me that the (unintended) consequence of this PR is a regression but it's worth discussing: what should be the outcome of this action?

df1 = pd.DataFrame({"A": pd.arrays.SparseArray([1, 2, 0]), 'B': [1,2,3]})
df1.reindex([0,1,3])

Current behaviour:

>>> df1.reindex([0,1,3])
    A    B
0 1.0  1.0
1 2.0  2.0
3 NaN  NaN
>>> df1.reindex([0, 1, 3])['A'].sparse.density
1.0

New behaviour:

>>> df1.reindex([0,1,3])
    A    B
0 1.0  1.0
1 2.0  2.0
3 0.0  NaN
>>> df1.reindex([0, 1, 3])['A'].sparse.density
0.6666666666666666

I don't know that filling a sparse array with non-sparse values when reindexing by a nonexistent index is the right thing to do, but if that is desirable then this is a regression (and I don't know how to fix it).

jorisvandenbossche · 2020-06-04T05:47:16Z

Sparse aside: in principle, reindexing always introduces missing values when new labels are present in the indexer.
(as the non sparse column B does in the example above)

So the question for sparse is: do we follow that rule, or do we deviate by using the "fill value" instead of a "missing value" (na_value).

And as I showed in #34158 (comment), SparseArray.take uses na_value and not fill_value. If we think this behaviour is correct, then reindex should also do that IMO

TomAugspurger · 2020-06-04T11:54:28Z

Reindex should always introduce missing values. This behavior on master looks incorrect:

In [6]: pd.Series([0, 1, 2], dtype=pd.SparseDtype(int)).reindex([0, 1, 3])
Out[6]:
0    0.0
1    1.0
3    NaN
dtype: Sparse[float64, 0]

In [7]: pd.DataFrame({"A": pd.SparseArray([0, 1, 2], dtype=pd.SparseDtype(int))}).reindex([0, 1, 3])
/Users/taugspurger/.virtualenvs/pandas-dev/bin/ipython:1: FutureWarning: The pandas.SparseArray class is deprecated and will be removed from pandas in a future version. Use pandas.arrays.SparseArray instead.
  #!/Users/taugspurger/Envs/pandas-dev/bin/python
Out[7]:
   A
0  0
1  1
3  0

In [8]: _.dtypes
Out[8]:
A    Sparse[int64, 0]
dtype: object

Series is correct, but DataFrame is filling with fill_value rather than the na value.

pandas-dev#29563 (pandas-dev#34158)" This reverts commit a94b13a.

scottgigante marked this pull request as ready for review May 13, 2020 15:56

jreback added the Sparse Sparse Data Type label May 14, 2020

jreback requested changes May 14, 2020

View reviewed changes

doc/source/whatsnew/v1.1.0.rst Outdated Show resolved Hide resolved

pandas/core/internals/blocks.py Outdated Show resolved Hide resolved

pandas/tests/extension/test_sparse.py Outdated Show resolved Hide resolved

scottgigante force-pushed the bugfix/sparse-fill-value branch from a4744ea to 3c96b60 Compare May 14, 2020 15:15

jreback added the Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate label May 14, 2020

jreback added this to the 1.1 milestone May 14, 2020

jreback requested changes May 17, 2020

View reviewed changes

pandas/core/internals/blocks.py Outdated Show resolved Hide resolved

doc/source/whatsnew/v1.1.0.rst Outdated Show resolved Hide resolved

pandas/tests/extension/base/getitem.py Show resolved Hide resolved

scottgigante force-pushed the bugfix/sparse-fill-value branch from 3c96b60 to 0e8c204 Compare May 17, 2020 21:39

scottgigante force-pushed the bugfix/sparse-fill-value branch 2 times, most recently from e8d7648 to c161ae9 Compare May 18, 2020 01:24

patch ExtensionBlock fill_value, closes pandas-dev#27781, closes pand…

5188e03

…as-dev#29563

scottgigante force-pushed the bugfix/sparse-fill-value branch from c161ae9 to 5188e03 Compare May 20, 2020 14:21

Merge branch 'master' into bugfix/sparse-fill-value

0a2611a

jreback approved these changes May 25, 2020

View reviewed changes

jreback merged commit a94b13a into pandas-dev:master Jun 1, 2020

jorisvandenbossche mentioned this pull request Jul 15, 2020

REGR: reindex with sparse data #35286

Closed

TomAugspurger added a commit to TomAugspurger/pandas that referenced this pull request Jul 15, 2020

Revert "patch ExtensionBlock fill_value, closes pandas-dev#27781, closes

58dccd3

pandas-dev#29563 (pandas-dev#34158)" This reverts commit a94b13a.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

REGR: use dtype.fill_value in ExtensionBlock.fill_value where available #34158

REGR: use dtype.fill_value in ExtensionBlock.fill_value where available #34158

scottgigante commented May 13, 2020 •

edited

Loading

scottgigante commented May 14, 2020

jorisvandenbossche commented May 14, 2020

scottgigante commented May 14, 2020

jorisvandenbossche commented May 15, 2020

scottgigante commented May 17, 2020

jreback commented May 25, 2020

jreback commented Jun 1, 2020

jorisvandenbossche commented Jun 1, 2020

jorisvandenbossche commented Jun 3, 2020

jreback commented Jun 3, 2020

scottgigante commented Jun 3, 2020 •

edited

Loading

jorisvandenbossche commented Jun 4, 2020 •

edited

Loading

TomAugspurger commented Jun 4, 2020

REGR: use dtype.fill_value in ExtensionBlock.fill_value where available #34158

REGR: use dtype.fill_value in ExtensionBlock.fill_value where available #34158

Conversation

scottgigante commented May 13, 2020 • edited Loading

scottgigante commented May 14, 2020

jorisvandenbossche commented May 14, 2020

scottgigante commented May 14, 2020

jorisvandenbossche commented May 15, 2020

scottgigante commented May 17, 2020

jreback commented May 25, 2020

jreback commented Jun 1, 2020

jorisvandenbossche commented Jun 1, 2020

jorisvandenbossche commented Jun 3, 2020

jreback commented Jun 3, 2020

scottgigante commented Jun 3, 2020 • edited Loading

jorisvandenbossche commented Jun 4, 2020 • edited Loading

TomAugspurger commented Jun 4, 2020

scottgigante commented May 13, 2020 •

edited

Loading

scottgigante commented Jun 3, 2020 •

edited

Loading

jorisvandenbossche commented Jun 4, 2020 •

edited

Loading