-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
REGR: use dtype.fill_value in ExtensionBlock.fill_value where available #34158
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
REGR: use dtype.fill_value in ExtensionBlock.fill_value where available #34158
Conversation
I don't have authority to request reviews, so @TomAugspurger @mroeschke @jorisvandenbossche |
a4744ea
to
3c96b60
Compare
I am not sure this fix is entirely correct. While it certainly fixed the original buggy case, like:
(before, this gave NaNs in the first column). But it also seems to introduce a new bug, eg when doing a reindex:
This should give a NaN and not 0 at row 3, I think? |
So here's the problem as far as I understand: in https://github.com/pandas-dev/pandas/blob/master/pandas/core/internals/blocks.py#L1219 , |
The naming is potentially a bit confusing here ( Using "na_value" feels more correct (that's what reindexing does for all other array types), but using "fill_value" can avoid densifying the sparse array in a reindexing operation. Looking at
Here it is clearly "na_value" that is used by default. |
3c96b60
to
0e8c204
Compare
I'm fundamentally unsure how to make this happen, if this is the desired functionality:
|
e8d7648
to
c161ae9
Compare
c161ae9
to
5188e03
Compare
this is fine, can merge on green. |
thanks @scottgigante |
@jreback as shown by the discussion above (and @scottgigante's "I'm fundamentally unsure how to make this happen"), this PR was clearly not ready to be merged. @scottgigante sorry for the slow follow-up. It's however a complicated topic with not directly a clear easy solution (as you experienced yourself), so it requires some time to delve in, which I didn't find yet |
@jreback can you react here? Otherwise I am planning to revert this change. It's not clear to me that this is the right fix, as it possibly causes another regression |
@jorisvandenbossche I am unclear what you think is the problem. This closed a few issues w/o causing others. If you want to revert ok by me, but the soln looked fine. |
@jreback @jorisvandenbossche it's not clear to me that the (unintended) consequence of this PR is a regression but it's worth discussing: what should be the outcome of this action?
Current behaviour:
New behaviour:
I don't know that filling a sparse array with non-sparse values when reindexing by a nonexistent index is the right thing to do, but if that is desirable then this is a regression (and I don't know how to fix it). |
Sparse aside: in principle, reindexing always introduces missing values when new labels are present in the indexer. So the question for sparse is: do we follow that rule, or do we deviate by using the "fill value" instead of a "missing value" ( And as I showed in #34158 (comment), |
Reindex should always introduce missing values. This behavior on master looks incorrect: In [6]: pd.Series([0, 1, 2], dtype=pd.SparseDtype(int)).reindex([0, 1, 3])
Out[6]:
0 0.0
1 1.0
3 NaN
dtype: Sparse[float64, 0]
In [7]: pd.DataFrame({"A": pd.SparseArray([0, 1, 2], dtype=pd.SparseDtype(int))}).reindex([0, 1, 3])
/Users/taugspurger/.virtualenvs/pandas-dev/bin/ipython:1: FutureWarning: The pandas.SparseArray class is deprecated and will be removed from pandas in a future version. Use pandas.arrays.SparseArray instead.
#!/Users/taugspurger/Envs/pandas-dev/bin/python
Out[7]:
A
0 0
1 1
3 0
In [8]: _.dtypes
Out[8]:
A Sparse[int64, 0]
dtype: object Series is correct, but DataFrame is filling with |
pandas-dev#29563 (pandas-dev#34158)" This reverts commit a94b13a.
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff