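Code Sample, a copy-pastable example if possible

The snippet below is a minimal sketch that reproduces the output that follows; it is a reconstruction, not necessarily the reporter's exact code. Bsp and Csp are assumed to be pd.SparseArray copies of B and C with fill value 0.

import pandas as pd

# Reconstructed sketch: Bsp/Csp hold the same data as B/C, stored sparsely.
x = {"A": [1, 2, 3, 4, 5],
     "B": [0, 1, 0, 1, 0],
     "C": [0, 0, 0, 0, 0]}
xpd = pd.DataFrame(x)
xpd["Bsp"] = pd.SparseArray(xpd.B, fill_value=0)
xpd["Csp"] = pd.SparseArray(xpd.C, fill_value=0)
print(xpd)
print("=== all rows")
print(xpd.loc[[0, 1, 2, 3, 4]])
print("=== even indices")
print(xpd.loc[[0, 2, 4]])
print("=== B=0s")
print(xpd.loc[xpd.B == 0])
print("=== B=1s")
print(xpd.loc[xpd.B == 1])
print("=== C=0s")
print(xpd.loc[xpd.C == 0])
print("=== C=1s")
print(xpd.loc[xpd.C == 1])

Output: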
   A  B  C  Bsp  Csp
0  1  0  0    0    0
1  2  1  0    1    0
2  3  0  0    0    0
3  4  1  0    1    0
4  5  0  0    0    0
=== all rows
   A  B  C  Bsp  Csp
0  1  0  0    0  NaN
1  2  1  0    1  NaN
2  3  0  0    0  NaN
3  4  1  0    1  NaN
4  5  0  0    0  NaN
=== even indices
   A  B  C  Bsp  Csp
0  1  0  0    0  NaN
2  3  0  0    0  NaN
4  5  0  0    0  NaN
=== B=0s
/home/daschm/.local/lib/python3.7/site-packages/pandas/core/indexing.py:2418: FutureWarning: DataFrame/Series.to_dense is deprecated and will be removed in a future version
  result = result.to_dense()
   A  B  C  Bsp  Csp
0  1  0  0    0  NaN
2  3  0  0    0  NaN
4  5  0  0    0  NaN
=== B=1s
   A  B  C  Bsp  Csp
1  2  1  0    1  NaN
3  4  1  0    1  NaN
=== C=0s
   A  B  C  Bsp  Csp
0  1  0  0    0  NaN
1  2  1  0    1  NaN
2  3  0  0    0  NaN
3  4  1  0    1  NaN
4  5  0  0    0  NaN
=== C=1s
Empty DataFrame
Columns: [A, B, C, Bsp, Csp]
Index: []
Problem description
I would not expect the values in the Csp column to change when I take a subset of rows. Note that it does not seem to matter how I choose the subset of rows: as soon as I use .loc or [(condition)], my all-zero column turns into an all-NaN column.
The example above may look artificial, but I ended up in this situation by performing one-hot encoding via get_dummies() with sparse=True, and I ran into problems as soon as I coincidentally chose a subset of my data for which at least one dummy column contains only zeros.
Small "real life" example:
import pandas as pd

x = {"A": [1, 2, 3, 4, 5],
     "B": [0, 1, 2, 1, 0]}
xpd = pd.DataFrame(x)

# One-hot-encode B into sparse dummy columns and append them.
dummies_b = pd.get_dummies(xpd.B, prefix="B=", sparse=True)
xpd_ext = pd.concat([xpd, dummies_b], axis=1)
print(xpd_ext)
print(xpd_ext.dtypes)

print("=== even indices + dtypes")
xpd_ext_even = xpd_ext.loc[[0, 2, 4]]
print(xpd_ext_even)
print(xpd_ext_even.dtypes)

print("=== even indices and B=2 + dtypes")
# Within the row subset, B=_1 is all zeros; this is where the NaNs appear.
print(xpd_ext_even[xpd_ext_even.B == 2])
print(xpd_ext_even[xpd_ext_even.B == 2].dtypes)
Output:
   A  B  B=_0  B=_1  B=_2
0  1  0     1     0     0
1  2  1     0     1     0
2  3  2     0     0     1
3  4  1     0     1     0
4  5  0     1     0     0
A                  int64
B                  int64
B=_0    Sparse[uint8, 0]
B=_1    Sparse[uint8, 0]
B=_2    Sparse[uint8, 0]
dtype: object
=== even indices + dtypes
   A  B  B=_0  B=_1  B=_2
0  1  0     1     0     0
2  3  2     0     0     1
4  5  0     1     0     0
A                  int64
B                  int64
B=_0    Sparse[int64, 0]
B=_1    Sparse[int64, 0]
B=_2    Sparse[int64, 0]
dtype: object
=== even indices and B=2 + dtypes
   A  B  B=_0  B=_1  B=_2
2  3  2     0   NaN     1
A                    int64
B                    int64
B=_0      Sparse[int64, 0]
B=_1    Sparse[float64, 0]
B=_2      Sparse[int64, 0]
dtype: object
Note how NaNs appear after sub-indexing, in exactly those sparse columns that were all zeros beforehand.
I also printed the dtypes here, because I coincidentally noticed that they change for the spurious columns as well (from int64 to float64). This might help in identifying the problem.
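For what it's worth, the dtype change seems traceable to the stored sparse values rather than the fill value: the dtype above is still Sparse[float64, 0], so the NaN must be an explicitly stored value. A small diagnostic sketch, assuming the Series.sparse accessor of pandas 0.25 (fill_value and sp_values are public properties there):

sub = xpd_ext_even[xpd_ext_even.B == 2]
# The fill value of B=_1 is still 0, but a NaN is now stored as an
# explicit sparse value where a plain 0 used to be.
print(sub["B=_1"].sparse.fill_value)
print(sub["B=_1"].sparse.sp_values)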
I haven't found an efficient workaround yet. My code runs if I convert the sparse data to dense early enough, but that is clearly very inefficient. Applying fillna to the potentially spurious columns also seems to be painfully slow.
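One possibly cheaper direction, sketched under the assumption that the NaNs are explicitly stored sparse values while the sparse index and fill value survive (as the dtypes above suggest): rebuild the affected column from its stored values only, without densifying the whole column. fix_sparse_column below is a hypothetical helper, not pandas API:

import numpy as np
import pandas as pd

def fix_sparse_column(s, dtype="int64", fill_value=0):
    # Hypothetical helper: the stored NaNs should have been fill values,
    # so zero them out and re-wrap the existing sparse index. Only the
    # explicitly stored values are touched, never the full dense column.
    arr = s.array  # the underlying SparseArray
    vals = np.nan_to_num(arr.sp_values).astype(dtype)
    fixed = pd.SparseArray(vals, sparse_index=arr.sp_index,
                           fill_value=fill_value)
    return pd.Series(fixed, index=s.index, name=s.name)

sub = xpd_ext_even[xpd_ext_even.B == 2].copy()
sub["B=_1"] = fix_sparse_column(sub["B=_1"])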
Output of pd.show_versions()
commit : None
python : 3.7.5.candidate.1
python-bits : 64
OS : Linux
OS-release : 5.3.0-23-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 0.25.3
numpy : 1.17.2
pytz : 2019.2
dateutil : 2.7.3
pip : 19.3.1
setuptools : 41.2.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.4.1
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10
IPython : 5.8.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.4.1
matplotlib : 3.1.0
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.3.1
sqlalchemy : 1.2.19
tables : None
xarray : 0.14.1
xlrd : None
xlwt : None
xlsxwriter : None