
Indexing fully 0-filled SparseArrays produces Nan-filled SparseArrays #30047


Closed
DrDaDe opened this issue Dec 4, 2019 · 1 comment


DrDaDe commented Dec 4, 2019

Code Sample, a copy-pastable example if possible

import pandas as pd
x = {"A": [1, 2, 3, 4, 5],
     "B": [0, 1, 0, 1, 0],
     "C": [0, 0, 0, 0, 0],
     "Bsp": pd.SparseArray([0, 1, 0, 1, 0], fill_value=0),
     "Csp": pd.SparseArray([0, 0, 0, 0, 0], fill_value=0)
}
xpd = pd.DataFrame(x)

print(xpd)
print("=== all rows")
print(xpd.loc[xpd.index])
print("=== even indices")
print(xpd.loc[[0, 2, 4]])
print("=== B=0s")
print(xpd[xpd.Bsp == 0])
print("=== B=1s")
print(xpd[xpd.Bsp == 1])
print("=== C=0s")
print(xpd[xpd.Csp == 0])
print("=== C=1s")
print(xpd[xpd.Csp == 1])

Output:

   A  B  C  Bsp  Csp
0  1  0  0    0    0
1  2  1  0    1    0
2  3  0  0    0    0
3  4  1  0    1    0
4  5  0  0    0    0
=== all rows
   A  B  C  Bsp  Csp
0  1  0  0    0  NaN
1  2  1  0    1  NaN
2  3  0  0    0  NaN
3  4  1  0    1  NaN
4  5  0  0    0  NaN
=== even indices
   A  B  C  Bsp  Csp
0  1  0  0    0  NaN
2  3  0  0    0  NaN
4  5  0  0    0  NaN
=== B=0s
/home/daschm/.local/lib/python3.7/site-packages/pandas/core/indexing.py:2418: FutureWarning: DataFrame/Series.to_dense is deprecated and will be removed in a future version
  result = result.to_dense()
   A  B  C  Bsp  Csp
0  1  0  0    0  NaN
2  3  0  0    0  NaN
4  5  0  0    0  NaN
=== B=1s
   A  B  C  Bsp  Csp
1  2  1  0    1  NaN
3  4  1  0    1  NaN
=== C=0s
   A  B  C  Bsp  Csp
0  1  0  0    0  NaN
1  2  1  0    1  NaN
2  3  0  0    0  NaN
3  4  1  0    1  NaN
4  5  0  0    0  NaN
=== C=1s
Empty DataFrame
Columns: [A, B, C, Bsp, Csp]
Index: []

Problem description

I would not expect the values in the Csp column to change when I take a subset of rows. It doesn't seem to matter how I choose the subset: as soon as I use .loc or boolean indexing ([condition]), my all-zero column turns into an all-NaN column.

The example above looks artificial, but I ran into this in practice: I one-hot encoded a column via get_dummies() with sparse=True, and the problems appeared as soon as I happened to select a subset of my data in which at least one dummy column was all zeros.

Small "real life" example:

import pandas as pd
x = {"A": [1, 2, 3, 4, 5],
     "B": [0, 1, 2, 1, 0]
}
xpd = pd.DataFrame(x)
dummies_b = pd.get_dummies(xpd.B, prefix="B=", sparse=True)
xpd_ext = pd.concat([xpd, dummies_b], axis=1)

print(xpd_ext)
print(xpd_ext.dtypes)
print("=== even indices + dtypes")
xpd_ext_even = xpd_ext.loc[[0, 2, 4]]
print(xpd_ext_even)
print(xpd_ext_even.dtypes)
print("=== even indices and B=2 + dtypes")
print(xpd_ext_even[xpd_ext_even.B == 2])
print(xpd_ext_even[xpd_ext_even.B == 2].dtypes)

Output:

   A  B  B=_0  B=_1  B=_2
0  1  0     1     0     0
1  2  1     0     1     0
2  3  2     0     0     1
3  4  1     0     1     0
4  5  0     1     0     0
A                  int64
B                  int64
B=_0    Sparse[uint8, 0]
B=_1    Sparse[uint8, 0]
B=_2    Sparse[uint8, 0]
dtype: object
=== even indices + dtypes
   A  B  B=_0  B=_1  B=_2
0  1  0     1     0     0
2  3  2     0     0     1
4  5  0     1     0     0
A                  int64
B                  int64
B=_0    Sparse[int64, 0]
B=_1    Sparse[int64, 0]
B=_2    Sparse[int64, 0]
dtype: object
=== even indices and B=2 + dtypes
   A  B  B=_0  B=_1  B=_2
2  3  2     0   NaN     1
A                    int64
B                    int64
B=_0      Sparse[int64, 0]
B=_1    Sparse[float64, 0]
B=_2      Sparse[int64, 0]

Note that after sub-indexing, NaNs appear in exactly those sparse columns that were all zeros before.
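A check for this symptom could be automated roughly as follows. This is only a sketch: `changed_columns` is a hypothetical helper, not a pandas function. It compares densified values by position so that it does not itself depend on the indexing paths under suspicion, and it assumes the columns contain no genuine NaNs.

```python
import numpy as np
import pandas as pd


def changed_columns(before, after):
    """Return names of columns whose values differ between `before` and a row subset `after`.

    Hypothetical helper. Assumes `after` was obtained from `before` by row
    selection (every label in `after.index` exists in `before.index`), that
    the index is unique, and that the data contains no genuine NaNs.
    """
    # Positions of the selected rows in the original frame.
    positions = before.index.get_indexer(after.index)
    changed = []
    for col in after.columns:
        # np.asarray densifies sparse columns, so plain values are compared.
        expected = np.asarray(before[col])[positions]
        actual = np.asarray(after[col])
        # NaN never equals 0, so a zero column that became NaN is reported.
        if not np.array_equal(expected, actual):
            changed.append(col)
    return changed


df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "Csp": pd.arrays.SparseArray([0, 0, 0, 0, 0], fill_value=0),
})
# On an affected version this would be expected to list "Csp".
print(changed_columns(df, df.loc[[0, 2, 4]]))
```

The helper only reports which columns changed; it does not repair them.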

I also printed the dtypes in the example above because I happened to notice that they change as well for the affected columns (from int64 to float64). This might help in identifying the problem.
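The dtype change could serve as a cheap signal for affected columns. The sketch below uses a hypothetical helper, `int_to_float_columns`, that flags columns whose value type went from integer to float between two frames. It looks at the `subtype` of `pd.SparseDtype` for sparse columns; the two frames in the demo are built by hand to mimic the dtypes reported above and are not real indexing results.

```python
import numpy as np
import pandas as pd


def int_to_float_columns(before, after):
    """Return columns whose value dtype changed from integer to float (hypothetical helper)."""

    def value_dtype(dtype):
        # For sparse columns, inspect the dtype of the stored values.
        if isinstance(dtype, pd.SparseDtype):
            return np.dtype(dtype.subtype)
        return dtype

    flagged = []
    for col in after.columns:
        old = value_dtype(before[col].dtype)
        new = value_dtype(after[col].dtype)
        # "i"/"u" are signed/unsigned integer kinds, "f" is floating point.
        if old.kind in "iu" and new.kind == "f":
            flagged.append(col)
    return flagged


# Hand-built stand-ins for a frame before and after the problematic selection.
before = pd.DataFrame({"B=_1": pd.arrays.SparseArray([0, 0, 0], fill_value=0)})
after = pd.DataFrame({"B=_1": pd.arrays.SparseArray([0.0, 0.0, 0.0], fill_value=0)})
print(int_to_float_columns(before, after))
```

A change from uint8 to int64, as seen after the first selection in the output above, is deliberately not flagged, since it does not by itself indicate NaNs.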

I haven't figured out an efficient workaround yet. My code runs if I convert the sparse data to dense early enough, but that is clearly very inefficient. Applying fillna to the potentially affected columns also seems to be painfully slow.
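One narrower option might be to densify only the sparse columns that currently hold no non-fill values, since those are the ones observed to turn into NaN. The sketch below has not been verified against 0.25.3. `densify_empty_sparse_columns` is a hypothetical helper, and it rests on the assumption that only such "empty" sparse columns are affected.

```python
import pandas as pd


def densify_empty_sparse_columns(df):
    """Return a copy of `df` in which all-fill-value sparse columns are made dense.

    Hypothetical helper; assumes unique column names and that only sparse
    columns without any stored (non-fill) points are affected.
    """
    out = df.copy()
    for col in out.columns:
        series = out[col]
        # npoints is the number of stored, non-fill values in a sparse column.
        if isinstance(series.dtype, pd.SparseDtype) and series.sparse.npoints == 0:
            out[col] = series.sparse.to_dense()
    return out


df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "Bsp": pd.arrays.SparseArray([0, 1, 0, 1, 0], fill_value=0),
    "Csp": pd.arrays.SparseArray([0, 0, 0, 0, 0], fill_value=0),
})
safe = densify_empty_sparse_columns(df)
print(safe.dtypes)
print(safe.loc[[0, 2, 4]])
```

Because a selection can itself produce new all-zero columns (as with B=_1 in the second example), the helper would need to be applied again before each further selection. Columns with at least one non-zero value stay sparse.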

Output of pd.show_versions()

commit : None
python : 3.7.5.candidate.1
python-bits : 64
OS : Linux
OS-release : 5.3.0-23-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 0.25.3
numpy : 1.17.2
pytz : 2019.2
dateutil : 2.7.3
pip : 19.3.1
setuptools : 41.2.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.4.1
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10
IPython : 5.8.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.4.1
matplotlib : 3.1.0
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.3.1
sqlalchemy : 1.2.19
tables : None
xarray : 0.14.1
xlrd : None
xlwt : None
xlsxwriter : None

TomAugspurger (Contributor) commented

At a glance, this looks like #27781. Closing as a duplicate, but let me know if you think it's different.

TomAugspurger added this to the No action milestone Dec 4, 2019