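Code Sample, a copy-pastable example if possible

The snippet below is a minimal sketch that reproduces the output that follows; it is a reconstruction, not necessarily the reporter's exact code. Bsp and Csp are assumed to be pd.SparseArray copies of B and C with fill value 0.

import pandas as pd

# Reconstructed sketch: Bsp/Csp hold the same data as B/C, stored sparsely.
x = {"A": [1, 2, 3, 4, 5],
     "B": [0, 1, 0, 1, 0],
     "C": [0, 0, 0, 0, 0]}
xpd = pd.DataFrame(x)
xpd["Bsp"] = pd.SparseArray(xpd.B, fill_value=0)
xpd["Csp"] = pd.SparseArray(xpd.C, fill_value=0)
print(xpd)
print("=== all rows")
print(xpd.loc[[0, 1, 2, 3, 4]])
print("=== even indices")
print(xpd.loc[[0, 2, 4]])
print("=== B=0s")
print(xpd.loc[xpd.B == 0])
print("=== B=1s")
print(xpd.loc[xpd.B == 1])
print("=== C=0s")
print(xpd.loc[xpd.C == 0])
print("=== C=1s")
print(xpd.loc[xpd.C == 1])

Output: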
   A  B  C  Bsp  Csp
0  1  0  0    0    0
1  2  1  0    1    0
2  3  0  0    0    0
3  4  1  0    1    0
4  5  0  0    0    0
=== all rows
   A  B  C  Bsp  Csp
0  1  0  0    0  NaN
1  2  1  0    1  NaN
2  3  0  0    0  NaN
3  4  1  0    1  NaN
4  5  0  0    0  NaN
=== even indices
   A  B  C  Bsp  Csp
0  1  0  0    0  NaN
2  3  0  0    0  NaN
4  5  0  0    0  NaN
=== B=0s
/home/daschm/.local/lib/python3.7/site-packages/pandas/core/indexing.py:2418: FutureWarning: DataFrame/Series.to_dense is deprecated and will be removed in a future version
  result = result.to_dense()
   A  B  C  Bsp  Csp
0  1  0  0    0  NaN
2  3  0  0    0  NaN
4  5  0  0    0  NaN
=== B=1s
   A  B  C  Bsp  Csp
1  2  1  0    1  NaN
3  4  1  0    1  NaN
=== C=0s
   A  B  C  Bsp  Csp
0  1  0  0    0  NaN
1  2  1  0    1  NaN
2  3  0  0    0  NaN
3  4  1  0    1  NaN
4  5  0  0    0  NaN
=== C=1s
Empty DataFrame
Columns: [A, B, C, Bsp, Csp]
Index: []
Problem description
I would not expect the values in the Csp column to change when I take a subset of rows. Note that it does not seem to matter how I choose the subset of rows: as soon as I use .loc or [(condition)], my all-zero column turns into an all-NaN column.
The example above may look artificial, but I ended up in this situation by performing one-hot encoding via get_dummies() with sparse=True, and I ran into problems as soon as I coincidentally chose a subset of my data for which at least one dummy column contains only zeros.
Small "real life" example:
import pandas as pd

x = {"A": [1, 2, 3, 4, 5],
     "B": [0, 1, 2, 1, 0]}
xpd = pd.DataFrame(x)

# One-hot-encode B into sparse dummy columns and append them.
dummies_b = pd.get_dummies(xpd.B, prefix="B=", sparse=True)
xpd_ext = pd.concat([xpd, dummies_b], axis=1)
print(xpd_ext)
print(xpd_ext.dtypes)

print("=== even indices + dtypes")
xpd_ext_even = xpd_ext.loc[[0, 2, 4]]
print(xpd_ext_even)
print(xpd_ext_even.dtypes)

print("=== even indices and B=2 + dtypes")
# Within the row subset, B=_1 is all zeros; this is where the NaNs appear.
print(xpd_ext_even[xpd_ext_even.B == 2])
print(xpd_ext_even[xpd_ext_even.B == 2].dtypes)
Output:
   A  B  B=_0  B=_1  B=_2
0  1  0     1     0     0
1  2  1     0     1     0
2  3  2     0     0     1
3  4  1     0     1     0
4  5  0     1     0     0
A                  int64
B                  int64
B=_0    Sparse[uint8, 0]
B=_1    Sparse[uint8, 0]
B=_2    Sparse[uint8, 0]
dtype: object
=== even indices + dtypes
   A  B  B=_0  B=_1  B=_2
0  1  0     1     0     0
2  3  2     0     0     1
4  5  0     1     0     0
A                  int64
B                  int64
B=_0    Sparse[int64, 0]
B=_1    Sparse[int64, 0]
B=_2    Sparse[int64, 0]
dtype: object
=== even indices and B=2 + dtypes
   A  B  B=_0  B=_1  B=_2
2  3  2     0   NaN     1
A                    int64
B                    int64
B=_0      Sparse[int64, 0]
B=_1    Sparse[float64, 0]
B=_2      Sparse[int64, 0]
dtype: object
Note how NaNs appear after sub-indexing, in exactly those sparse columns that were all zeros beforehand.
I also printed the dtypes here, because I coincidentally noticed that they change for the spurious columns as well (from int64 to float64). This might help in identifying the problem.
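For what it's worth, the dtype change seems traceable to the stored sparse values rather than the fill value: the dtype above is still Sparse[float64, 0], so the NaN must be an explicitly stored value. A small diagnostic sketch, assuming the Series.sparse accessor of pandas 0.25 (fill_value and sp_values are public properties there):

sub = xpd_ext_even[xpd_ext_even.B == 2]
# The fill value of B=_1 is still 0, but a NaN is now stored as an
# explicit sparse value where a plain 0 used to be.
print(sub["B=_1"].sparse.fill_value)
print(sub["B=_1"].sparse.sp_values)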
I haven't found an efficient workaround yet. My code runs if I convert the sparse data to dense early enough, but that is clearly very inefficient. Applying fillna to the potentially spurious columns also seems to be painfully slow.
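One possibly cheaper direction, sketched under the assumption that the NaNs are explicitly stored sparse values while the sparse index and fill value survive (as the dtypes above suggest): rebuild the affected column from its stored values only, without densifying the whole column. fix_sparse_column below is a hypothetical helper, not pandas API:

import numpy as np
import pandas as pd

def fix_sparse_column(s, dtype="int64", fill_value=0):
    # Hypothetical helper: the stored NaNs should have been fill values,
    # so zero them out and re-wrap the existing sparse index. Only the
    # explicitly stored values are touched, never the full dense column.
    arr = s.array  # the underlying SparseArray
    vals = np.nan_to_num(arr.sp_values).astype(dtype)
    fixed = pd.SparseArray(vals, sparse_index=arr.sp_index,
                           fill_value=fill_value)
    return pd.Series(fixed, index=s.index, name=s.name)

sub = xpd_ext_even[xpd_ext_even.B == 2].copy()
sub["B=_1"] = fix_sparse_column(sub["B=_1"])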
Output of pd.show_versions()
commit : None
python : 3.7.5.candidate.1
python-bits : 64
OS : Linux
OS-release : 5.3.0-23-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 0.25.3
numpy : 1.17.2
pytz : 2019.2
dateutil : 2.7.3
pip : 19.3.1
setuptools : 41.2.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.4.1
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10
IPython : 5.8.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.4.1
matplotlib : 3.1.0
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.3.1
sqlalchemy : 1.2.19
tables : None
xarray : 0.14.1
xlrd : None
xlwt : None
xlsxwriter : None