sort_index on a multiindexed DataFrame with sparse columns fills with NaNs #29735

marginalhours · 2019-11-20T12:24:32Z

Code Sample, a copy-pastable example if possible

import pandas as pd
from scipy.sparse import csr_matrix

f = pd.DataFrame.sparse.from_spmatrix(csr_matrix((4, 4)), index=pd.MultiIndex.from_product([[1, 2], [1,2]]))

f
f.sort_index(level=0)

Problem description

When I create a MultiIndexed DataFrame with sparse columns and then call sort_index(), 0.0 (the sparse fill value) is replaced with NaN in the result -- but seemingly only for columns that consist entirely of the fill value.

This might also be an issue with RangeIndex: if I make a simpler DataFrame and call sort_index() without any arguments, the result is fine, but adding level=0 (or in fact level=anything) as a kwarg causes the frame to be filled with NaN without raising any kind of warning.

I would expect the following behaviour:

Values preserved for MultiIndex
For RangeIndex, the level keyword should either be totally ignored (no impact on output) or should not be valid to pass in.
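
The RangeIndex case described above can be checked with a small sketch like the one below. The frame, its column names, and the variable names are assumptions made for illustration, not taken from the report. It counts the NaNs produced by sort_index() with and without the level keyword. On 0.25.3 the second count was reported to be non-zero; on releases where the problem no longer occurs both counts should be 0.

```python
import pandas as pd

# Hypothetical minimal frame: default RangeIndex, every column sparse
# with fill value 0.0 and no explicitly stored values.
frame = pd.DataFrame(
    {col: pd.arrays.SparseArray([0.0, 0.0, 0.0], fill_value=0.0) for col in "ab"}
)

plain = frame.sort_index()           # no arguments
leveled = frame.sort_index(level=0)  # level keyword on a flat index

# Densify before counting so the check does not depend on sparse internals.
nan_count_plain = int(plain.sparse.to_dense().isna().sum().sum())
nan_count_leveled = int(leveled.sparse.to_dense().isna().sum().sum())

print(nan_count_plain, nan_count_leveled)
```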

I don't think this is a duplicate: I searched the tracker, but the combination of sparse + MultiIndex seems to be rare.

Expected Output

>>> f.sort_index(level=0)
       0    1    2    3
1 1  0.0  0.0  0.0  0.0
  2  0.0  0.0  0.0  0.0
2 1  0.0  0.0  0.0  0.0
  2  0.0  0.0  0.0  0.0

Actual Output

>>> f.sort_index(level=0)
      0   1   2   3
1 1 NaN NaN NaN NaN
  2 NaN NaN NaN NaN
2 1 NaN NaN NaN NaN
  2 NaN NaN NaN NaN

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : None
python : 3.7.4.final.0
python-bits : 64
OS : Linux
OS-release : 5.0.0-36-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 0.25.3
numpy : 1.17.4
pytz : 2019.3
dateutil : 2.8.1
pip : 19.2.2
setuptools : 41.0.1
Cython : 0.29.14
pytest : 3.10.1
hypothesis : None
sphinx : 2.2.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.9.0
pandas_datareader: None
bs4 : 4.8.0
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.3.2
sqlalchemy : None
tables : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None


marginalhours · 2019-11-20T14:20:27Z

I am not a pandas expert, but it seems the issue might be with the fill_value of the underlying Block object for sparse columns. I would expect the fill_value of the block to be the same as the fill_value of the column wrapping the block in a SingleBlockManager.

marginalhours · 2019-11-20T14:54:14Z

https://github.com/pandas-dev/pandas/blob/master/pandas/core/internals/blocks.py#L1723

This seems to be the offending line -- for an ExtensionBlock, the na_value from the dtype is used as the fill_value. Perhaps it should instead read `return self.values.dtype.fill_value`? This would fix my immediate problem but would probably break something else.
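
To illustrate the distinction drawn here, the following sketch inspects the two attributes on a sparse dtype. It only looks at the public dtype object and does not touch the Block internals mentioned above. The array contents are an assumption chosen to mirror the all-zero columns from the report.

```python
import pandas as pd

# A sparse float column whose fill value is 0.0, as in the report.
arr = pd.arrays.SparseArray([0.0, 0.0, 0.0, 0.0], fill_value=0.0)
dtype = arr.dtype

fill_value = dtype.fill_value  # the value that is not stored explicitly
na_value = dtype.na_value      # the dtype's generic missing-value marker
stored = arr.sp_values         # explicitly stored values (none for all-fill data)

print(dtype, fill_value, na_value, len(stored))
```

If a reindexing step fills with na_value instead of fill_value, a column with no stored values would come back as all NaN, which matches the symptom described in the report.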

Dr-Irv · 2020-09-05T15:54:18Z

In version 1.1.1, this does not happen.

Dr-Irv · 2020-09-05T18:33:04Z

A PR that creates a test case would be welcomed.
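
A regression check along these lines might look like the sketch below. This is not the test that was eventually merged. The helper names are hypothetical, and the frame is built from SparseArray columns instead of scipy's csr_matrix so that only pandas is required.

```python
import pandas as pd


def make_sparse_frame():
    """Build a 4x4 all-zero sparse frame on a two-level MultiIndex."""
    index = pd.MultiIndex.from_product([[1, 2], [1, 2]])
    data = {
        col: pd.arrays.SparseArray([0.0] * 4, fill_value=0.0) for col in range(4)
    }
    return pd.DataFrame(data, index=index)


def check_sort_index_keeps_fill_value():
    """Fail if sort_index(level=0) turns the sparse fill value into NaN."""
    frame = make_sparse_frame()
    result = frame.sort_index(level=0)
    dense = result.sparse.to_dense()
    assert not dense.isna().any().any()
    assert (dense == 0.0).all().all()
    return result


result = check_sort_index_keeps_fill_value()
print(result.dtypes)
```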

ylin00 · 2020-09-09T00:29:20Z

take

Dr-Irv closed this as completed Sep 5, 2020

Dr-Irv reopened this Sep 5, 2020

Dr-Irv added the good first issue label Sep 5, 2020

Dr-Irv added the Needs Tests (unit test(s) needed to prevent regressions) label Sep 5, 2020

github-actions bot assigned ylin00 Sep 9, 2020

ylin00 mentioned this issue Sep 9, 2020

TST: add test case for sort_index on multiindexed Frame with sparse cols #36236 (merged)

jreback added this to the 1.2 milestone Sep 12, 2020

jreback added the Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) and Sparse (Sparse Data Type) labels Sep 12, 2020

jreback closed this as completed in #36236 Sep 12, 2020
