sort_index on a multiindexed DataFrame with sparse columns fills with NaNs #29735

Closed
marginalhours opened this issue Nov 20, 2019 · 5 comments · Fixed by #36236
Labels: good first issue; Needs Tests (Unit test(s) needed to prevent regressions); Reshaping (Concat, Merge/Join, Stack/Unstack, Explode); Sparse (Sparse Data Type)

Comments

@marginalhours

marginalhours commented Nov 20, 2019

Code Sample, a copy-pastable example if possible

import pandas as pd
from scipy.sparse import csr_matrix

# 4x4 all-zero sparse frame on a two-level MultiIndex
index = pd.MultiIndex.from_product([[1, 2], [1, 2]])
f = pd.DataFrame.sparse.from_spmatrix(csr_matrix((4, 4)), index=index)

print(f)
print(f.sort_index(level=0))

Problem description

When I create a MultiIndexed DataFrame with sparse columns and then call sort_index(), 0.0 (the sparse fill value) is replaced with NaN in the result, but seemingly only in columns that consist entirely of the fill value.

This also seems to affect RangeIndex: if I make a simpler DataFrame and call sort_index() without any arguments, the result is fine, but adding level=0 (or in fact level=anything) as a keyword argument fills the frame with NaN without raising any kind of warning.
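A minimal sketch of the RangeIndex case described above. The simpler frame was not posted, so the one-column frame `simple` and the use of `pd.arrays.SparseArray` are assumptions made for illustration. The snippet only reports how many NaN cells appear and does not assume a particular result, since that depends on the pandas version (a later comment in this thread reports the problem gone in 1.1.1).

```python
import pandas as pd

# Hypothetical stand-in for the "simpler DataFrame": one all-zero sparse
# column on the default RangeIndex (not taken from the original report).
simple = pd.DataFrame(
    {"a": pd.arrays.SparseArray([0.0, 0.0, 0.0], fill_value=0.0)}
)

no_level = simple.sort_index()           # reported to keep the zeros
with_level = simple.sort_index(level=0)  # reported to turn them into NaN


def count_nan(frame):
    """Count NaN cells after converting the sparse columns to a dense array."""
    return int(pd.isna(frame.to_numpy()).sum())


print("NaN cells without level:", count_nan(no_level))
print("NaN cells with level=0: ", count_nan(with_level))
```

A count of 0 on both lines means the fill value survived the sort; a non-zero count on the second line would match the behaviour reported here.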

I would expect the following behaviour:

  • Values preserved for MultiIndex
  • For RangeIndex, the level keyword should either be totally ignored (no impact on output) or should not be valid to pass in.

I don't think this is a duplicate. I searched the tracker, but the combination of sparse + MultiIndex seems to be rare.

Expected Output

>>> f.sort_index(level=0)
       0    1    2    3
1 1  0.0  0.0  0.0  0.0
  2  0.0  0.0  0.0  0.0
2 1  0.0  0.0  0.0  0.0
  2  0.0  0.0  0.0  0.0

Actual Output

>>> f.sort_index(level=0)
      0   1   2   3
1 1 NaN NaN NaN NaN
  2 NaN NaN NaN NaN
2 1 NaN NaN NaN NaN
  2 NaN NaN NaN NaN

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.4.final.0
python-bits : 64
OS : Linux
OS-release : 5.0.0-36-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 0.25.3
numpy : 1.17.4
pytz : 2019.3
dateutil : 2.8.1
pip : 19.2.2
setuptools : 41.0.1
Cython : 0.29.14
pytest : 3.10.1
hypothesis : None
sphinx : 2.2.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.9.0
pandas_datareader: None
bs4 : 4.8.0
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.3.2
sqlalchemy : None
tables : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None

@marginalhours
Author

I am not a pandas expert, but it seems the issue might be with the fill_value of the underlying Block object for sparse columns. I would expect the fill_value of the block to be the same as the fill_value of the column wrapping the block in a SingleBlockManager.

@marginalhours
Author

https://github.com/pandas-dev/pandas/blob/master/pandas/core/internals/blocks.py#L1723

This seems to be the offending line: for an ExtensionBlock, the na_value from the dtype is used as the fill_value. Perhaps it should read return self.values.dtype.fill_value instead? That would fix my immediate problem but would probably break something else.
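To make the distinction in the comment above concrete, the sketch below prints the two values side by side for a sparse dtype whose fill value is 0.0. It uses only the public `pd.SparseDtype` and `Series.sparse` API and does not touch the internal Block classes; which value the block actually picks up is the hypothesis from the comment, not something this snippet verifies.

```python
import numpy as np
import pandas as pd

# Sparse float dtype whose unstored positions stand for 0.0, as in the report.
dtype = pd.SparseDtype(np.float64, fill_value=0.0)

# fill_value: the value that unstored positions of a SparseArray represent.
print("dtype.fill_value: ", dtype.fill_value)

# na_value: the generic extension-dtype NA marker, which the comment
# suspects is being used in place of fill_value.
print("dtype.na_value:   ", dtype.na_value)

# A column built with this dtype reports the same fill value.
col = pd.Series(pd.arrays.SparseArray([0.0, 0.0, 1.0], dtype=dtype))
print("column fill_value:", col.sparse.fill_value)
```

If the two dtype attributes differ, any code path that fills with `na_value` would overwrite positions that were meant to hold `fill_value`, which is consistent with the NaN output shown in the report.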

@Dr-Irv
Contributor

Dr-Irv commented Sep 5, 2020

In version 1.1.1, this does not happen.

Dr-Irv closed this as completed Sep 5, 2020
Dr-Irv reopened this Sep 5, 2020

@Dr-Irv
Contributor

Dr-Irv commented Sep 5, 2020

A PR that creates a test case would be welcomed.
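One possible shape for such a test, as a rough sketch only. The function name `test_sort_index_sparse_multiindex_keeps_fill_value` is hypothetical, and the frame is built from `pd.arrays.SparseArray` columns rather than `from_spmatrix` to avoid the scipy dependency; whether that exercises exactly the same code path as the original report is an assumption. It is not taken from any merged PR.

```python
import numpy as np
import pandas as pd


def test_sort_index_sparse_multiindex_keeps_fill_value():
    # All-zero sparse columns on a two-level MultiIndex, mirroring the report.
    index = pd.MultiIndex.from_product([[1, 2], [1, 2]])
    columns = {
        col: pd.arrays.SparseArray(np.zeros(4), fill_value=0.0)
        for col in range(4)
    }
    frame = pd.DataFrame(columns, index=index)

    result = frame.sort_index(level=0)

    # Compare dense values so the check does not depend on sparse internals.
    dense = result.to_numpy()
    assert dense.shape == (4, 4)
    assert not np.isnan(dense).any()
    assert (dense == 0.0).all()
    assert result.index.equals(index)


test_sort_index_sparse_multiindex_keeps_fill_value()
```

On a pandas version affected by this issue the NaN assertion would be expected to fail; on a fixed version the function should return silently.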

Dr-Irv added the Needs Tests (Unit test(s) needed to prevent regressions) label Sep 5, 2020

@ylin00
Contributor

ylin00 commented Sep 9, 2020

take

jreback added this to the 1.2 milestone Sep 12, 2020
jreback added the Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) and Sparse (Sparse Data Type) labels Sep 12, 2020

4 participants