BUG: multiple application of .loc on sparse DataFrame results in NaN and filling the DataFrame #34687

deusebio · 2020-06-10T12:11:19Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.

Code Sample, a copy-pastable example

from scipy.sparse import coo_matrix, eye  
import pandas as pd

df = pd.DataFrame.sparse.from_spmatrix(eye(10))

df.sparse.density 
# Should output 0.1

df.loc[range(5)]                                                        
#     0    1    2    3    4    5    6    7    8    9
# 0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
# 1  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
# 2  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
# 3  0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0
# 4  0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0

df.loc[range(5)].sparse.density
# outputs 0.1

df.loc[range(5)].loc[range(3)]                                          
#     0    1    2    3    4   5   6   7   8   9
# 0  1.0  0.0  0.0  0.0  0.0 NaN NaN NaN NaN NaN
# 1  0.0  1.0  0.0  0.0  0.0 NaN NaN NaN NaN NaN
# 2  0.0  0.0  1.0  0.0  0.0 NaN NaN NaN NaN NaN

df.loc[range(5)].loc[range(3)].sparse.density
# outputs 0.6

Problem description

It seems that a sparse DataFrame extracted with .loc does not behave like the original one. This creates inconsistencies in our processing pipelines: depending on the filtering and selection that have been applied, the result sometimes contains NaN values and the frame gets filled in (the density rises from 0.1 to 0.6 in the example above), which severely impacts memory consumption and computation time.

Expected Output

The output of .loc should not depend on how many times the frame has already been sliced, i.e.

df.loc[range(5)].loc[range(3)] should equal df.loc[range(3)]
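
The expected behaviour can be written as an explicit check. This is only a minimal sketch, under one assumption that differs from the report: the sparse frame is built from a dense NumPy identity matrix via `astype(pd.SparseDtype(...))` instead of `DataFrame.sparse.from_spmatrix(eye(10))`, so that the snippet does not need scipy. On a pandas version that contains the fix, the assertion should pass; on 1.0.4, as reported here, it would presumably fail.

```python
import numpy as np
import pandas as pd

# Assumption: a sparse identity frame built without scipy; the report uses
# pd.DataFrame.sparse.from_spmatrix(eye(10)) instead.
df = pd.DataFrame(np.eye(10)).astype(pd.SparseDtype("float64", 0.0))

chained = df.loc[range(5)].loc[range(3)]  # two list-like lookups in a row
direct = df.loc[range(3)]                 # the same rows in a single lookup

# Raises AssertionError if values, dtypes or labels differ.
pd.testing.assert_frame_equal(chained, direct)

# The density should stay close to the 0.1 of the original frame.
chained_density = chained.sparse.density
print(chained_density)
```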

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : None

pandas : 1.0.4
numpy : 1.18.5
pytz : 2020.1
dateutil : 2.8.1
pip : 9.0.3
setuptools : 39.0.1
Cython : 0.29.19
pytest : 5.4.3
hypothesis : 5.16.1
sphinx : 3.1.0
blosc : 1.9.1
feather : None
xlsxwriter : 1.2.9
lxml.etree : 4.5.1
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.15.0
pandas_datareader: None
bs4 : 4.9.1
bottleneck : 1.3.2
fastparquet : 0.4.0
gcsfs : None
lxml.etree : 4.5.1
matplotlib : 3.2.1
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : 0.17.1
pytables : None
pytest : 5.4.3
pyxlsb : None
s3fs : 0.4.2
scipy : 1.4.1
sqlalchemy : 1.3.17
tables : 3.6.1
tabulate : 0.8.7
xarray : 0.15.1
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.9
numba : 0.49.1


TomAugspurger · 2020-06-10T12:57:01Z

On master I have

In [41]: df.loc[range(5)].loc[range(3)].sparse.density
Out[41]: 0.1

But for some reason the last columns are cast to Sparse[int] dtype

In [53]: df.loc[range(5)].loc[range(3)].dtypes
Out[53]:
0    Sparse[float64, 0]
1    Sparse[float64, 0]
2    Sparse[float64, 0]
3    Sparse[float64, 0]
4    Sparse[float64, 0]
5      Sparse[int64, 0]
6      Sparse[int64, 0]
7      Sparse[int64, 0]
8      Sparse[int64, 0]
9      Sparse[int64, 0]
dtype: object

Not sure what's going on.
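
To see which columns change dtype after the chained lookup, the dtypes can be compared column by column. The sketch below is illustrative only: `changed_dtypes` is a hypothetical helper, not part of pandas, and the frame is again built from a dense identity matrix (an assumption made to avoid the scipy dependency). An empty result means no column was recast.

```python
import numpy as np
import pandas as pd

# Assumption: sparse identity frame built without scipy.
df = pd.DataFrame(np.eye(10)).astype(pd.SparseDtype("float64", 0.0))


def changed_dtypes(before: pd.DataFrame, after: pd.DataFrame) -> dict:
    """Hypothetical helper: map column label -> (old dtype, new dtype)
    for every column whose dtype differs between the two frames."""
    return {
        col: (before[col].dtype, after[col].dtype)
        for col in before.columns
        if before[col].dtype != after[col].dtype
    }


diff = changed_dtypes(df, df.loc[range(5)].loc[range(3)])
print(diff)
```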

deusebio · 2020-06-10T16:47:14Z

Quite strange that we have different results and you cannot reproduce my outputs.

Here is the commit I have installed (bbb89ca).

Python 3.6.5 (default, Jan 14 2020, 14:04:16) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.15.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pandas as pd                                                                                                          

In [2]: from scipy.sparse import coo_matrix, eye                                                                                     

In [3]: df = pd.DataFrame.sparse.from_spmatrix(eye(10))                                                                              

In [4]: df.loc[range(5)].loc[range(3)]                                                                                               
Out[4]: 
     0    1    2    3    4   5   6   7   8   9
0  1.0  0.0  0.0  0.0  0.0 NaN NaN NaN NaN NaN
1  0.0  1.0  0.0  0.0  0.0 NaN NaN NaN NaN NaN
2  0.0  0.0  1.0  0.0  0.0 NaN NaN NaN NaN NaN

In [5]: exit

Not sure it matters, but I'm using Python 3.6.5 on macOS 10.15.2.

BTW, the dtype issue you got might be related to this other issue: #34540.

MJafarMashhadi · 2020-06-17T22:01:34Z

I was looking into this issue and should add that it happens when the second .loc lookup is a list-like (iterable), but not when it is a slice. For example:

In [1]: df.loc[[0, 1, 2, 3, 4, 5]].loc[[3,4,5]]                                                                                           
Out[1]: 
     0    1    2    3    4    5   6   7   8   9
3  0.0  0.0  0.0  1.0  0.0  0.0 NaN NaN NaN NaN
4  0.0  0.0  0.0  0.0  1.0  0.0 NaN NaN NaN NaN
5  0.0  0.0  0.0  0.0  0.0  1.0 NaN NaN NaN NaN

In [2]: df.loc[[0, 1, 2, 3, 4, 5]].loc[:3]  # or df.loc[range(6)].loc[:3]
Out[2]: 
     0    1    2    3    4    5    6    7    8    9
0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
1  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
2  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
3  0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0

Another observation is that the number of columns that keep their values depends on the first .loc lookup. See:

In [1]: df.loc[range(3)].loc[range(3)]                                                                                                   
Out[1]: 
     0    1    2   3   4   5   6   7   8   9
0  1.0  0.0  0.0 NaN NaN NaN NaN NaN NaN NaN
1  0.0  1.0  0.0 NaN NaN NaN NaN NaN NaN NaN
2  0.0  0.0  1.0 NaN NaN NaN NaN NaN NaN NaN
In [2]: df.loc[range(4)].loc[range(3)]                                                                                                   
Out[2]: 
     0    1    2    3   4   5   6   7   8   9
0  1.0  0.0  0.0  0.0 NaN NaN NaN NaN NaN NaN
1  0.0  1.0  0.0  0.0 NaN NaN NaN NaN NaN NaN
2  0.0  0.0  1.0  0.0 NaN NaN NaN NaN NaN NaN
In [3]: df.loc[range(8)].loc[range(3)]                                                                                                   
Out[3]: 
     0    1    2    3    4    5    6    7   8   9
0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0 NaN NaN
1  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0 NaN NaN
2  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0 NaN NaN
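
The two observations above (list-like versus slice in the second lookup, and the size of the first lookup) can be checked together in a small loop. This is a sketch rather than a definitive reproduction: `count_nans` is a hypothetical helper, and the frame is built from a dense identity matrix instead of scipy's `eye(10)`. On a pandas version with the fix every count is expected to be 0; on 1.0.4 the outputs above suggest the list-like cases would be non-zero.

```python
import numpy as np
import pandas as pd

# Assumption: sparse identity frame built without scipy.
df = pd.DataFrame(np.eye(10)).astype(pd.SparseDtype("float64", 0.0))


def count_nans(frame: pd.DataFrame) -> int:
    """Hypothetical helper: number of missing values in an all-sparse frame."""
    return int(frame.sparse.to_dense().isna().to_numpy().sum())


nan_counts = {}
for n in (3, 4, 8):  # sizes of the first lookup used in the comment above
    first = df.loc[range(n)]
    nan_counts[(n, "list")] = count_nans(first.loc[[0, 1, 2]])  # list-like
    nan_counts[(n, "slice")] = count_nans(first.loc[:2])        # label slice

for key, value in nan_counts.items():
    print(key, value)
```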

phofl · 2020-11-21T00:21:49Z

This now works on master: the dtypes are fine and the results match the expected output. It may need tests.

weikhor · 2020-12-17T13:52:01Z

@TomAugspurger @MJafarMashhadi @deusebio @phofl @jd
Hi all, I have created test cases.

deusebio added the Bug and Needs Triage labels on Jun 10, 2020

TomAugspurger added the Indexing and Sparse labels and removed the Needs Triage label on Jun 10, 2020

TomAugspurger added this to the Contributions Welcome milestone on Jun 10, 2020

phofl added the good first issue and Needs Tests labels and removed the Bug label on Nov 21, 2020

weikhor mentioned this issue on Dec 17, 2020: Tests .loc on sparse DataFrame #34687 #38540 (Closed)

jreback changed the milestone from Contributions Welcome to 1.3 on Dec 21, 2020

EricLeer mentioned this issue on Mar 23, 2021: TST Add test for loc on sparse dataframes #40593 (Merged)

jreback closed this as completed in #40593 on Apr 2, 2021
