BUG: multiple application of .loc on sparse DataFrame results in NaN and filling the DataFrame #34687

Closed
3 tasks done
deusebio opened this issue Jun 10, 2020 · 5 comments · Fixed by #40593
Labels: good first issue, Indexing, Needs Tests, Sparse

@deusebio

deusebio commented Jun 10, 2020

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

from scipy.sparse import eye
import pandas as pd

df = pd.DataFrame.sparse.from_spmatrix(eye(10))

df.sparse.density 
# Should output 0.1

df.loc[range(5)]                                                        
#     0    1    2    3    4    5    6    7    8    9
# 0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
# 1  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
# 2  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
# 3  0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0
# 4  0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0

df.loc[range(5)].sparse.density
# outputs 0.1

df.loc[range(5)].loc[range(3)]                                          
#     0    1    2    3    4   5   6   7   8   9
# 0  1.0  0.0  0.0  0.0  0.0 NaN NaN NaN NaN NaN
# 1  0.0  1.0  0.0  0.0  0.0 NaN NaN NaN NaN NaN
# 2  0.0  0.0  1.0  0.0  0.0 NaN NaN NaN NaN NaN

df.loc[range(5)].loc[range(3)].sparse.density
# outputs 0.6

Problem description

It seems that a sparse DataFrame extracted with .loc does not behave like the original. This creates inconsistencies in our processing pipelines: depending on the filtering and selection that has been applied, the result sometimes contains NaNs and becomes much denser, severely impacting memory consumption and computation time.

Expected Output

The output of .loc should not depend on how many times the frame has already been sliced, i.e.

df.loc[range(5)].loc[range(3)] should equal df.loc[range(3)]
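The expectation above can be written as an executable check. This is only a sketch: it reuses the frame from the code sample and relies on `pd.testing.assert_frame_equal`, which is not mentioned in the report. On an affected version such as 1.0.4 the assertion should fail, because the chained result contains NaN columns.

```python
import pandas as pd
from scipy.sparse import eye

# Same 10x10 sparse identity frame as in the code sample.
df = pd.DataFrame.sparse.from_spmatrix(eye(10))

chained = df.loc[range(5)].loc[range(3)]  # two successive list-like lookups
direct = df.loc[range(3)]                 # one lookup selecting the same rows

# Raises AssertionError if the two results differ in values or dtypes.
pd.testing.assert_frame_equal(chained, direct)
```

Comparing densities alone would not be enough, since a frame can keep the same density while its column dtypes change.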

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None

pandas : 1.0.4
numpy : 1.18.5
pytz : 2020.1
dateutil : 2.8.1
pip : 9.0.3
setuptools : 39.0.1
Cython : 0.29.19
pytest : 5.4.3
hypothesis : 5.16.1
sphinx : 3.1.0
blosc : 1.9.1
feather : None
xlsxwriter : 1.2.9
lxml.etree : 4.5.1
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.15.0
pandas_datareader: None
bs4 : 4.9.1
bottleneck : 1.3.2
fastparquet : 0.4.0
gcsfs : None
lxml.etree : 4.5.1
matplotlib : 3.2.1
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : 0.17.1
pytables : None
pytest : 5.4.3
pyxlsb : None
s3fs : 0.4.2
scipy : 1.4.1
sqlalchemy : 1.3.17
tables : 3.6.1
tabulate : 0.8.7
xarray : 0.15.1
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.9
numba : 0.49.1

@deusebio added the Bug and Needs Triage labels Jun 10, 2020
@TomAugspurger
Contributor

On master I have

In [41]: df.loc[range(5)].loc[range(3)].sparse.density
Out[41]: 0.1

But for some reason the last columns are cast to Sparse[int64, 0] dtype:

In [53]: df.loc[range(5)].loc[range(3)].dtypes
Out[53]:
0    Sparse[float64, 0]
1    Sparse[float64, 0]
2    Sparse[float64, 0]
3    Sparse[float64, 0]
4    Sparse[float64, 0]
5      Sparse[int64, 0]
6      Sparse[int64, 0]
7      Sparse[int64, 0]
8      Sparse[int64, 0]
9      Sparse[int64, 0]
dtype: object

Not sure what's going on.
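One way to surface this kind of dtype drift without reading the full `dtypes` listing is to compare each column's dtype before and after the chained lookup. A minimal sketch, assuming the same frame as in the original report; the `mismatched` name is purely illustrative:

```python
import pandas as pd
from scipy.sparse import eye

df = pd.DataFrame.sparse.from_spmatrix(eye(10))
result = df.loc[range(5)].loc[range(3)]

# Map each column whose dtype changed to its (original, resulting) dtypes.
mismatched = {
    col: (df.dtypes[col], result.dtypes[col])
    for col in df.columns
    if result.dtypes[col] != df.dtypes[col]
}
print(mismatched or "all column dtypes preserved")
```

Given the output above, columns 5 to 9 would be expected to show up in `mismatched` on that build, while a fixed version should leave it empty.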

@TomAugspurger added the Indexing and Sparse labels and removed the Needs Triage label Jun 10, 2020
@TomAugspurger added this to the Contributions Welcome milestone Jun 10, 2020
@deusebio
Author

deusebio commented Jun 10, 2020

Quite strange that we get different results and that you cannot reproduce my output.

Here is the commit I have installed (bbb89ca).

Python 3.6.5 (default, Jan 14 2020, 14:04:16) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.15.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pandas as pd                                                                                                          

In [2]: from scipy.sparse import coo_matrix, eye                                                                                     

In [3]: df = pd.DataFrame.sparse.from_spmatrix(eye(10))                                                                              

In [4]: df.loc[range(5)].loc[range(3)]                                                                                               
Out[4]: 
     0    1    2    3    4   5   6   7   8   9
0  1.0  0.0  0.0  0.0  0.0 NaN NaN NaN NaN NaN
1  0.0  1.0  0.0  0.0  0.0 NaN NaN NaN NaN NaN
2  0.0  0.0  1.0  0.0  0.0 NaN NaN NaN NaN NaN

In [5]: exit         

Not sure it matters, but I'm using Python 3.6.5 on macOS 10.15.2.

BTW, the dtype issue you got might be related to this other issue: #34540

@MJafarMashhadi
Contributor

MJafarMashhadi commented Jun 17, 2020

I was looking into this issue and should add that it happens when the second .loc lookup is an iterable (a list or range), but it's fine with slices. For example:

In [1]: df.loc[[0, 1, 2, 3, 4, 5]].loc[[3,4,5]]                                                                                           
Out[1]: 
     0    1    2    3    4    5   6   7   8   9
3  0.0  0.0  0.0  1.0  0.0  0.0 NaN NaN NaN NaN
4  0.0  0.0  0.0  0.0  1.0  0.0 NaN NaN NaN NaN
5  0.0  0.0  0.0  0.0  0.0  1.0 NaN NaN NaN NaN

In [2]: df.loc[[0, 1, 2, 3, 4, 5]].loc[:3]  # or df.loc[range(6)].loc[:3]
Out[2]: 
     0    1    2    3    4    5    6    7    8    9
0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
1  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
2  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
3  0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0

Another observation is that the number of columns it keeps depends on the first .loc lookup. See:

In [1]: df.loc[range(3)].loc[range(3)]                                                                                                   
Out[1]: 
     0    1    2   3   4   5   6   7   8   9
0  1.0  0.0  0.0 NaN NaN NaN NaN NaN NaN NaN
1  0.0  1.0  0.0 NaN NaN NaN NaN NaN NaN NaN
2  0.0  0.0  1.0 NaN NaN NaN NaN NaN NaN NaN
In [2]: df.loc[range(4)].loc[range(3)]                                                                                                   
Out[2]: 
     0    1    2    3   4   5   6   7   8   9
0  1.0  0.0  0.0  0.0 NaN NaN NaN NaN NaN NaN
1  0.0  1.0  0.0  0.0 NaN NaN NaN NaN NaN NaN
2  0.0  0.0  1.0  0.0 NaN NaN NaN NaN NaN NaN
In [3]: df.loc[range(8)].loc[range(3)]                                                                                                   
Out[3]: 
     0    1    2    3    4    5    6    7   8   9
0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0 NaN NaN
1  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0 NaN NaN
2  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0 NaN NaN
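Both observations can be gathered in a single run. The sketch below is not part of the original comment: `nan_columns` and `report` are hypothetical names, and the frame is the one from the issue's code sample. For each size of the first lookup it records which columns contain NaN after a list-like second lookup and after a slice.

```python
import pandas as pd
from scipy.sparse import eye

df = pd.DataFrame.sparse.from_spmatrix(eye(10))


def nan_columns(frame):
    """Return labels of columns holding at least one NaN (hypothetical helper)."""
    dense = frame.sparse.to_dense()
    return [col for col in dense.columns if dense[col].isna().any()]


report = {}
for n in (3, 4, 8):
    first = df.loc[range(n)]
    report[n] = {
        "list_like": nan_columns(first.loc[range(3)]),  # second lookup is an iterable
        "slice": nan_columns(first.loc[:2]),            # label slice, inclusive of 2
    }

for n, found in report.items():
    print(n, found)
```

Going by the outputs above, an affected version would be expected to list columns `n` through 9 under `list_like` and nothing under `slice`; a fixed version should report no NaN columns in either case.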

@phofl
Member

phofl commented Nov 21, 2020

This works now on master. Dtypes are fine and the results equal the expected output. May need tests.

@phofl added the good first issue and Needs Tests labels and removed the Bug label Nov 21, 2020
@weikhor
Contributor

weikhor commented Dec 17, 2020

@TomAugspurger @MJafarMashhadi @deusebio @phofl @jd
Hi all, I have created test cases.

@jreback modified the milestones: Contributions Welcome, 1.3 Dec 21, 2020