
Method dropna does not work on SparseDataFrames #21172


Closed
babky opened this issue May 22, 2018 · 6 comments · Fixed by #21175
Labels
Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Sparse Sparse Data Type

@babky
Contributor

babky commented May 22, 2018

The dropna method may return a wrong result on a SparseDataFrame. The following code

import pandas as pd

pd.SparseDataFrame({"F1": [None, None], "F2": [0, 1], "F3": [float('nan'), 0]}).dropna(axis=1, inplace=False, how='all')
pd.SparseDataFrame({"F1": [None, None], "F2": [0, 1], "F3": [None, 0]}).dropna(axis=1, inplace=False, how='all')
pd.SparseDataFrame({"F1": [float('nan'), float('nan')], "F2": [0, 1], "F3": [float('nan'), 0]}).dropna(axis=1, inplace=False, how='all')
pd.SparseDataFrame({"F1": [None, None], "F2": [0, 1]}).dropna(axis=1, inplace=False, how='all')
pd.SparseDataFrame({"F1": [float('nan'), float('nan')], "F2": [0, 1]}).dropna(axis=1, inplace=False, how='all')

pd.SparseDataFrame({"F1": [None, None], "F2": [0, 1], "F3": [float('nan'), 0]}).to_dense().dropna(axis=1, inplace=False, how='all')
pd.SparseDataFrame({"F1": [None, None], "F2": [0, 1], "F3": [None, 0]}).to_dense().dropna(axis=1, inplace=False, how='all')
pd.SparseDataFrame({"F1": [float('nan'), float('nan')], "F2": [0, 1], "F3": [float('nan'), 0]}).to_dense().dropna(axis=1, inplace=False, how='all')
pd.SparseDataFrame({"F1": [None, None], "F2": [0, 1]}).to_dense().dropna(axis=1, inplace=False, how='all')
pd.SparseDataFrame({"F1": [float('nan'), float('nan')], "F2": [0, 1]}).to_dense().dropna(axis=1, inplace=False, how='all')

outputs

import pandas as pd

print(pd.SparseDataFrame({"F1": [None, None], "F2": [0, 1], "F3": [float('nan'), 0]}).dropna(axis=1, inplace=False, how='all'))
    F1  F2
0  NaN   0
1  NaN   1

print(pd.SparseDataFrame({"F1": [None, None], "F2": [0, 1], "F3": [None, 0]}).dropna(axis=1, inplace=False, how='all'))
    F1  F2
0  NaN   0
1  NaN   1

print(pd.SparseDataFrame({"F1": [float('nan'), float('nan')], "F2": [0, 1], "F3": [float('nan'), 0]}).dropna(axis=1, inplace=False, how='all'))
   F1  F2
0 NaN   0
1 NaN   1

print(pd.SparseDataFrame({"F1": [None, None], "F2": [0, 1]}).dropna(axis=1, inplace=False, how='all'))
    F1
0  NaN
1  NaN

print(pd.SparseDataFrame({"F1": [float('nan'), float('nan')], "F2": [0, 1]}).dropna(axis=1, inplace=False, how='all'))
   F1
0 NaN
1 NaN

print(pd.SparseDataFrame({"F1": [None, None], "F2": [0, 1], "F3": [float('nan'), 0]}).to_dense().dropna(axis=1, inplace=False, how='all'))
   F2   F3
0   0  NaN
1   1  0.0

print(pd.SparseDataFrame({"F1": [None, None], "F2": [0, 1], "F3": [None, 0]}).to_dense().dropna(axis=1, inplace=False, how='all'))
   F2   F3
0   0  NaN
1   1  0.0

print(pd.SparseDataFrame({"F1": [float('nan'), float('nan')], "F2": [0, 1], "F3": [float('nan'), 0]}).to_dense().dropna(axis=1, inplace=False, how='all'))
   F2   F3
0   0  NaN
1   1  0.0

print(pd.SparseDataFrame({"F1": [None, None], "F2": [0, 1]}).to_dense().dropna(axis=1, inplace=False, how='all'))
   F2
0   0
1   1

print(pd.SparseDataFrame({"F1": [float('nan'), float('nan')], "F2": [0, 1]}).to_dense().dropna(axis=1, inplace=False, how='all'))
   F2
0   0
1   1

Problem description

The dropna method behaves differently for SparseDataFrames and dense DataFrames. It can also keep an all-NaN column while dropping a column that holds data (see the last two examples in the first batch). The second batch of commands, which calls to_dense() first, shows the correct behaviour.
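As a reference for what dropna(axis=1, how='all') is expected to do, here is a minimal pure-Python sketch of the semantics. It is not pandas code: the helper names is_missing and drop_all_missing_columns are hypothetical, and a plain dict of lists stands in for the frame.

```python
import math


def is_missing(value):
    """Treat both None and float NaN as missing, as the dense output above does."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def drop_all_missing_columns(columns):
    """Keep only the columns that contain at least one non-missing value."""
    return {
        name: values
        for name, values in columns.items()
        if not all(is_missing(v) for v in values)
    }


frame = {"F1": [None, None], "F2": [0, 1], "F3": [float("nan"), 0]}
kept = drop_all_missing_columns(frame)
print(list(kept))  # ['F2', 'F3']
```

F3 survives because only one of its values is missing, and F1 is the only column removed. This matches the dense results in the second batch.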

Expected Output

   F2   F3
0   0  NaN
1   1  0.0

   F2   F3
0   0  NaN
1   1  0.0

   F2   F3
0   0  NaN
1   1  0.0

   F2
0   0
1   1

   F2
0   0
1   1

Output of pd.show_versions()

pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.5.final.0
python-bits: 64  
OS: Linux       
OS-release: 4.15.0-20-generic
machine: x86_64
processor:
byteorder: little                                                                
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: en_US.UTF-8
                                                                                 
pandas: 0.23.0
pytest: 3.5.0
pip: 9.0.3
setuptools: 39.0.1
Cython: 0.28.2                                                                   
numpy: 1.14.3
scipy: 1.0.1
pyarrow: None
xarray: None                                                                     
IPython: 6.3.1
sphinx: None
patsy: None
dateutil: 2.7.2                                                                  
pytz: 2018.4
blosc: None
bottleneck: None
tables: None                                                                     
numexpr: 2.6.4
feather: None
matplotlib: 2.2.2
openpyxl: None                                                                   
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1       
sqlalchemy: 1.2.7
pymysql: None     
psycopg2: None    
jinja2: 2.10
s3fs: None           
fastparquet: None
pandas_gbq: None
pandas_datareader: None
@WillAyd
Member

WillAyd commented May 22, 2018

Can you simplify your example? Ideally focus on one copy / paste-able example that outlines the issue and expected result.

Be sure to check out the contributing guide for proper bug reporting:

https://pandas.pydata.org/pandas-docs/stable/contributing.html#bug-reports-and-enhancement-requests

@WillAyd WillAyd added Sparse Sparse Data Type Needs Info Clarification about behavior needed to assess issue labels May 22, 2018
@babky
Contributor Author

babky commented May 22, 2018

The clearest example is this one:

import pandas as pd

print(pd.SparseDataFrame({"F1": [float('nan'), float('nan')], "F2": [0, 1]}).dropna(axis=1, inplace=False, how='all'))

outputs

   F1
0 NaN
1 NaN

That is, after dropping the all-NaN fields, the all-NaN field F1 is returned instead of field F2. F1 should have been dropped.

@TomAugspurger
Contributor

Is this new in 0.23?

@babky
Copy link
Contributor Author

babky commented May 22, 2018

Exactly. I tried it in a new virtualenv: pandas 0.23.0 has this issue and 0.22.0 does not.

@WillAyd
Member

WillAyd commented May 22, 2018

Thanks for the example / report. Investigation / PRs are always welcome!

@WillAyd WillAyd added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate and removed Needs Info Clarification about behavior needed to assess issue labels May 22, 2018
@babky
Contributor Author

babky commented May 22, 2018

To conclude, I've isolated the issue to the code below. dropna creates a mask, a SparseArray marking the fields that should be preserved, and then calls nonzero on that mask. The nonzero operation is buggy even in 0.22.0. In both versions this code

import pandas as pd

sa = pd.SparseArray([float('nan'), float('nan'), 1, 0, 0, 2, 0, 0, 0, 3, 0, 0])
sa = sa.nonzero()
print(sa)

outputs 0, 3, 7. The correct result would be 2, 5, 9, that is, shifted by two. In pandas 0.22.0 the bug did not surface in dropna because to_dense() was used in the process.

There are two ways to resolve this. One could use to_dense() again, so that dropna() would work while SparseArray would remain buggy. Fixing SparseArray itself would be of greater value.
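The shift by two is consistent with nonzero being evaluated on the stored (non-NaN) values only, without mapping the positions back to the original array. The sketch below illustrates that reading in plain Python. It is a toy model under that assumption, not the pandas implementation: to_sparse, nonzero_naive and nonzero_mapped are hypothetical names, and the (sp_values, sp_index) pair is a simplified stand-in for a sparse layout.

```python
import math


def to_sparse(values):
    """Toy sparse layout: store the non-NaN values and their original positions."""
    sp_index = [i for i, v in enumerate(values) if not math.isnan(v)]
    sp_values = [values[i] for i in sp_index]
    return sp_values, sp_index


def nonzero_naive(sp_values):
    """Positions of nonzero entries within the compressed storage only."""
    return [k for k, v in enumerate(sp_values) if v != 0]


def nonzero_mapped(sp_values, sp_index):
    """Positions of nonzero entries translated back to the original array."""
    return [sp_index[k] for k, v in enumerate(sp_values) if v != 0]


data = [float("nan"), float("nan"), 1, 0, 0, 2, 0, 0, 0, 3, 0, 0]
sp_values, sp_index = to_sparse(data)

naive = nonzero_naive(sp_values)
mapped = nonzero_mapped(sp_values, sp_index)
print(naive)   # [0, 3, 7]
print(mapped)  # [2, 5, 9]
```

The two leading NaNs are not stored, so every position in the compressed storage is two lower than in the original array. Translating through the stored index recovers 2, 5, 9.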

@jreback jreback added this to the 0.24.0 milestone Nov 11, 2018