Method dropna does not work on SparseDataFrames #21172

babky · 2018-05-22T13:52:33Z

dropna may return a wrong result on a SparseDataFrame. The following code

import pandas as pd

pd.SparseDataFrame({"F1": [None, None], "F2": [0, 1], "F3": [float('nan'), 0]}).dropna(axis=1, inplace=False, how='all')
pd.SparseDataFrame({"F1": [None, None], "F2": [0, 1], "F3": [None, 0]}).dropna(axis=1, inplace=False, how='all')
pd.SparseDataFrame({"F1": [float('nan'), float('nan')], "F2": [0, 1], "F3": [float('nan'), 0]}).dropna(axis=1, inplace=False, how='all')
pd.SparseDataFrame({"F1": [None, None], "F2": [0, 1]}).dropna(axis=1, inplace=False, how='all')
pd.SparseDataFrame({"F1": [float('nan'), float('nan')], "F2": [0, 1]}).dropna(axis=1, inplace=False, how='all')

pd.SparseDataFrame({"F1": [None, None], "F2": [0, 1], "F3": [float('nan'), 0]}).to_dense().dropna(axis=1, inplace=False, how='all')
pd.SparseDataFrame({"F1": [None, None], "F2": [0, 1], "F3": [None, 0]}).to_dense().dropna(axis=1, inplace=False, how='all')
pd.SparseDataFrame({"F1": [float('nan'), float('nan')], "F2": [0, 1], "F3": [float('nan'), 0]}).to_dense().dropna(axis=1, inplace=False, how='all')
pd.SparseDataFrame({"F1": [None, None], "F2": [0, 1]}).to_dense().dropna(axis=1, inplace=False, how='all')
pd.SparseDataFrame({"F1": [float('nan'), float('nan')], "F2": [0, 1]}).to_dense().dropna(axis=1, inplace=False, how='all')

outputs

import pandas as pd

print(pd.SparseDataFrame({"F1": [None, None], "F2": [0, 1], "F3": [float('nan'), 0]}).dropna(axis=1, inplace=False, how='all'))
    F1  F2
0  NaN   0
1  NaN   1

print(pd.SparseDataFrame({"F1": [None, None], "F2": [0, 1], "F3": [None, 0]}).dropna(axis=1, inplace=False, how='all'))
    F1  F2
0  NaN   0
1  NaN   1

print(pd.SparseDataFrame({"F1": [float('nan'), float('nan')], "F2": [0, 1], "F3": [float('nan'), 0]}).dropna(axis=1, inplace=False, how='all'))
   F1  F2
0 NaN   0
1 NaN   1

print(pd.SparseDataFrame({"F1": [None, None], "F2": [0, 1]}).dropna(axis=1, inplace=False, how='all'))
    F1
0  NaN
1  NaN

print(pd.SparseDataFrame({"F1": [float('nan'), float('nan')], "F2": [0, 1]}).dropna(axis=1, inplace=False, how='all'))
   F1
0 NaN
1 NaN

print(pd.SparseDataFrame({"F1": [None, None], "F2": [0, 1], "F3": [float('nan'), 0]}).to_dense().dropna(axis=1, inplace=False, how='all'))
   F2   F3
0   0  NaN
1   1  0.0

print(pd.SparseDataFrame({"F1": [None, None], "F2": [0, 1], "F3": [None, 0]}).to_dense().dropna(axis=1, inplace=False, how='all'))
   F2   F3
0   0  NaN
1   1  0.0

print(pd.SparseDataFrame({"F1": [float('nan'), float('nan')], "F2": [0, 1], "F3": [float('nan'), 0]}).to_dense().dropna(axis=1, inplace=False, how='all'))
   F2   F3
0   0  NaN
1   1  0.0

print(pd.SparseDataFrame({"F1": [None, None], "F2": [0, 1]}).to_dense().dropna(axis=1, inplace=False, how='all'))
   F2
0   0
1   1

print(pd.SparseDataFrame({"F1": [float('nan'), float('nan')], "F2": [0, 1]}).to_dense().dropna(axis=1, inplace=False, how='all'))
   F2
0   0
1   1

Problem description

The dropna method behaves differently for SparseDataFrames and dense DataFrames. It may also fail to drop all-NaN columns at all (see the last examples in the first batch). The second batch of commands, which calls to_dense() first, shows the correct behaviour.

Expected Output

Output of `pd.show_versions()`

pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.5.final.0
python-bits: 64  
OS: Linux       
OS-release: 4.15.0-20-generic
machine: x86_64
processor:
byteorder: little                                                                
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: en_US.UTF-8
                                                                                 
pandas: 0.23.0
pytest: 3.5.0
pip: 9.0.3
setuptools: 39.0.1
Cython: 0.28.2                                                                   
numpy: 1.14.3
scipy: 1.0.1
pyarrow: None
xarray: None                                                                     
IPython: 6.3.1
sphinx: None
patsy: None
dateutil: 2.7.2                                                                  
pytz: 2018.4
blosc: None
bottleneck: None
tables: None                                                                     
numexpr: 2.6.4
feather: None
matplotlib: 2.2.2
openpyxl: None                                                                   
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1       
sqlalchemy: 1.2.7
pymysql: None     
psycopg2: None    
jinja2: 2.10
s3fs: None           
fastparquet: None
pandas_gbq: None
pandas_datareader: None


WillAyd · 2018-05-22T16:29:40Z

Can you simplify your example? Ideally focus on one copy / paste-able example that outlines the issue and expected result.

Be sure to check out the contributing guide for proper bug reporting:

https://pandas.pydata.org/pandas-docs/stable/contributing.html#bug-reports-and-enhancement-requests

babky · 2018-05-22T16:38:18Z

The most prominent sample is this one

import pandas as pd

print(pd.SparseDataFrame({"F1": [float('nan'), float('nan')], "F2": [0, 1]}).dropna(axis=1, inplace=False, how='all'))

outputs

   F1
0 NaN
1 NaN

That is, after dropping the all-NaN fields, the all-NaN field F1 is returned instead of field F2. It should have dropped field F1.
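The expected semantics of dropna(axis=1, how='all') can be sketched without pandas. This is only an illustration: `drop_all_nan_columns` is a hypothetical helper, not part of pandas, and it models a frame as a plain dict of column lists.

```python
import math


def drop_all_nan_columns(columns):
    """Keep only the columns that hold at least one non-missing value.

    `columns` maps a column name to a list of values. None and NaN both
    count as missing. Hypothetical helper, not part of pandas.
    """
    def is_missing(value):
        return value is None or (isinstance(value, float) and math.isnan(value))

    return {
        name: values
        for name, values in columns.items()
        if not all(is_missing(v) for v in values)
    }


frame = {"F1": [float("nan"), float("nan")], "F2": [0, 1]}
print(drop_all_nan_columns(frame))  # {'F2': [0, 1]}
```

Under this model F1 is dropped and F2 is kept, which matches the dense output shown in the report.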

TomAugspurger · 2018-05-22T17:34:48Z

Is this new in 0.23?

babky · 2018-05-22T18:11:05Z

Exactly. I tried it in a new virtualenv: pandas 0.23.0 has this issue and 0.22.0 does not.

WillAyd · 2018-05-22T19:44:29Z

Thanks for the example / report. Investigation / PRs are always welcome!

babky · 2018-05-22T19:48:28Z

To conclude, I've isolated the issue to the following code. A mask, a SparseArray marking the fields that should be preserved, is created, and the nonzero operation is called on it. However, nonzero is buggy even in 0.22.0. In both versions this code

import pandas as pd

sa = pd.SparseArray([float('nan'), float('nan'), 1, 0, 0, 2, 0, 0, 0, 3, 0, 0])
sa = sa.nonzero()
print(sa)

outputs 0, 3, 7. The correct result would be 2, 5, 9, i.e. shifted by two. In pandas 0.22.0 this was avoided by using to_dense() in the process.

One way to resolve this would be to use to_dense() so that dropna() works, but SparseArray would remain buggy. Fixing SparseArray itself would be of greater value.
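The shift by two can be reproduced with a simplified model of sparse storage. The sketch below is an assumption for illustration only, not pandas' actual implementation: `to_sparse`, `nonzero_naive` and `nonzero_mapped` are hypothetical names, and the model stores only the non-NaN values together with their positions in the full array.

```python
import math


def to_sparse(values):
    # Hypothetical model: keep only non-NaN values plus their original positions.
    sp_index = [i for i, v in enumerate(values) if not math.isnan(v)]
    sp_values = [values[i] for i in sp_index]
    return sp_values, sp_index


def nonzero_naive(sp_values, sp_index):
    # Returns positions within the stored values, ignoring the skipped NaNs.
    return [k for k, v in enumerate(sp_values) if v != 0]


def nonzero_mapped(sp_values, sp_index):
    # Maps each stored position back to its position in the full array.
    return [sp_index[k] for k, v in enumerate(sp_values) if v != 0]


nan = float("nan")
data = [nan, nan, 1, 0, 0, 2, 0, 0, 0, 3, 0, 0]
sp_values, sp_index = to_sparse(data)
print(nonzero_naive(sp_values, sp_index))   # [0, 3, 7]
print(nonzero_mapped(sp_values, sp_index))  # [2, 5, 9]
```

In this model the naive version is off by exactly the number of leading NaNs, which is consistent with the 0, 3, 7 versus 2, 5, 9 difference reported above.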

WillAyd added the Sparse and Needs Info labels May 22, 2018

WillAyd added the Missing-data label and removed the Needs Info label May 22, 2018

babky mentioned this issue May 22, 2018

Fix nonzero of a SparseArray #21175

Merged

4 tasks

jreback added this to the 0.24.0 milestone Nov 11, 2018

jreback closed this as completed in #21175 Nov 17, 2018
