Pandas dataframe sub-setting returns NaN's when whole column matches fill value #29321

mjobin · 2019-11-01T14:06:36Z

Code Sample, a copy-pastable example if possible

import random
import numpy as np
import multiprocessing as mp
import pandas as pd


TEST_LINES = 10


samezeroint = [0 for i in range(TEST_LINES)]
sameoneint = [1 for i in range(TEST_LINES)]
samezerofloat = [0.0 for i in range(TEST_LINES)]
sameonefloat = [1.0 for i in range(TEST_LINES)]
indexone = [i for i in range(TEST_LINES)]

randomint = []
randomfloat = []

for i in range(TEST_LINES):
    randomint.append(random.randint(0,100))
    randomfloat.append(random.random())

testdict = {'indexone': indexone, "samezeroint": samezeroint, 'sameoneint': sameoneint, 'samezerofloat': samezerofloat, 'sameonefloat': sameonefloat, 'randomint': randomint, 'randomfloat': randomfloat}
filldict = {'indexone': 0, "samezeroint": 0, 'sameoneint': 1, 'samezerofloat': 0.0,
            'sameonefloat': 1.0, 'randomint': random.randint(0,100), 'randomfloat': random.random()}
dtypedict = {'indexone': np.int8, "samezeroint": np.int8, 'sameoneint': np.int8, 'samezerofloat': np.float,
            'sameonefloat': np.float, 'randomint': np.int8, 'randomfloat': np.float}


dospar = {}
for l in testdict:
    try:
        fill = filldict[l]
    except KeyError:
        fill = None
    try:
        datatype = dtypedict[l]
    except KeyError:
        datatype = np.str
    if fill is None:
        sparr = pd.Series(pd.array(testdict[l], dtype=datatype))
    else:
        sparr = pd.Series(pd.SparseArray(testdict[l], fill_value=fill, dtype=datatype))
    dospar[l] = sparr
testdf = pd.DataFrame.from_dict(dospar, orient='columns')

# Test a single series

print("\n\nSeries: All zeroes")
samezerointseries = pd.Series(pd.SparseArray(testdict['samezeroint'], fill_value=0, dtype=np.int8))
print("\nOriginal")
print(samezerointseries)
samezero = samezerointseries.isin([0])
samezerozero = samezerointseries[samezero]
print("\nFiltered: should be identical to above")
print(samezerozero)
sameone = samezerointseries.isin([1])
samezeroone = samezerointseries[sameone]
print("\nFiltered: should be empty")
print(samezeroone)


print("\n\nDataframe:")
with pd.option_context('display.max_rows', None, 'display.max_columns',
                       None):  # more options can be specified also
    print(testdf)
    print(testdf.dtypes)

print("\n\nDataframe: should be identical to above")
intone = testdf.loc[:, 'sameoneint'].isin([int(1)])
print(intone)
onedf = testdf[intone]
with pd.option_context('display.max_rows', None, 'display.max_columns',
                       None):  # more options can be specified also
    print(onedf)
    print(onedf.dtypes)

Problem description

I've run into something odd with pandas data frames composed of sparse series. I am able to make the DF out of dictionaries of values with fills and types, no problem, but when I try to subset that DF, I get some very weird results. What I have been able to reproducibly show is that when subsetting a DF created out of Sparse series, if a column happens to be identical throughout (i.e. all entires match the fill value), the subset DF turns those columns into NaN's, and the dtype convert into float64.

When I run this test, I get the following result:

'''
Series: All zeroes

Original
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0
dtype: Sparse[int8, 0]

Filtered: should be identical to above
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0
dtype: Sparse[int8, 0]

Filtered: should be empty
Series([], dtype: Sparse[int8, 0])

Dataframe:
indexone samezeroint sameoneint samezerofloat sameonefloat randomint
0 0 0 1 0.0 1.0 95
1 1 0 1 0.0 1.0 13
2 2 0 1 0.0 1.0 20
3 3 0 1 0.0 1.0 29
4 4 0 1 0.0 1.0 44
5 5 0 1 0.0 1.0 82
6 6 0 1 0.0 1.0 22
7 7 0 1 0.0 1.0 91
8 8 0 1 0.0 1.0 51
9 9 0 1 0.0 1.0 13

randomfloat densesamezero
0 0.989141 0
1 0.948800 0
2 0.504600 0
3 0.014699 0
4 0.979710 0
5 0.663198 0
6 0.773480 0
7 0.921397 0
8 0.898101 0
9 0.445575 0
indexone Sparse[int8, 0]
samezeroint Sparse[int8, 0]
sameoneint Sparse[int8, 1]
samezerofloat Sparse[float64, 0.0]
sameonefloat Sparse[float64, 1.0]
randomint Sparse[int8, 24]
randomfloat Sparse[float64, 0.08770512017933063]
densesamezero int64
dtype: object

Dataframe: should be identical to above
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 True
8 True
9 True
Name: sameoneint, dtype: bool
indexone samezeroint sameoneint samezerofloat sameonefloat randomint
0 0 NaN NaN NaN NaN 95
1 1 NaN NaN NaN NaN 13
2 2 NaN NaN NaN NaN 20
3 3 NaN NaN NaN NaN 29
4 4 NaN NaN NaN NaN 44
5 5 NaN NaN NaN NaN 82
6 6 NaN NaN NaN NaN 22
7 7 NaN NaN NaN NaN 91
8 8 NaN NaN NaN NaN 51
9 9 NaN NaN NaN NaN 13

randomfloat densesamezero
0 0.989141 0
1 0.948800 0
2 0.504600 0
3 0.014699 0
4 0.979710 0
5 0.663198 0
6 0.773480 0
7 0.921397 0
8 0.898101 0
9 0.445575 0
indexone Sparse[int64, 0]
samezeroint Sparse[float64, 0]
sameoneint Sparse[float64, 1]
samezerofloat Sparse[float64, 0.0]
sameonefloat Sparse[float64, 1.0]
randomint Sparse[int8, 24]
randomfloat Sparse[float64, 0.08770512017933063]
densesamezero int64
dtype: object

As you can hopefully see, since I am subsetting "all rows where there is a 0 in the column that is all zeroes", I SHOULD be creating an identical DF - but instead only the Series where there is some variation in the column are preserved, the rest turn into all-NaNs.

For the subsetting command, I have tried every variant I could find:

newdf = testdf.loc[testdf['sameoneint'] == 1]

newdf =testdf.query('sameoneint == 1')

isone = testdf.loc[:, 'sameoneint'].isin([1])

newdf = testdf[isone]
None of those work any better, and some throw a warning about calling to_dense.

So, am I missing something in how I coded that, or is this something in how pandas works I have not yet figured out? Advice most welcome!

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : None
python : 3.7.3.final.0
python-bits : 64
OS : Linux
OS-release : 3.10.0-693.11.6.el7.x86_64
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 0.25.1
numpy : 1.17.2
pytz : 2019.2
dateutil : 2.8.0
pip : 19.3.1
setuptools : 41.2.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.1
numexpr : 2.7.0
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.3.1
sqlalchemy : None
tables : 3.5.2
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None

The text was updated successfully, but these errors were encountered:

chriswickham100 · 2019-11-05T23:28:36Z

Hey @mjobin, I'm going to take a look at this. I'm working through your example and I'll have some thoughts by tomorrow at some point.

mjobin · 2019-11-06T14:26:07Z

Wonderful, thank you! In the meantime, I put together a workaround that seems to further highlight the problem. If I add a few rows of random data to the end of a DataFrame, thus making sure no column can entirely match the fill value, the NaN-creating problem disappears.

chriswickham100 · 2019-11-06T15:25:46Z

Huh-- so I'm definitely getting the same results! I'm going to keep fooling around with it and the source code, and see if this warrants a documentation addition or a pull request. I'm relatively new to pandas, so if you or someone else comes up with something before I do, feel free to go ahead-- going to take my time.

gfyoung · 2019-11-06T20:00:53Z

cc @TomAugspurger @jorisvandenbossche

TomAugspurger · 2019-11-07T16:13:31Z

Is this a fully minimal example? I'm having trouble following it.

if a column happens to be identical throughout (i.e. all entires match the fill value), the subset DF turns those columns into NaN's, and the dtype convert into float64.

Which part of the example shows this? Is this it?

In [8]: pd.DataFrame({"A": pd.SparseArray([1., 1., 1.], fill_value=1.)}).loc[[0, 2]]
Out[8]:
    A
0 NaN
2 NaN

TomAugspurger · 2019-11-07T16:14:23Z

This sounds similar to #27781. Is this a duplicate?

mjobin · 2019-11-07T20:32:56Z

I tried to make the example as minimal as possible while showing when this does and does not occur. From the code I put in above, sparse columns where all values match the fill value turn into NaN's when the DF is manipulated — but this does not affect dense columns nor sparse columns where there is some variation in the values. Here's a smaller example that only compares a sparse and dense column of zeroes:
`import random
import numpy as np
import multiprocessing as mp
import pandas as pd

TEST_LINES = 10

samezeroint = [0 for i in range(TEST_LINES)]
densesamezeroint = [0 for i in range(TEST_LINES)]

testdict = {"samezeroint": samezeroint}
filldict = {"samezeroint": 0}
dtypedict = {"samezeroint": np.int8}

dospar = {}
for l in testdict:
try:
fill = filldict[l]
except KeyError:
fill = None
try:
datatype = dtypedict[l]
except KeyError:
datatype = np.str
if fill is None:
sparr = pd.Series(pd.array(testdict[l], dtype=datatype))
else:
sparr = pd.Series(pd.SparseArray(testdict[l], fill_value=fill, dtype=datatype))
dospar[l] = sparr
dospar["densesamezero"] = densesamezeroint
testdf = pd.DataFrame.from_dict(dospar, orient='columns')

Test a single series

print("\n\nINITIAL Dataframe:")
with pd.option_context('display.max_rows', None, 'display.max_columns',
None): # more options can be specified also
print(testdf)
print(testdf.dtypes)

print("\n\nSUBSET Dataframe: should be identical to above since i am selecting rows where an all-zero column is a zero, which is all of them.")
intone = testdf.loc[:, 'samezeroint'].isin([int(0)])
onedf = testdf[intone]
with pd.option_context('display.max_rows', None, 'display.max_columns',
None): # more options can be specified also
print(onedf)
print(onedf.dtypes)`

When I run this, I see the result:

`INITIAL Dataframe:
samezeroint densesamezero
0 0 0
1 0 0
2 0 0
3 0 0
4 0 0
5 0 0
6 0 0
7 0 0
8 0 0
9 0 0
samezeroint Sparse[int8, 0]
densesamezero int64
dtype: object

SUBSET Dataframe: should be identical to above since i am selecting rows where an all-zero column is a zero, which is all of them.
samezeroint densesamezero
0 NaN 0
1 NaN 0
2 NaN 0
3 NaN 0
4 NaN 0
5 NaN 0
6 NaN 0
7 NaN 0
8 NaN 0
9 NaN 0
samezeroint Sparse[float64, 0]
densesamezero int64
dtype: object`

This is similar to #27781, but with my larger example I wanted to show that this was not just restricted too all-zero columns. It happens when any sparse column consists entirely of entries that match the fill value.

mroeschke · 2020-04-20T00:36:24Z

I'll link your comment #29321 (comment) to #27781, as I believe solving #27781 will solve this issue as well.

Closing and continuing the discussion in #27781

gfyoung added DataFrame DataFrame data structure Sparse Sparse Data Type labels Nov 6, 2019

mroeschke closed this as completed Apr 20, 2020

mroeschke mentioned this issue Apr 20, 2020

Filtering dataframe with sparse column leads to NAs in sparse column #27781

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Pandas dataframe sub-setting returns NaN's when whole column matches fill value #29321

Pandas dataframe sub-setting returns NaN's when whole column matches fill value #29321

mjobin commented Nov 1, 2019 •

edited

Loading

INSTALLED VERSIONS

chriswickham100 commented Nov 5, 2019

mjobin commented Nov 6, 2019

chriswickham100 commented Nov 6, 2019

gfyoung commented Nov 6, 2019

TomAugspurger commented Nov 7, 2019

TomAugspurger commented Nov 7, 2019

mjobin commented Nov 7, 2019

mroeschke commented Apr 20, 2020

Pandas dataframe sub-setting returns NaN's when whole column matches fill value #29321

Pandas dataframe sub-setting returns NaN's when whole column matches fill value #29321

Comments

mjobin commented Nov 1, 2019 • edited Loading

Code Sample, a copy-pastable example if possible

Problem description

Output of pd.show_versions()

INSTALLED VERSIONS

chriswickham100 commented Nov 5, 2019

mjobin commented Nov 6, 2019

chriswickham100 commented Nov 6, 2019

gfyoung commented Nov 6, 2019

TomAugspurger commented Nov 7, 2019

TomAugspurger commented Nov 7, 2019

mjobin commented Nov 7, 2019

Test a single series

mroeschke commented Apr 20, 2020

mjobin commented Nov 1, 2019 •

edited

Loading

Output of `pd.show_versions()`