Skip to content

Pandas dataframe sub-setting returns NaN's when whole column matches fill value #29321

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
mjobin opened this issue Nov 1, 2019 · 8 comments
Closed
Labels
DataFrame DataFrame data structure Sparse Sparse Data Type

Comments

@mjobin
Copy link

mjobin commented Nov 1, 2019

Code Sample, a copy-pastable example if possible

import random
import numpy as np
import multiprocessing as mp
import pandas as pd


TEST_LINES = 10


samezeroint = [0 for i in range(TEST_LINES)]
sameoneint = [1 for i in range(TEST_LINES)]
samezerofloat = [0.0 for i in range(TEST_LINES)]
sameonefloat = [1.0 for i in range(TEST_LINES)]
indexone = [i for i in range(TEST_LINES)]

randomint = []
randomfloat = []

for i in range(TEST_LINES):
    randomint.append(random.randint(0,100))
    randomfloat.append(random.random())

testdict = {'indexone': indexone, "samezeroint": samezeroint, 'sameoneint': sameoneint, 'samezerofloat': samezerofloat, 'sameonefloat': sameonefloat, 'randomint': randomint, 'randomfloat': randomfloat}
filldict = {'indexone': 0, "samezeroint": 0, 'sameoneint': 1, 'samezerofloat': 0.0,
            'sameonefloat': 1.0, 'randomint': random.randint(0,100), 'randomfloat': random.random()}
dtypedict = {'indexone': np.int8, "samezeroint": np.int8, 'sameoneint': np.int8, 'samezerofloat': np.float,
            'sameonefloat': np.float, 'randomint': np.int8, 'randomfloat': np.float}


dospar = {}
for l in testdict:
    try:
        fill = filldict[l]
    except KeyError:
        fill = None
    try:
        datatype = dtypedict[l]
    except KeyError:
        datatype = np.str
    if fill is None:
        sparr = pd.Series(pd.array(testdict[l], dtype=datatype))
    else:
        sparr = pd.Series(pd.SparseArray(testdict[l], fill_value=fill, dtype=datatype))
    dospar[l] = sparr
testdf = pd.DataFrame.from_dict(dospar, orient='columns')

# Test a single series

print("\n\nSeries: All zeroes")
samezerointseries = pd.Series(pd.SparseArray(testdict['samezeroint'], fill_value=0, dtype=np.int8))
print("\nOriginal")
print(samezerointseries)
samezero = samezerointseries.isin([0])
samezerozero = samezerointseries[samezero]
print("\nFiltered: should be identical to above")
print(samezerozero)
sameone = samezerointseries.isin([1])
samezeroone = samezerointseries[sameone]
print("\nFiltered: should be empty")
print(samezeroone)


print("\n\nDataframe:")
with pd.option_context('display.max_rows', None, 'display.max_columns',
                       None):  # more options can be specified also
    print(testdf)
    print(testdf.dtypes)

print("\n\nDataframe: should be identical to above")
intone = testdf.loc[:, 'sameoneint'].isin([int(1)])
print(intone)
onedf = testdf[intone]
with pd.option_context('display.max_rows', None, 'display.max_columns',
                       None):  # more options can be specified also
    print(onedf)
    print(onedf.dtypes)

Problem description

I've run into something odd with pandas data frames composed of sparse series. I am able to make the DF out of dictionaries of values with fills and types, no problem, but when I try to subset that DF, I get some very weird results. What I have been able to reproducibly show is that when subsetting a DF created out of Sparse series, if a column happens to be identical throughout (i.e. all entires match the fill value), the subset DF turns those columns into NaN's, and the dtype convert into float64.

When I run this test, I get the following result:

'''
Series: All zeroes

Original
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0
dtype: Sparse[int8, 0]

Filtered: should be identical to above
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0
dtype: Sparse[int8, 0]

Filtered: should be empty
Series([], dtype: Sparse[int8, 0])

Dataframe:
indexone samezeroint sameoneint samezerofloat sameonefloat randomint
0 0 0 1 0.0 1.0 95
1 1 0 1 0.0 1.0 13
2 2 0 1 0.0 1.0 20
3 3 0 1 0.0 1.0 29
4 4 0 1 0.0 1.0 44
5 5 0 1 0.0 1.0 82
6 6 0 1 0.0 1.0 22
7 7 0 1 0.0 1.0 91
8 8 0 1 0.0 1.0 51
9 9 0 1 0.0 1.0 13

randomfloat densesamezero
0 0.989141 0
1 0.948800 0
2 0.504600 0
3 0.014699 0
4 0.979710 0
5 0.663198 0
6 0.773480 0
7 0.921397 0
8 0.898101 0
9 0.445575 0
indexone Sparse[int8, 0]
samezeroint Sparse[int8, 0]
sameoneint Sparse[int8, 1]
samezerofloat Sparse[float64, 0.0]
sameonefloat Sparse[float64, 1.0]
randomint Sparse[int8, 24]
randomfloat Sparse[float64, 0.08770512017933063]
densesamezero int64
dtype: object

Dataframe: should be identical to above
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 True
8 True
9 True
Name: sameoneint, dtype: bool
indexone samezeroint sameoneint samezerofloat sameonefloat randomint
0 0 NaN NaN NaN NaN 95
1 1 NaN NaN NaN NaN 13
2 2 NaN NaN NaN NaN 20
3 3 NaN NaN NaN NaN 29
4 4 NaN NaN NaN NaN 44
5 5 NaN NaN NaN NaN 82
6 6 NaN NaN NaN NaN 22
7 7 NaN NaN NaN NaN 91
8 8 NaN NaN NaN NaN 51
9 9 NaN NaN NaN NaN 13

randomfloat densesamezero
0 0.989141 0
1 0.948800 0
2 0.504600 0
3 0.014699 0
4 0.979710 0
5 0.663198 0
6 0.773480 0
7 0.921397 0
8 0.898101 0
9 0.445575 0
indexone Sparse[int64, 0]
samezeroint Sparse[float64, 0]
sameoneint Sparse[float64, 1]
samezerofloat Sparse[float64, 0.0]
sameonefloat Sparse[float64, 1.0]
randomint Sparse[int8, 24]
randomfloat Sparse[float64, 0.08770512017933063]
densesamezero int64
dtype: object

As you can hopefully see, since I am subsetting "all rows where there is a 0 in the column that is all zeroes", I SHOULD be creating an identical DF - but instead only the Series where there is some variation in the column are preserved, the rest turn into all-NaNs.

For the subsetting command, I have tried every variant I could find:

newdf = testdf.loc[testdf['sameoneint'] == 1]

newdf =testdf.query('sameoneint == 1')

isone = testdf.loc[:, 'sameoneint'].isin([1])

newdf = testdf[isone]
None of those work any better, and some throw a warning about calling to_dense.

So, am I missing something in how I coded that, or is this something in how pandas works I have not yet figured out? Advice most welcome!

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.3.final.0
python-bits : 64
OS : Linux
OS-release : 3.10.0-693.11.6.el7.x86_64
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 0.25.1
numpy : 1.17.2
pytz : 2019.2
dateutil : 2.8.0
pip : 19.3.1
setuptools : 41.2.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.1
numexpr : 2.7.0
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.3.1
sqlalchemy : None
tables : 3.5.2
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None

@chriswickham100
Copy link

Hey @mjobin, I'm going to take a look at this. I'm working through your example and I'll have some thoughts by tomorrow at some point.

@mjobin
Copy link
Author

mjobin commented Nov 6, 2019

Wonderful, thank you! In the meantime, I put together a workaround that seems to further highlight the problem. If I add a few rows of random data to the end of a DataFrame, thus making sure no column can entirely match the fill value, the NaN-creating problem disappears.

@chriswickham100
Copy link

Huh-- so I'm definitely getting the same results! I'm going to keep fooling around with it and the source code, and see if this warrants a documentation addition or a pull request. I'm relatively new to pandas, so if you or someone else comes up with something before I do, feel free to go ahead-- going to take my time.

@gfyoung gfyoung added DataFrame DataFrame data structure Sparse Sparse Data Type labels Nov 6, 2019
@gfyoung
Copy link
Member

gfyoung commented Nov 6, 2019

@TomAugspurger
Copy link
Contributor

Is this a fully minimal example? I'm having trouble following it.

if a column happens to be identical throughout (i.e. all entires match the fill value), the subset DF turns those columns into NaN's, and the dtype convert into float64.

Which part of the example shows this? Is this it?

In [8]: pd.DataFrame({"A": pd.SparseArray([1., 1., 1.], fill_value=1.)}).loc[[0, 2]]
Out[8]:
    A
0 NaN
2 NaN

@TomAugspurger
Copy link
Contributor

This sounds similar to #27781. Is this a duplicate?

@mjobin
Copy link
Author

mjobin commented Nov 7, 2019

I tried to make the example as minimal as possible while showing when this does and does not occur. From the code I put in above, sparse columns where all values match the fill value turn into NaN's when the DF is manipulated — but this does not affect dense columns nor sparse columns where there is some variation in the values. Here's a smaller example that only compares a sparse and dense column of zeroes:
`import random
import numpy as np
import multiprocessing as mp
import pandas as pd

TEST_LINES = 10

samezeroint = [0 for i in range(TEST_LINES)]
densesamezeroint = [0 for i in range(TEST_LINES)]

testdict = {"samezeroint": samezeroint}
filldict = {"samezeroint": 0}
dtypedict = {"samezeroint": np.int8}

dospar = {}
for l in testdict:
try:
fill = filldict[l]
except KeyError:
fill = None
try:
datatype = dtypedict[l]
except KeyError:
datatype = np.str
if fill is None:
sparr = pd.Series(pd.array(testdict[l], dtype=datatype))
else:
sparr = pd.Series(pd.SparseArray(testdict[l], fill_value=fill, dtype=datatype))
dospar[l] = sparr
dospar["densesamezero"] = densesamezeroint
testdf = pd.DataFrame.from_dict(dospar, orient='columns')

Test a single series

print("\n\nINITIAL Dataframe:")
with pd.option_context('display.max_rows', None, 'display.max_columns',
None): # more options can be specified also
print(testdf)
print(testdf.dtypes)

print("\n\nSUBSET Dataframe: should be identical to above since i am selecting rows where an all-zero column is a zero, which is all of them.")
intone = testdf.loc[:, 'samezeroint'].isin([int(0)])
onedf = testdf[intone]
with pd.option_context('display.max_rows', None, 'display.max_columns',
None): # more options can be specified also
print(onedf)
print(onedf.dtypes)`

When I run this, I see the result:

`INITIAL Dataframe:
samezeroint densesamezero
0 0 0
1 0 0
2 0 0
3 0 0
4 0 0
5 0 0
6 0 0
7 0 0
8 0 0
9 0 0
samezeroint Sparse[int8, 0]
densesamezero int64
dtype: object

SUBSET Dataframe: should be identical to above since i am selecting rows where an all-zero column is a zero, which is all of them.
samezeroint densesamezero
0 NaN 0
1 NaN 0
2 NaN 0
3 NaN 0
4 NaN 0
5 NaN 0
6 NaN 0
7 NaN 0
8 NaN 0
9 NaN 0
samezeroint Sparse[float64, 0]
densesamezero int64
dtype: object`

This is similar to #27781, but with my larger example I wanted to show that this was not just restricted too all-zero columns. It happens when any sparse column consists entirely of entries that match the fill value.

@mroeschke
Copy link
Member

I'll link your comment #29321 (comment) to #27781, as I believe solving #27781 will solve this issue as well.

Closing and continuing the discussion in #27781

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
DataFrame DataFrame data structure Sparse Sparse Data Type
Projects
None yet
Development

No branches or pull requests

5 participants