-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
Pandas dataframe sub-setting returns NaN's when whole column matches fill value #29321
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Hey @mjobin, I'm going to take a look at this. I'm working through your example and I'll have some thoughts by tomorrow at some point. |
Wonderful, thank you! In the meantime, I put together a workaround that seems to further highlight the problem. If I add a few rows of random data to the end of a DataFrame, thus making sure no column can entirely match the fill value, the NaN-creating problem disappears. |
Huh-- so I'm definitely getting the same results! I'm going to keep fooling around with it and the source code, and see if this warrants a documentation addition or a pull request. I'm relatively new to pandas, so if you or someone else comes up with something before I do, feel free to go ahead-- going to take my time. |
Is this a fully minimal example? I'm having trouble following it.
Which part of the example shows this? Is this it?
|
This sounds similar to #27781. Is this a duplicate? |
I tried to make the example as minimal as possible while showing when this does and does not occur. From the code I put in above, sparse columns where all values match the fill value turn into NaN's when the DF is manipulated — but this does not affect dense columns nor sparse columns where there is some variation in the values. Here's a smaller example that only compares a sparse and dense column of zeroes: TEST_LINES = 10 samezeroint = [0 for i in range(TEST_LINES)] testdict = {"samezeroint": samezeroint} dospar = {} Test a single seriesprint("\n\nINITIAL Dataframe:") print("\n\nSUBSET Dataframe: should be identical to above since i am selecting rows where an all-zero column is a zero, which is all of them.") When I run this, I see the result: `INITIAL Dataframe: SUBSET Dataframe: should be identical to above since i am selecting rows where an all-zero column is a zero, which is all of them. This is similar to #27781, but with my larger example I wanted to show that this was not just restricted too all-zero columns. It happens when any sparse column consists entirely of entries that match the fill value. |
I'll link your comment #29321 (comment) to #27781, as I believe solving #27781 will solve this issue as well. Closing and continuing the discussion in #27781 |
Code Sample, a copy-pastable example if possible
Problem description
I've run into something odd with pandas data frames composed of sparse series. I am able to make the DF out of dictionaries of values with fills and types, no problem, but when I try to subset that DF, I get some very weird results. What I have been able to reproducibly show is that when subsetting a DF created out of Sparse series, if a column happens to be identical throughout (i.e. all entires match the fill value), the subset DF turns those columns into NaN's, and the dtype convert into float64.
When I run this test, I get the following result:
'''
Series: All zeroes
Original
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0
dtype: Sparse[int8, 0]
Filtered: should be identical to above
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0
dtype: Sparse[int8, 0]
Filtered: should be empty
Series([], dtype: Sparse[int8, 0])
Dataframe:
indexone samezeroint sameoneint samezerofloat sameonefloat randomint
0 0 0 1 0.0 1.0 95
1 1 0 1 0.0 1.0 13
2 2 0 1 0.0 1.0 20
3 3 0 1 0.0 1.0 29
4 4 0 1 0.0 1.0 44
5 5 0 1 0.0 1.0 82
6 6 0 1 0.0 1.0 22
7 7 0 1 0.0 1.0 91
8 8 0 1 0.0 1.0 51
9 9 0 1 0.0 1.0 13
randomfloat densesamezero
0 0.989141 0
1 0.948800 0
2 0.504600 0
3 0.014699 0
4 0.979710 0
5 0.663198 0
6 0.773480 0
7 0.921397 0
8 0.898101 0
9 0.445575 0
indexone Sparse[int8, 0]
samezeroint Sparse[int8, 0]
sameoneint Sparse[int8, 1]
samezerofloat Sparse[float64, 0.0]
sameonefloat Sparse[float64, 1.0]
randomint Sparse[int8, 24]
randomfloat Sparse[float64, 0.08770512017933063]
densesamezero int64
dtype: object
Dataframe: should be identical to above
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 True
8 True
9 True
Name: sameoneint, dtype: bool
indexone samezeroint sameoneint samezerofloat sameonefloat randomint
0 0 NaN NaN NaN NaN 95
1 1 NaN NaN NaN NaN 13
2 2 NaN NaN NaN NaN 20
3 3 NaN NaN NaN NaN 29
4 4 NaN NaN NaN NaN 44
5 5 NaN NaN NaN NaN 82
6 6 NaN NaN NaN NaN 22
7 7 NaN NaN NaN NaN 91
8 8 NaN NaN NaN NaN 51
9 9 NaN NaN NaN NaN 13
randomfloat densesamezero
0 0.989141 0
1 0.948800 0
2 0.504600 0
3 0.014699 0
4 0.979710 0
5 0.663198 0
6 0.773480 0
7 0.921397 0
8 0.898101 0
9 0.445575 0
indexone Sparse[int64, 0]
samezeroint Sparse[float64, 0]
sameoneint Sparse[float64, 1]
samezerofloat Sparse[float64, 0.0]
sameonefloat Sparse[float64, 1.0]
randomint Sparse[int8, 24]
randomfloat Sparse[float64, 0.08770512017933063]
densesamezero int64
dtype: object
As you can hopefully see, since I am subsetting "all rows where there is a 0 in the column that is all zeroes", I SHOULD be creating an identical DF - but instead only the Series where there is some variation in the column are preserved, the rest turn into all-NaNs.
For the subsetting command, I have tried every variant I could find:
newdf = testdf.loc[testdf['sameoneint'] == 1]
newdf =testdf.query('sameoneint == 1')
isone = testdf.loc[:, 'sameoneint'].isin([1])
newdf = testdf[isone]
None of those work any better, and some throw a warning about calling to_dense.
So, am I missing something in how I coded that, or is this something in how pandas works I have not yet figured out? Advice most welcome!
Output of
pd.show_versions()
INSTALLED VERSIONS
commit : None
python : 3.7.3.final.0
python-bits : 64
OS : Linux
OS-release : 3.10.0-693.11.6.el7.x86_64
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 0.25.1
numpy : 1.17.2
pytz : 2019.2
dateutil : 2.8.0
pip : 19.3.1
setuptools : 41.2.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.1
numexpr : 2.7.0
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.3.1
sqlalchemy : None
tables : 3.5.2
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
The text was updated successfully, but these errors were encountered: