BUG: multiple application of .loc on sparse DataFrame results in NaN and filling the DataFrame #34687
Comments
On master I have

In [41]: df.loc[range(5)].loc[range(3)].sparse.density
Out[41]: 0.1

But for some reason the last columns are cast to Sparse[int] dtype

In [53]: df.loc[range(5)].loc[range(3)].dtypes
Out[53]:
0 Sparse[float64, 0]
1 Sparse[float64, 0]
2 Sparse[float64, 0]
3 Sparse[float64, 0]
4 Sparse[float64, 0]
5 Sparse[int64, 0]
6 Sparse[int64, 0]
7 Sparse[int64, 0]
8 Sparse[int64, 0]
9 Sparse[int64, 0]
dtype: object

Not sure what's going on. |
Quite strange that we get different results and that you cannot reproduce my output. Here is the commit I have installed (bbb89ca).
Not sure it matters, but I'm using Python 3.6.5 on macOS 10.15.2. BTW, the dtype issue you ran into might be related to this other issue: #34540 |
I was looking into this issue and must add it happens when the second
Another observation is that the number of columns it keeps depends on the first iloc lookup. See:
|
This works now on master. Dtypes are fine and the results equal the expected output. May need tests |
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
Problem description
It seems that a sparse DataFrame extracted with loc does not behave like the original one. This creates inconsistencies in our processing pipelines: depending on the filtering and selection that have been applied, the result sometimes contains NaNs and gets filled in, which severely impacts memory consumption and computation time.
Expected Output
The output of loc should not depend on how many times the selection is applied, i.e. the following two expressions should give the same result:
df.loc[range(5)].loc[range(3)]
df.loc[range(3)]
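The original code sample is not preserved in this text, so the following is only a sketch of how the expectation above could be checked. The 10x10 identity matrix is an assumption (it is consistent with the 10 columns, the `Sparse[float64, 0]` dtype and the density of 0.1 reported in the comments, but it is not necessarily the frame used in the report), and `summarize` is a hypothetical helper introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Assumption: stand-in data, not the original sample from the report.
# A 10x10 identity matrix stored as sparse float64 with fill value 0.
df = pd.DataFrame(np.eye(10)).astype(pd.SparseDtype("float64", 0))

# Chained selection (the case reported as broken) vs. direct selection.
chained = df.loc[range(5)].loc[range(3)]
direct = df.loc[range(3)]


def summarize(frame):
    """Hypothetical helper: collect the properties discussed in this issue."""
    dense = frame.sparse.to_dense()
    return {
        "shape": frame.shape,
        "density": frame.sparse.density,
        "has_nan": bool(dense.isna().to_numpy().any()),
        "dtypes": sorted({str(dtype) for dtype in frame.dtypes}),
    }


print("chained:", summarize(chained))
print("direct: ", summarize(direct))
```

If the two summaries differ (NaNs in the chained result, a density close to 1, or mixed dtypes), the installed pandas version still shows the behaviour described here.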
Output of
pd.show_versions()
INSTALLED VERSIONS
commit : None
pandas : 1.0.4
numpy : 1.18.5
pytz : 2020.1
dateutil : 2.8.1
pip : 9.0.3
setuptools : 39.0.1
Cython : 0.29.19
pytest : 5.4.3
hypothesis : 5.16.1
sphinx : 3.1.0
blosc : 1.9.1
feather : None
xlsxwriter : 1.2.9
lxml.etree : 4.5.1
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.15.0
pandas_datareader: None
bs4 : 4.9.1
bottleneck : 1.3.2
fastparquet : 0.4.0
gcsfs : None
lxml.etree : 4.5.1
matplotlib : 3.2.1
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : 0.17.1
pytables : None
pytest : 5.4.3
pyxlsb : None
s3fs : 0.4.2
scipy : 1.4.1
sqlalchemy : 1.3.17
tables : 3.6.1
tabulate : 0.8.7
xarray : 0.15.1
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.9
numba : 0.49.1