BUG: Printing DataFrame in Jupyter Notebook modifies DataFrame State #43815

Closed
3 tasks done
arminherbsthofer opened this issue Sep 30, 2021 · 2 comments

arminherbsthofer commented Sep 30, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the master branch of pandas.

Reproducible Example

### Notebook Cell 1 ###
import pandas as pd
data = pd.DataFrame([i for i in range(100)], index=pd.date_range(start="2000-01-01", freq="1H", periods=100), columns=["data_col_1"])
### Notebook Cell 2 ###
data
### Notebook Cell 3 ###
data = data.at_time("10:00")
data["data_col_2"] = 0
data.loc[data["data_col_1"] > 0.0, "data_col_2"] = 1

Issue Description

If I run the 3 notebook cells above, I get the following pandas warning:

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

However, if I run only cells 1 and 3, no warning is triggered. The only difference is cell 2, which just displays the DataFrame. More specifically, running print(data) instead of a bare data also does not trigger the warning.

I do not believe this is a Jupyter Notebook bug. It seems that simply running a cell containing data modifies the state of the DataFrame in some way, which is definitely not expected behavior. The bug is probably due to some pandas styling functions that are triggered in the background when a cell evaluates to a DataFrame.

As a side note, the bug also does not appear when all the code is run in one single cell (probably because no DataFrame is printed) or when the line data = data.at_time("10:00") is omitted.

Expected Behavior

The triggering of the warning should not be dependent on whether you run cell 2 or not.

Installed Versions

INSTALLED VERSIONS

commit : 73c6825
python : 3.8.3.final.0
python-bits : 64
OS : Linux
OS-release : 5.11.0-37-generic
Version : #41~20.04.2-Ubuntu SMP Fri Sep 24 09:06:38 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.3.3
numpy : 1.18.5
pytz : 2020.1
dateutil : 2.8.1
pip : 20.1.1
setuptools : 49.2.0.post20200714
Cython : 0.29.21
pytest : 5.4.3
hypothesis : None
sphinx : 3.1.2
blosc : None
feather : None
xlsxwriter : 1.2.9
lxml.etree : 4.5.2
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.16.1
pandas_datareader: None
bs4 : 4.9.1
bottleneck : 1.3.2
fsspec : 0.7.4
fastparquet : None
gcsfs : None
matplotlib : 3.2.2
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.4
pandas_gbq : None
pyarrow : 2.0.0
pyxlsb : None
s3fs : None
scipy : 1.5.0
sqlalchemy : 1.3.18
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
numba : 0.50.1

arminherbsthofer added the Bug and Needs Triage labels Sep 30, 2021
@Liam3851
Contributor

Liam3851 commented Oct 1, 2021

I don't believe this is a bug, but you've hit an interesting corner case.

In Jupyter (actually in all IPython sessions, including consoles), every result returned from a cell is kept in memory in the output history. This lets you reference the result of cell 2 later as _2, for example. These results are not garbage collected unless you explicitly delete them.
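The effect of that output history can be sketched without IPython or pandas. The snippet below is only an illustration under stated assumptions: `Frame` is a stand-in class for a DataFrame, and `out_cache` is a hypothetical dict playing the role of IPython's output history. It uses a weak reference to check whether the object from "cell 1" is still alive after `data` is rebound in "cell 3". A cell ending in print(data) returns None, so nothing would be cached in that case.

```python
import gc
import weakref


class Frame:
    """Stand-in for a DataFrame; instances of plain classes support weak references."""


def parent_survives(cache_output: bool) -> bool:
    out_cache = {}                # hypothetical stand-in for IPython's output history
    data = Frame()                # "cell 1": create the original object
    if cache_output:
        out_cache[2] = data       # "cell 2": displaying the result keeps a strong reference
    parent = weakref.ref(data)    # watch the original without keeping it alive
    data = Frame()                # "cell 3": rebind the name to a new object
    gc.collect()                  # make sure any unreachable object has been collected
    return parent() is not None   # True if the original object is still alive


print(parent_survives(cache_output=False))
print(parent_survives(cache_output=True))
```

When nothing caches the result, the original object has no remaining references once `data` is rebound; when the cache holds it, the object stays alive.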

When you run cell 3:

### Notebook Cell 3 ###
data = data.at_time("10:00")
data["data_col_2"] = 0

at_time creates a view on data from cell 1, and stores that view to the variable data. If cell 2 has not been run, the original data from cell 1 is immediately garbage collected since it has no references, and so the new view is the only version of data that exists. Thus the second line assigning 'data_col_2' is safe to run.

If cell 2 has been run, the data variable created in cell 1 still exists as _2. So you can think of cell 3 then as:

### Notebook Cell 3 ###
data = _2.at_time("10:00") # data is now a view of _2
data["data_col_2"] = 0 # this line would potentially modify both data and _2, and so warns

I'm not sure there's any way around this because a user could always do something like:

data # cell 2
.... # lots of cells
x = _2
data = data.at_time('10:00') # potentially conflicts with expectation of x

so even if there existed some way to detect if the original DataFrame were stored as a Jupyter result (which I don't think there is) it is always possible for the result to be stored later as some other strong reference.
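One way to make cell 3 behave the same whether or not cell 2 was run is to take an explicit copy of the selection before assigning to it. This is not proposed in the thread; it is the general pandas idiom for saying "I want an independent frame". The sketch below wraps the cells in a hypothetical helper, `flag_at_ten`, and passes the hourly frequency as a `pd.Timedelta` rather than the "1H" string from the report, purely so it runs on a wider range of pandas versions.

```python
import pandas as pd


def flag_at_ten() -> pd.DataFrame:
    # Same data as cell 1: 100 hourly rows starting at 2000-01-01 00:00.
    data = pd.DataFrame(
        {"data_col_1": list(range(100))},
        index=pd.date_range(
            start="2000-01-01", periods=100, freq=pd.Timedelta(hours=1)
        ),
    )
    # .copy() yields an independent frame, so the assignments below cannot
    # affect the original, however many references to it still exist.
    data = data.at_time("10:00").copy()
    data["data_col_2"] = 0
    data.loc[data["data_col_1"] > 0.0, "data_col_2"] = 1
    return data


result = flag_at_ten()
print(result)
```

The copy costs some extra memory, but the result no longer depends on whether an earlier cell happened to display the original frame.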

@mroeschke
Member

Thanks for the explanation @Liam3851. Agreed: if this is an artifact of IPython, then there is not really anything more to do on the pandas side. Closing as a usage question.

mroeschke added the Usage Question label and removed the Bug and Needs Triage labels Oct 2, 2021