BUG: Printing DataFrame in Jupyter Notebook modifies DataFrame State #43815

arminherbsthofer · 2021-09-30T11:43:43Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the master branch of pandas.

Reproducible Example

### Notebook Cell 1 ###
import pandas as pd
data = pd.DataFrame([i for i in range(100)], index=pd.date_range(start="2000-01-01", freq="1H", periods=100), columns=["data_col_1"])
### Notebook Cell 2 ###
data
### Notebook Cell 3 ###
data = data.at_time("10:00")
data["data_col_2"] = 0
data.loc[data["data_col_1"] > 0.0, "data_col_2"] = 1

Issue Description

If I run the 3 notebook cells above, I get the following pandas warning:

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

However, if I only run cells 1 and 3, no warning is triggered. The only difference is cell 2, where we just display the DataFrame. To be more specific, running print(data) instead of just data also does not trigger the warning.

I do not believe this to be a Jupyter Notebook bug. It seems that simply running a cell containing data modifies the state of the DataFrame in some way, which is definitely not expected behavior. This bug is probably due to some pandas styling functions that are triggered in the background when a cell is run on a DataFrame.

As a side note, the bug also does not appear when all the code is run in one single cell (probably because no DataFrame is printed) or when the line data = data.at_time("10:00") is omitted.

Expected Behavior

The triggering of the warning should not be dependent on whether you run cell 2 or not.

Installed Versions

INSTALLED VERSIONS

commit : 73c6825
python : 3.8.3.final.0
python-bits : 64
OS : Linux
OS-release : 5.11.0-37-generic
Version : #41~20.04.2-Ubuntu SMP Fri Sep 24 09:06:38 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.3.3
numpy : 1.18.5
pytz : 2020.1
dateutil : 2.8.1
pip : 20.1.1
setuptools : 49.2.0.post20200714
Cython : 0.29.21
pytest : 5.4.3
hypothesis : None
sphinx : 3.1.2
blosc : None
feather : None
xlsxwriter : 1.2.9
lxml.etree : 4.5.2
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.16.1
pandas_datareader: None
bs4 : 4.9.1
bottleneck : 1.3.2
fsspec : 0.7.4
fastparquet : None
gcsfs : None
matplotlib : 3.2.2
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.4
pandas_gbq : None
pyarrow : 2.0.0
pyxlsb : None
s3fs : None
scipy : 1.5.0
sqlalchemy : 1.3.18
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
numba : 0.50.1


Liam3851 · 2021-10-01T01:18:51Z

I don't believe this is a bug, but you've hit an interesting corner case.

In Jupyter (actually in all IPython instances, including consoles), every result returned from a cell is kept in memory: IPython holds a reference to it in the Out dictionary and in a numbered variable. This allows you to reference the result of cell 2 as _2 later on, for example. These results are not garbage collected unless you explicitly delete them.
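The effect of such a cached reference can be sketched outside of IPython with the standard library alone. This is only an illustration under stated assumptions: `Frame` is a hypothetical stand-in for a DataFrame, and `out_cache` is a plain dict standing in for IPython's output cache, not IPython's actual implementation.

```python
import gc
import weakref


class Frame:
    """Hypothetical stand-in for a DataFrame; any weak-referenceable object works."""


def is_alive_after_rebinding(cache_result: bool) -> bool:
    """Return True if the first object survives after its name is rebound."""
    out_cache = {}             # assumption: plays the role of IPython's output cache
    data = Frame()             # "cell 1": create the object
    probe = weakref.ref(data)  # observe the object without keeping it alive
    if cache_result:
        out_cache[2] = data    # "cell 2": displaying the value caches a reference
    data = Frame()             # "cell 3": rebind the name to a new object
    gc.collect()               # collect anything that is no longer referenced
    return probe() is not None


print(is_alive_after_rebinding(cache_result=False))
print(is_alive_after_rebinding(cache_result=True))
```

On CPython, the first call should report that the original object is gone and the second that it is still alive, because the cached reference is the only thing keeping it reachable.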

When you run cell 3:

### Notebook Cell 3 ###
data = data.at_time("10:00")
data["data_col_2"] = 0

at_time creates a view on data from cell 1, and stores that view to the variable data. If cell 2 has not been run, the original data from cell 1 is immediately garbage collected since it has no references, and so the new view is the only version of data that exists. Thus the second line assigning 'data_col_2' is safe to run.

If cell 2 has been run, the data variable created in cell 1 still exists as _2. So you can think of cell 3 then as:

### Notebook Cell 3 ###
data = _2.at_time("10:00") # data is now a view of _2
data["data_col_2"] = 0 # this line would potentially modify both data and _2, and so warns

I'm not sure there's any way around this because a user could always do something like:

data # cell 2
.... # lots of cells
x = _2
data = data.at_time('10:00') # potentially conflicts with expectation of x

So even if there existed some way to detect whether the original DataFrame was stored as a Jupyter result (which I don't think there is), it is always possible for the result to be stored later under some other strong reference.
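On the user side, one common way to make the intent explicit is to take an explicit copy of the selection before modifying it. The sketch below is not from this thread; it assumes that calling `.copy()` on the result of `at_time` is acceptable for the use case, and the name `subset` is only illustrative. The hourly frequency is passed as a `pd.Timedelta` to avoid depending on a particular frequency alias.

```python
import pandas as pd

# Same data as in the report: 100 hourly rows starting 2000-01-01 00:00.
data = pd.DataFrame(
    {"data_col_1": list(range(100))},
    index=pd.date_range(start="2000-01-01", freq=pd.Timedelta(hours=1), periods=100),
)

# Explicit copy: subset is an independent DataFrame, so later assignments
# cannot be mistaken for writes into a slice of data.
subset = data.at_time("10:00").copy()
subset["data_col_2"] = 0
subset.loc[subset["data_col_1"] > 0, "data_col_2"] = 1

print(subset)
```

With the copy in place, the result no longer depends on whether the original DataFrame is still referenced elsewhere, for example as _2.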

mroeschke · 2021-10-02T00:42:07Z

Thanks for the explanation @Liam3851. Agreed, if this is an artifact of IPython, then there is not really anything more to do from the pandas side. Closing as a usage question.

arminherbsthofer added the Bug and Needs Triage labels Sep 30, 2021

mroeschke closed this as completed Oct 2, 2021

mroeschke added the Usage Question label and removed the Bug and Needs Triage labels Oct 2, 2021
