Enabling chained assignment checks (SettingWithCopyWarning) can have huge performance impact #18743

bluenote10 · 2017-12-12T15:20:09Z

Similar to an observation on reddit, I noticed a huge performance difference between the pandas default pd.options.mode.chained_assignment = 'warn' and setting it to None.

Code Sample

import time
import pandas as pd
import numpy as np

def gen_data(N=10000):
    df = pd.DataFrame(index=range(N))
    for c in range(10):
        df[str(c)] = np.random.uniform(size=N)
    df["id"] = np.random.choice(range(500), size=len(df))
    return df

def do_something_on_df(df):
    """ Dummy computation that contains inplace mutations """
    for c in range(df.shape[1]):
        df[str(c)] = np.random.uniform(size=df.shape[0])
    return 42

def run_test(mode="warn"):
    pd.options.mode.chained_assignment = mode

    df = gen_data()

    t1 = time.time()
    for key, group_df in df.groupby("id"):
        do_something_on_df(group_df)
    t2 = time.time()
    print("Runtime: {:10.3f} sec".format(t2 - t1))

if __name__ == "__main__":
    run_test(mode="warn")
    run_test(mode=None)

Problem description

The run times vary a lot depending on whether the SettingWithCopyWarning is enabled or disabled. In each pair below, the first runtime is with mode="warn" and the second with mode=None. I tried a few different pandas/Python versions:

Debian VM, Python 3.6.2, pandas 0.21.0
Runtime:     46.693 sec
Runtime:      0.731 sec

Debian VM, Python 2.7.9, pandas 0.20.0
Runtime:    101.204 sec
Runtime:      0.622 sec

Ubuntu (host), Python 2.7.3, pandas 0.21.0
Runtime:     35.363 sec
Runtime:      0.517 sec

Ideally, there should not be such a big penalty for SettingWithCopyWarning.

From profiling results it looks like the reason might be this call to gc.collect.
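To give a feel for why a gc.collect call inside a hot path could dominate the runtime, here is a minimal, pandas-free sketch. It is not the pandas code path itself: the helper name time_loop, the object count, and the iteration count are all made up for illustration. It only shows that forcing a full collection on every iteration is far more expensive than the loop body when many objects are alive.

```python
import gc
import time


def time_loop(iterations, collect_each_time):
    """Time a trivial loop, optionally forcing a full collection per iteration."""
    start = time.perf_counter()
    for _ in range(iterations):
        if collect_each_time:
            gc.collect(2)  # full collection across all generations
    return time.perf_counter() - start


# Keep many container objects alive so each full collection has work to do.
# (Hypothetical workload; real programs hold DataFrames, dicts, etc.)
live_objects = [[i] for i in range(200_000)]

plain = time_loop(50, collect_each_time=False)
with_gc = time_loop(50, collect_each_time=True)

print("without gc.collect: {:10.4f} sec".format(plain))
print("with gc.collect:    {:10.4f} sec".format(with_gc))
```

The absolute numbers depend on the machine and on how many objects the process holds, so the cost in a real pandas workload may look quite different.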

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.2.final.0
python-bits: 64
OS: Linux
OS-release: 3.16.0-4-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.21.0
pytest: None
pip: 9.0.1
setuptools: 28.8.0
Cython: None
numpy: 1.13.3
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

jreback · 2017-12-12T15:28:21Z

of course, this has to run the garbage collector. You can certainly just disable the checks. This won't be fixed in pandas 2.
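One way to follow the suggestion of disabling the check without changing it globally is to scope it to the hot loop with pd.option_context. This is only a sketch under assumptions: the small frame and the fill_random helper are hypothetical stand-ins for the groupby loop in the report, and it assumes the mode.chained_assignment option is available in the installed pandas version.

```python
import numpy as np
import pandas as pd


def fill_random(group_df):
    """Hypothetical stand-in for the per-group in-place mutation in the report."""
    group_df["noise"] = np.random.uniform(size=len(group_df))
    return len(group_df)


df = pd.DataFrame({"id": [0, 0, 1, 1, 2], "x": np.arange(5.0)})

previous = pd.get_option("mode.chained_assignment")

# Turn the chained-assignment check off only for the duration of the loop.
with pd.option_context("mode.chained_assignment", None):
    inside = pd.get_option("mode.chained_assignment")
    sizes = [fill_random(group_df) for _, group_df in df.groupby("id")]

# The previous setting is restored when the with-block exits.
restored = pd.get_option("mode.chained_assignment")
```

Scoping the change this way keeps the warning active for the rest of the program, where it may still catch genuine chained-assignment mistakes.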

bluenote10 · 2017-12-12T16:19:58Z

It would probably be helpful to document the performance impact more clearly. This can have subtle side effects, which are very hard to find. I only noticed it, because a Dask/Distributed computation was much slower than expected (use case documented on SO)

jreback · 2017-12-12T16:28:51Z

and if u want to put up a PR would be happy to take it

TomAugspurger · 2019-09-10T14:06:35Z

Has this been fixed in the meantime? Running the script from the original post, I see

../foo.py:15: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[str(c)] = np.random.uniform(size=df.shape[0])
Runtime:      0.749 sec
Runtime:      0.668 sec

I'm using pandas master, Python 3.7 on MacOS.

TomAugspurger · 2019-09-10T14:08:30Z

#27031 seems to be the most likely fix. Thanks Jeff. (#27585 possibly helped, but that's less certain).

jreback · 2019-09-10T14:27:46Z

this should be ok now @TomAugspurger is that not what u r seeing?

TomAugspurger · 2019-09-10T14:34:14Z

I'm seeing that it's fixed now, just wanted to clarify since we had some Dask users reporting issues (but they're likely on an older pandas).

jreback closed this as completed Dec 12, 2017

jreback added the Compat (pandas objects compatibility with NumPy or Python functions) and Indexing (related to indexing on series/frames, not to indexes themselves) labels Dec 12, 2017

jreback added this to the won't fix milestone Dec 12, 2017

TomAugspurger modified the milestones: won't fix, No action Jul 6, 2018

TomAugspurger mentioned this issue Sep 10, 2019

Workers stuck, increased memory usage while processing large CSV from S3. dask/distributed#1467

Closed
