PERF: Memory leak with pandas.Series.str.replace() #45277

ErikinBC · 2022-01-08T22:24:14Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this issue exists on the latest version of pandas.
I have confirmed this issue exists on the master branch of pandas.

Reproducible Example

I first noticed a memory leak when I needed to do a for loop in a model class and I was applying str.replace for a Pandas Series. The code below shows that when using np.vectorize on re.sub there is no problem, whereas doing to the same thing with str.replace causes the memory to grow linearly over the for loop.

import re
import os
import psutil
import string
import numpy as np
import pandas as pd

def all_mem():
    process = psutil.Process(os.getpid())
    mi = process.memory_info()
    return sum([z for z in mi])

def gsub(string, pat, rep):
    return re.sub(pat, rep, string)

str_replace = np.vectorize(gsub, excluded=['pat','rep'])

words = pd.Series(np.repeat(string.ascii_lowercase,10000))

# (i) No memory leak with np.vectorize
mem_start = all_mem()
for i in range(1000):
    if (i+1) % 100 == 0:
        mem_pct = 100*(all_mem()/mem_start - 1)
        print('Iteration %i, mem: %0.1f%%' % (i+1, 100*mem_pct))
    n = len(str_replace(words, '[^abc]', ''))

Iteration 100, mem: 9.0%
Iteration 200, mem: 9.0%
Iteration 300, mem: 9.0%
Iteration 400, mem: 9.0%
Iteration 500, mem: 9.0%
Iteration 600, mem: 9.0%
Iteration 700, mem: 9.0%
Iteration 800, mem: 9.0%
Iteration 900, mem: 9.0%
Iteration 1000, mem: 9.0%

# (ii) Memory leak pd.Series.str.replace
mem_start = all_mem()
for i in range(1000):
    if (i+1) % 100 == 0:
        mem_pct = 100*(all_mem()/mem_start - 1)
        print('Iteration %i, mem: %0.1f%%' % (i+1, 100*mem_pct))
    n = len(words.str.replace('[^abc]','',regex=True))

Iteration 100, mem: 291.2%
Iteration 200, mem: 590.3%
Iteration 300, mem: 888.5%
Iteration 400, mem: 1186.7%
Iteration 500, mem: 1485.7%
Iteration 600, mem: 1784.0%
Iteration 700, mem: 2083.0%
Iteration 800, mem: 2381.2%
Iteration 900, mem: 2679.5%
Iteration 1000, mem: 2978.5%

Installed Versions

INSTALLED VERSIONS

commit : 66e3805
python : 3.10.1.final.0
python-bits : 64
OS : Linux
OS-release : 5.10.60.1-microsoft-standard-WSL2
Version : #1 SMP Wed Aug 25 23:20:18 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.3.5
numpy : 1.22.0
pytz : 2021.3
dateutil : 2.8.2
setuptools : 60.3.1
pip : 21.3.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.3
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.5.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.7.3
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

Prior Performance

No response

The text was updated successfully, but these errors were encountered:

mroeschke · 2022-01-08T22:41:47Z

#41357 may be related

mroeschke · 2022-01-09T00:46:34Z

Actually could you try on the newly released 1.4.0rc. I don't see the same behavior on master

In [1]: import re
   ...: import os
   ...: import psutil
   ...: import string
   ...: import numpy as np
   ...: import pandas as pd
   ...:
   ...: def all_mem():
   ...:     process = psutil.Process(os.getpid())
   ...:     mi = process.memory_info()
   ...:     return sum([z for z in mi])
   ...:

In [2]: words = pd.Series(np.repeat(string.ascii_lowercase,10000))

In [3]: mem_start = all_mem()
   ...: for i in range(1000):
   ...:     if (i+1) % 100 == 0:
   ...:         mem_pct = 100*(all_mem()/mem_start - 1)
   ...:         print('Iteration %i, mem: %0.1f%%' % (i+1, 100*mem_pct))
   ...:     n = len(words.str.replace('[^abc]','',regex=True))
   ...:
Iteration 100, mem: 3.1%
Iteration 200, mem: 3.1%
Iteration 300, mem: 3.1%
Iteration 400, mem: 3.1%
Iteration 500, mem: 3.1%
Iteration 600, mem: 3.1%
Iteration 700, mem: 3.1%
Iteration 800, mem: 3.1%
Iteration 900, mem: 3.1%
Iteration 1000, mem: 3.1%

ErikinBC · 2022-01-09T20:41:31Z

How do install version 1.4? Anaconda has up to version 1.3.5.

mroeschke · 2022-01-09T20:46:21Z

Ah sorry it's an rc version right now. pip install pandas==1.4.0rc0

mroeschke · 2022-03-16T03:41:31Z

Thanks for the issue, but since this looks fixed in the latest version, closing.

ErikinBC added Needs Triage Issue that has not been reviewed by a pandas team member Performance Memory or execution speed performance labels Jan 8, 2022

mroeschke added Needs Info Clarification about behavior needed to assess issue and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 13, 2022

mroeschke closed this as completed Mar 16, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PERF: Memory leak with pandas.Series.str.replace() #45277

PERF: Memory leak with pandas.Series.str.replace() #45277

ErikinBC commented Jan 8, 2022

mroeschke commented Jan 8, 2022

mroeschke commented Jan 9, 2022

ErikinBC commented Jan 9, 2022

mroeschke commented Jan 9, 2022

mroeschke commented Mar 16, 2022

PERF: Memory leak with pandas.Series.str.replace() #45277

PERF: Memory leak with pandas.Series.str.replace() #45277

Comments

ErikinBC commented Jan 8, 2022

Pandas version checks

Reproducible Example

Installed Versions

INSTALLED VERSIONS

Prior Performance

mroeschke commented Jan 8, 2022

mroeschke commented Jan 9, 2022

ErikinBC commented Jan 9, 2022

mroeschke commented Jan 9, 2022

mroeschke commented Mar 16, 2022