Memory leak with .rolling().max() in pandas 0.24.2 #25893
Comments
Can you profile and try to isolate the issue?
```python
import os
import psutil
import numpy as np
import pandas as pd

process = psutil.Process(os.getpid())
i = 0

def sink():
    global i
    i += 1
    if i % 100 == 0:
        mem = process.memory_info().rss / 1e6
        print("mem %fMB" % mem)

while True:
    pd.Series(data=np.random.rand(5000)).rolling(4000).max()
    sink()
```

I was wrong, it has nothing to do with the dataset I provided. This is a pretty huge bug in pandas imo.
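A dependency-free variant of the same measurement idea, using only the standard library's `resource` module (Unix-only; the helper names here are mine, not from the thread):

```python
# Sketch of a stdlib-only leak check (assumes Linux, where ru_maxrss is
# reported in kilobytes). The psutil-based rss reading above is more
# precise; peak RSS only ever grows, so this catches steady leaks but
# not memory that is later returned to the OS.
import resource

def peak_rss_mb():
    """Peak resident set size of this process, in megabytes."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

def memory_growth_mb(fn, iterations=100):
    """Call fn() repeatedly and return the growth in peak RSS (MB)."""
    before = peak_rss_mb()
    for _ in range(iterations):
        fn()
    return peak_rss_mb() - before
```

With the repro above, `memory_growth_mb(lambda: pd.Series(np.random.rand(5000)).rolling(4000).max())` would be expected to climb on each call under 0.24.2 but stay near zero under 0.23.4.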
Thanks for the code sample. Investigation and PRs are always welcome!
The bug does not occur with rolling().mean(), but it does with rolling().min(). |
It does look like the min/max implementation is the only window func with calls to malloc/free: pandas/_libs/window.pyx, line 1372 (at commit ac318d2).
Per the Cython docs, it may be preferable to use the C-API functions for better memory management and reporting back to the Python layer: https://cython.readthedocs.io/en/latest/src/tutorial/memory_allocation.html#memory-allocation We'd take a PR trying that, or other ideas, for sure.
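A rough sketch of what that switch could look like (buffer name and size are illustrative, not the actual variables in window.pyx):

```cython
# Hypothetical sketch only: swapping libc malloc/free for the CPython
# allocator suggested in the Cython docs. PyMem_Malloc reports its
# allocations back to Python's memory machinery; note it must be called
# while holding the GIL (PyMem_RawMalloc/PyMem_RawFree would be needed
# inside `with nogil` blocks).
from cpython.mem cimport PyMem_Malloc, PyMem_Free

cdef float64_t* buf = <float64_t*>PyMem_Malloc(win * sizeof(float64_t))
if buf is NULL:
    raise MemoryError()
try:
    pass  # ... window min/max logic using buf ...
finally:
    PyMem_Free(buf)  # paired release on every exit path
```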
It may also be that the GIL is released when free is called in the current implementation.
After some testing, it doesn't seem to have anything to do with the GIL or with using the C-API vs. direct malloc/free calls. It looks like this bug was introduced when the variable/fixed window paths were split into separate functions: for some reason, passing some of the cdef variables into those separate functions causes the leak. A simple fix is to just combine the functions into one large function again, but it may be better to work out an alternative that doesn't cause the leak, like using pointers or perhaps a memory view. I can take a shot at a proper fix, but it may be a couple of days until I have a PR.
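For the memory-view alternative mentioned above, a rough sketch (the function and variable names here are hypothetical, not the actual window.pyx signatures) could look like:

```cython
# Hypothetical sketch: let NumPy own the buffers and pass typed
# memoryviews between the fixed/variable helpers, so no manually
# malloc'd pointer ever crosses a function boundary.
import numpy as np
cimport numpy as cnp

cdef void _roll_max_fixed(cnp.float64_t[:] values,
                          cnp.float64_t[:] output,
                          Py_ssize_t win) nogil:
    pass  # ... window logic reads `values` and writes into `output` ...

def roll_max(cnp.float64_t[:] values, Py_ssize_t win):
    output = np.empty(values.shape[0], dtype=np.float64)
    _roll_max_fixed(values, output, win)
    return output
```

Memoryview lifetimes are managed by the objects that back them, so this sidesteps manual free entirely.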
Sounds good, thanks for investigating, @ArtificialQualia!
Code Sample, a copy-pastable example if possible
Problem description
Memory leak which shuts down my application. This occurs in pandas 0.24.2 but not in pandas 0.23.4. My 16 GB of memory fills up within a few hours of running this code. file.csv is attached in the zip below; the memory leak might only occur on certain data.
file.csv.zip
Expected Output
No memory leaks.
Output of pd.show_versions()
pandas: 0.24.2
pytest: None
pip: 19.0.3
setuptools: 38.4.0
Cython: None
numpy: 1.14.5
scipy: None
pyarrow: None
xarray: None
IPython: 7.1.1
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: 4.7.1
html5lib: None
sqlalchemy: 1.2.14
pymysql: None
psycopg2: 2.7.6.1 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None