BUG: pandas 2.2 read_csv(engine="c") leaks memory when code uses np.nan #57039
Comments
reproducible on main (on wsl2)
Result of a git bisect:
cc @WillAyd
Thanks @rhshadrach that is very helpful. Will take a look
Looks like LSAN would have caught this if we had it set up. Putting your mre into
You can see the rather drastic issue with the following steps:
$ pip install -ve . --no-build-isolation --config-settings=builddir="asan" --config-settings=setup-args="-Db_sanitize=address" --config-settings=setup-args="-Dbuildtype=debug"
$ LSAN_OPTIONS=suppressions=<full_path_to_suppressions>/lsan_suppress.txt LD_PRELOAD=$(gcc -print-file-name=libasan.so) python ascript.py
=================================================================
==371675==ERROR: LeakSanitizer: detected memory leaks
Direct leak of 16000 byte(s) in 1000 object(s) allocated from:
#0 0x7ea581a94dd2 in __interceptor_realloc ../../../../libsanitizer/asan/asan_malloc_linux.cpp:85
#1 0x7ea55a2f508b in parser_trim_buffers ../pandas/_libs/src/parser/tokenizer.c:1207
#2 0x7ea54eae5640 in __pyx_pf_6pandas_5_libs_7parsers_10TextReader_12read_low_memory pandas/_libs/parsers.cpython-310-x86_64-linux-gnu.so.p/pandas/_libs/parsers.pyx.c:29036
#3 0x7ea54eae2a9f in __pyx_pw_6pandas_5_libs_7parsers_10TextReader_13read_low_memory pandas/_libs/parsers.cpython-310-x86_64-linux-gnu.so.p/pandas/_libs/parsers.pyx.c:28564
#4 0x7ea54eba11ae in __Pyx_CyFunction_Vectorcall_FASTCALL_KEYWORDS pandas/_libs/parsers.cpython-310-x86_64-linux-gnu.so.p/pandas/_libs/parsers.pyx.c:57994
#5 0x64995d7a6a3b in _PyObject_VectorcallTstate /usr/local/src/conda/python-3.10.13/Include/cpython/abstract.h:114
#6 0x64995d7a6a3b in PyObject_Vectorcall /usr/local/src/conda/python-3.10.13/Include/cpython/abstract.h:123
#7 0x64995d7a6a3b in call_function /usr/local/src/conda/python-3.10.13/Python/ceval.c:5893
#8 0x64995d7a6a3b in _PyEval_EvalFrameDefault /usr/local/src/conda/python-3.10.13/Python/ceval.c:4198
Direct leak of 16000 byte(s) in 1000 object(s) allocated from:
#0 0x7ea581a94dd2 in __interceptor_realloc ../../../../libsanitizer/asan/asan_malloc_linux.cpp:85
#1 0x7ea55a2f55f6 in parser_trim_buffers ../pandas/_libs/src/parser/tokenizer.c:1258
#2 0x7ea54eae5640 in __pyx_pf_6pandas_5_libs_7parsers_10TextReader_12read_low_memory pandas/_libs/parsers.cpython-310-x86_64-linux-gnu.so.p/pandas/_libs/parsers.pyx.c:29036
#3 0x7ea54eae2a9f in __pyx_pw_6pandas_5_libs_7parsers_10TextReader_13read_low_memory pandas/_libs/parsers.cpython-310-x86_64-linux-gnu.so.p/pandas/_libs/parsers.pyx.c:28564
#4 0x7ea54eba11ae in __Pyx_CyFunction_Vectorcall_FASTCALL_KEYWORDS pandas/_libs/parsers.cpython-310-x86_64-linux-gnu.so.p/pandas/_libs/parsers.pyx.c:57994
#5 0x64995d7a6a3b in _PyObject_VectorcallTstate /usr/local/src/conda/python-3.10.13/Include/cpython/abstract.h:114
#6 0x64995d7a6a3b in PyObject_Vectorcall /usr/local/src/conda/python-3.10.13/Include/cpython/abstract.h:123
#7 0x64995d7a6a3b in call_function /usr/local/src/conda/python-3.10.13/Python/ceval.c:5893
#8 0x64995d7a6a3b in _PyEval_EvalFrameDefault /usr/local/src/conda/python-3.10.13/Python/ceval.c:4198
Direct leak of 16000 byte(s) in 1000 object(s) allocated from:
#0 0x7ea581a94dd2 in __interceptor_realloc ../../../../libsanitizer/asan/asan_malloc_linux.cpp:85
#1 0x7ea55a2f511d in parser_trim_buffers ../pandas/_libs/src/parser/tokenizer.c:1211
#2 0x7ea54eae5640 in __pyx_pf_6pandas_5_libs_7parsers_10TextReader_12read_low_memory pandas/_libs/parsers.cpython-310-x86_64-linux-gnu.so.p/pandas/_libs/parsers.pyx.c:29036
#3 0x7ea54eae2a9f in __pyx_pw_6pandas_5_libs_7parsers_10TextReader_13read_low_memory pandas/_libs/parsers.cpython-310-x86_64-linux-gnu.so.p/pandas/_libs/parsers.pyx.c:28564
#4 0x7ea54eba11ae in __Pyx_CyFunction_Vectorcall_FASTCALL_KEYWORDS pandas/_libs/parsers.cpython-310-x86_64-linux-gnu.so.p/pandas/_libs/parsers.pyx.c:57994
#5 0x64995d7a6a3b in _PyObject_VectorcallTstate /usr/local/src/conda/python-3.10.13/Include/cpython/abstract.h:114
#6 0x64995d7a6a3b in PyObject_Vectorcall /usr/local/src/conda/python-3.10.13/Include/cpython/abstract.h:123
#7 0x64995d7a6a3b in call_function /usr/local/src/conda/python-3.10.13/Python/ceval.c:5893
#8 0x64995d7a6a3b in _PyEval_EvalFrameDefault /usr/local/src/conda/python-3.10.13/Python/ceval.c:4198
Direct leak of 16000 byte(s) in 1000 object(s) allocated from:
#0 0x7ea581a94dd2 in __interceptor_realloc ../../../../libsanitizer/asan/asan_malloc_linux.cpp:85
#1 0x7ea55a2f555b in parser_trim_buffers ../pandas/_libs/src/parser/tokenizer.c:1252
#2 0x7ea54eae5640 in __pyx_pf_6pandas_5_libs_7parsers_10TextReader_12read_low_memory pandas/_libs/parsers.cpython-310-x86_64-linux-gnu.so.p/pandas/_libs/parsers.pyx.c:29036
#3 0x7ea54eae2a9f in __pyx_pw_6pandas_5_libs_7parsers_10TextReader_13read_low_memory pandas/_libs/parsers.cpython-310-x86_64-linux-gnu.so.p/pandas/_libs/parsers.pyx.c:28564
#4 0x7ea54eba11ae in __Pyx_CyFunction_Vectorcall_FASTCALL_KEYWORDS pandas/_libs/parsers.cpython-310-x86_64-linux-gnu.so.p/pandas/_libs/parsers.pyx.c:57994
#5 0x64995d7a6a3b in _PyObject_VectorcallTstate /usr/local/src/conda/python-3.10.13/Include/cpython/abstract.h:114
#6 0x64995d7a6a3b in PyObject_Vectorcall /usr/local/src/conda/python-3.10.13/Include/cpython/abstract.h:123
#7 0x64995d7a6a3b in call_function /usr/local/src/conda/python-3.10.13/Python/ceval.c:5893
#8 0x64995d7a6a3b in _PyEval_EvalFrameDefault /usr/local/src/conda/python-3.10.13/Python/ceval.c:4198
Indirect leak of 129000 byte(s) in 1000 object(s) allocated from:
#0 0x7ea581a94dd2 in __interceptor_realloc ../../../../libsanitizer/asan/asan_malloc_linux.cpp:85
#1 0x7ea55a2f5254 in parser_trim_buffers ../pandas/_libs/src/parser/tokenizer.c:1226
#2 0x7ea54eae5640 in __pyx_pf_6pandas_5_libs_7parsers_10TextReader_12read_low_memory pandas/_libs/parsers.cpython-310-x86_64-linux-gnu.so.p/pandas/_libs/parsers.pyx.c:29036
#3 0x7ea54eae2a9f in __pyx_pw_6pandas_5_libs_7parsers_10TextReader_13read_low_memory pandas/_libs/parsers.cpython-310-x86_64-linux-gnu.so.p/pandas/_libs/parsers.pyx.c:28564
#4 0x7ea54eba11ae in __Pyx_CyFunction_Vectorcall_FASTCALL_KEYWORDS pandas/_libs/parsers.cpython-310-x86_64-linux-gnu.so.p/pandas/_libs/parsers.pyx.c:57994
#5 0x64995d7a6a3b in _PyObject_VectorcallTstate /usr/local/src/conda/python-3.10.13/Include/cpython/abstract.h:114
#6 0x64995d7a6a3b in PyObject_Vectorcall /usr/local/src/conda/python-3.10.13/Include/cpython/abstract.h:123
#7 0x64995d7a6a3b in call_function /usr/local/src/conda/python-3.10.13/Python/ceval.c:5893
#8 0x64995d7a6a3b in _PyEval_EvalFrameDefault /usr/local/src/conda/python-3.10.13/Python/ceval.c:4198
-----------------------------------------------------
Suppressions used:
count      bytes  template
 3040    4385516  obmalloc.c
  214     642602  unicodeobject.c
  370     171796  bytesobject.c
    4       3224  gcmodule.c
    4        128  PyThread
  943     116014  arrow
   60       1055  numpy
-----------------------------------------------------
SUMMARY: AddressSanitizer: 193000 byte(s) leaked in 5000 allocation(s).
Yikes! If you run the steps above with the fix in #57084, you get a clean result.
Would be great if we could figure out how to get #54865 working one of these days; that would ideally catch this in CI.
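The SUMMARY line of the LeakSanitizer report above can be cross-checked against the per-leak headers. The following is a minimal sketch, not part of the thread: `tally_leaks` and `LEAK_HEADER` are hypothetical names, and the input is just the five "Direct leak" / "Indirect leak" header lines copied from the report.

```python
import re

# Matches LeakSanitizer block headers such as
# "Direct leak of 16000 byte(s) in 1000 object(s) allocated from:"
LEAK_HEADER = re.compile(
    r"^(Direct|Indirect) leak of (\d+) byte\(s\) in (\d+) object\(s\)"
)


def tally_leaks(report):
    """Sum (bytes, objects) per leak kind over all headers in a report."""
    totals = {"Direct": (0, 0), "Indirect": (0, 0)}
    for line in report.splitlines():
        match = LEAK_HEADER.match(line.strip())
        if match is None:
            continue  # stack frames and other lines are ignored
        kind = match.group(1)
        leaked_bytes, objects = totals[kind]
        totals[kind] = (
            leaked_bytes + int(match.group(2)),
            objects + int(match.group(3)),
        )
    return totals


# Header lines copied from the report; stack frames omitted.
REPORT_HEADERS = """\
Direct leak of 16000 byte(s) in 1000 object(s) allocated from:
Direct leak of 16000 byte(s) in 1000 object(s) allocated from:
Direct leak of 16000 byte(s) in 1000 object(s) allocated from:
Direct leak of 16000 byte(s) in 1000 object(s) allocated from:
Indirect leak of 129000 byte(s) in 1000 object(s) allocated from:
"""

totals = tally_leaks(REPORT_HEADERS)
total_bytes = sum(b for b, _ in totals.values())
total_objects = sum(n for _, n in totals.values())
print(totals)  # {'Direct': (64000, 4000), 'Indirect': (129000, 1000)}
print(total_bytes, total_objects)  # 193000 5000
```

The totals agree with the summary: four direct leaks of 16000 bytes plus one indirect leak of 129000 bytes give 193000 bytes in 5000 allocations. Every leak block has parser_trim_buffers as its first frame after realloc, and each block reports 1000 objects.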
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
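The reporter's script is not shown here, and the memory figures quoted in the description refer to that script. As a rough sketch only, growth in peak RSS across many repeated calls could be measured along these lines. `peak_rss_mib` and `peak_rss_growth` are hypothetical helper names, the stdlib csv reader stands in for the real workload, and the `resource` module is Unix-only.

```python
import csv
import io
import resource
import sys


def peak_rss_mib():
    """Peak resident set size of this process, in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def peak_rss_growth(fn, calls):
    """Call fn() `calls` times and return the increase in peak RSS (MiB)."""
    before = peak_rss_mib()
    for _ in range(calls):
        fn()
    return peak_rss_mib() - before


# Stand-in workload: parse a tiny CSV with the stdlib reader.
# With pandas installed, the callable would instead be something like
#   lambda: pd.read_csv(io.StringIO(DATA), engine="c")
DATA = "a,b\n1,2\n3,4\n"
growth = peak_rss_growth(lambda: list(csv.reader(io.StringIO(DATA))), 1000)
print(f"peak RSS grew by {growth:.1f} MiB over 1000 calls")
```

Peak RSS never decreases, so a leak-free workload should level off after a warm-up while a leaking one keeps growing roughly in proportion to the number of calls.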
Issue Description
I upgraded my env to pandas 2.2 and my previously working script that does thousands of read_csv calls failed with OOM. I trimmed everything down and, to my amazement, found that the memory leak is triggered by the code's use of np.nan. The above will use 3.2 GB of RSS with np.nan used, 160 MB when not. I can't believe this is even possible :) After switching to engine="python" there is no leak.
I can reproduce this on a clean conda-forge install (mamba create -n repro python=3.11 numpy pandas) on linux64.
Expected Behavior
No leak, or only small/manageable memory leaks, with each read_csv call.
Installed Versions
/opt/miniconda3/envs/prod/lib/python3.11/site-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils.
warnings.warn("Setuptools is replacing distutils.")
INSTALLED VERSIONS
commit : f538741
python : 3.11.7.final.0
python-bits : 64
OS : Linux
OS-release : 5.14.0-410.el9.x86_64
Version : #1 SMP PREEMPT_DYNAMIC Thu Jan 18 20:27:59 UTC 2024
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.2.0
numpy : 1.26.3
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 69.0.3
pip : 23.3.2
Cython : 3.0.7
pytest : 7.4.4
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 3.1.9
lxml.etree : 5.1.0
html5lib : None
pymysql : None
psycopg2 : 2.9.9
jinja2 : 3.1.3
IPython : 8.20.0
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2023.12.2
gcsfs : None
matplotlib : 3.8.2
numba : 0.58.1
numexpr : None
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : 14.0.2
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : 2023.12.2
scipy : 1.12.0
sqlalchemy : 2.0.25
tables : None
tabulate : None
xarray : 2024.1.0
xlrd : 2.0.1
zstandard : None
tzdata : 2023.4
qtpy : 2.4.1
pyqt5 : None