BUG: Memory leak in json encoding for time related objects #40443


Closed · Fixed by #50794
Dollars opened this issue Mar 15, 2021 · 9 comments


Dollars commented Mar 15, 2021

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of pandas.
  • (optional) I have confirmed this bug exists on the master branch of pandas.

Hello,

While using pandas in my project, I noticed the memory usage of my process steadily rising. After some digging, it looks like there is a memory leak in the JSON encoding code.

This simple test should be able to reproduce the issue:

import pandas as pd

if __name__ == "__main__":
    df = pd.DataFrame(
        [[1, 2], [1, 2], [3, 4], [3, 4], [3, 6], [5, 6], [7, 8], [7, 8]],
        columns=["a", "b"],
        index=pd.date_range("1/1/2000", periods=8, freq="T"),
    )
    for i in range(10000):
        # round-trip through JSON; memory grows on every iteration
        json_str = df.to_json(orient="table")
        df = pd.read_json(json_str, orient="table")

Running it under Valgrind should show this kind of result:

$ PYTHONMALLOC=malloc valgrind --leak-check=yes --track-origins=yes --log-file=valgrind-log.txt python test.py
...
==214631== 3,358,152 bytes in 79,956 blocks are definitely lost in loss record 15,015 of 15,015
==214631==    at 0x483E77F: malloc (vg_replace_malloc.c:307)
==214631==    by 0x4F811482: int64ToIso (in /lib/python3.7/site-packages/pandas/_libs/json.cpython-37m-x86_64-linux-gnu.so)
==214631==    by 0x4F81364B: encode (in /lib/python3.7/site-packages/pandas/_libs/json.cpython-37m-x86_64-linux-gnu.so)
==214631==    by 0x4F813514: encode (in /lib/python3.7/site-packages/pandas/_libs/json.cpython-37m-x86_64-linux-gnu.so)
==214631==    by 0x4F8135F7: encode (in /lib/python3.7/site-packages/pandas/_libs/json.cpython-37m-x86_64-linux-gnu.so)
==214631==    by 0x4F813514: encode (in /lib/python3.7/site-packages/pandas/_libs/json.cpython-37m-x86_64-linux-gnu.so)
==214631==    by 0x4F813A80: JSON_EncodeObject (in /lib/python3.7/site-packages/pandas/_libs/json.cpython-37m-x86_64-linux-gnu.so)
==214631==    by 0x4F811119: objToJSON (in /lib/python3.7/site-packages/pandas/_libs/json.cpython-37m-x86_64-linux-gnu.so)
==214631==    by 0x4993B18: _PyMethodDef_RawFastCallKeywords (in /usr/lib/libpython3.7m.so.1.0)
==214631==    by 0x4993713: _PyCFunction_FastCallKeywords (in /usr/lib/libpython3.7m.so.1.0)
==214631==    by 0x499364C: ??? (in /usr/lib/libpython3.7m.so.1.0)
==214631==    by 0x498E0CD: _PyEval_EvalFrameDefault (in /usr/lib/libpython3.7m.so.1.0)
...

This points to the int64ToIso() function in this case, but virtually any function assigned to getStringValue() allocates memory, and that memory does not appear to be freed afterwards (unless I'm missing something).

value = enc->getStringValue(obj, &tc, &szlen);

It would be great if someone could confirm my deduction. If I'm right, I will try to submit a PR.
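To make the suspected mismatch concrete, here is a minimal standalone sketch. All names are hypothetical and this is not the actual pandas source; it only reproduces the ownership pattern I believe is at play:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical callback in the style of int64ToIso: heap-allocates its result. */
static char *alloc_iso(long long ts, size_t *out_len) {
    char *buf = malloc(32);
    *out_len = (size_t)snprintf(buf, 32, "ts-%lld", ts);
    return buf; /* somebody has to free() this... */
}

/* Hypothetical callback in the style of PyBytesToUTF8: returns borrowed memory. */
static char *borrow_str(const char *obj, size_t *out_len) {
    *out_len = strlen(obj);
    return (char *)obj; /* ...but this must NOT be freed */
}

int main(void) {
    size_t len;
    /* An encoder that reaches both callbacks through a single
     * getStringValue pointer cannot tell the two cases apart, so it
     * frees neither: every buffer from the first callback leaks. */
    char *a = alloc_iso(42, &len);
    char *b = borrow_str("abc", &len);
    printf("%s %s\n", a, b);
    return 0; /* 'a' is never freed, which is exactly what Valgrind reports */
}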

Thanks.

The environment I used:

pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : 7d32926db8f7541c356066dcadabf854487738de
python           : 3.7.10.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.11.2-1-MANJARO
Version          : #1 SMP PREEMPT Fri Feb 26 12:17:53 UTC 2021
machine          : x86_64
processor        : 
byteorder        : little
LC_ALL           : en_US.UTF-8
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.2.2
numpy            : 1.19.5
Dollars changed the title from "Memory leak in json encoding code" to "[BUG] Memory leak in json encoding code" on Mar 15, 2021
Dollars changed the title from "[BUG] Memory leak in json encoding code" to "BUG: Memory leak in json encoding code" on Mar 15, 2021
Dollars changed the title from "BUG: Memory leak in json encoding code" to "BUG: Memory leak in json encoding for time related objects" on Mar 15, 2021
@jbrockmendel
Member

cc @WillAyd

@WillAyd
Member

WillAyd commented Mar 15, 2021

Wow, thanks for the details - very nice debugging. At an initial glance I think you are correct. We might need to make those datetime functions look similar to Object_getBigNumStringValue, where they set the cStr struct member (which I think is freed when the context ends) rather than just returning the freshly allocated buffer:

const char *Object_getBigNumStringValue(JSOBJ obj, JSONTypeContext *tc,
                                        size_t *_outLen)

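For context, a minimal sketch of that pattern, with hypothetical names rather than the actual pandas source; the assumption is that the encoder calls a teardown hook once it is done with each value:

#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the encoder's per-value type context. */
typedef struct {
    char *cStr; /* owned buffer, released when the context ends */
} TypeContext;

/* The callback parks its allocation on the context instead of handing
 * ownership to a caller that never frees it. */
static char *get_string_value(TypeContext *tc, size_t *out_len) {
    tc->cStr = strdup("2000-01-01T00:00:00.000Z");
    *out_len = strlen(tc->cStr);
    return tc->cStr;
}

/* Teardown hook, invoked once the encoder is done with the value. */
static void end_type_context(TypeContext *tc) {
    free(tc->cStr); /* free(NULL) is a no-op, so this is always safe */
    tc->cStr = NULL;
}

int main(void) {
    TypeContext tc = {0};
    size_t len;
    get_string_value(&tc, &len);
    end_type_context(&tc); /* ownership lives in the context: no leak */
    return 0;
}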
WillAyd added the "IO JSON" (read_json, to_json, json_normalize) label on Mar 15, 2021
@Dollars
Author

Dollars commented Mar 16, 2021

Thanks for the reply.

I think there is still an issue there: the functions tied to the JT_UTF8 type (i.e., the ones invoked through getStringValue) do not all share the same expectations about ownership of the returned string.

As an example, take PyBytesToUTF8, which calls PyBytes_AS_STRING, the macro form of PyBytes_AsString.

pc->PyTypeToUTF8 = PyBytesToUTF8;
tc->type = JT_UTF8;

static char *PyBytesToUTF8(JSOBJ _obj, JSONTypeContext *Py_UNUSED(tc),
                           size_t *_outLen) {
    PyObject *obj = (PyObject *)_obj;
    *_outLen = PyBytes_GET_SIZE(obj);
    return PyBytes_AS_STRING(obj);
}

As specified by the CPython documentation for PyBytes_AsString:

It must not be deallocated

By contrast, the string returned by int64ToIso must be freed by the caller.

Maybe we could use a mustFree flag in the TypeContext struct, or split the type into JT_UTF8_FREE and JT_UTF8_NOFREE, something like that.
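A rough sketch of that flag idea, again with hypothetical names (the cStr approach suggested above achieves the same goal by tying ownership to the context's lifetime instead of a flag):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical context carrying an explicit ownership flag. */
typedef struct {
    int mustFree;
} TypeContext;

/* A callback that allocates advertises that fact on the context. */
static char *get_string_value(const char *src, TypeContext *tc,
                              size_t *out_len) {
    tc->mustFree = 1;   /* this callback mallocs... */
    *out_len = strlen(src);
    return strdup(src); /* ...so the encoder owns the result */
}

int main(void) {
    TypeContext tc = {0};
    size_t len;
    char *value = get_string_value("2000-01-01", &tc, &len);
    printf("%.*s\n", (int)len, value);
    if (tc.mustFree) { /* borrowed strings would leave the flag at 0 */
        free(value);
    }
    return 0;
}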

@Dollars
Author

Dollars commented Mar 16, 2021

Oh sorry, now that I have read the code one more time, I understand why using the cStr struct member would be a good solution. Indeed.

@mikicz

mikicz commented Mar 16, 2021

Hi, I'm going to piggyback on this issue a bit to mention that I'm pretty sure it's not just time-related objects; see my comment on a closed issue from quite a while ago (I wasn't sure whether to create a new ticket): #24889 (comment)

There I used a DataFrame without any time data, just numeric columns, and to_json() was leaking memory as well:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))

for i in range(1000000):
    # leak
    body = df.to_json()

memory-profiler output:

[memory-profiler plot not reproduced here]

@Dollars
Author

Dollars commented Mar 17, 2021

@mikicz: I ran your small test script under Valgrind, and it appears the memory leak comes from numpy.

PYTHONMALLOC=malloc valgrind --leak-check=yes --track-origins=yes --log-file=valgrind-log.txt python test.py
...
==72889== 191,989,728 (159,991,680 direct, 31,998,048 indirect) bytes in 1,999,896 blocks are definitely lost in loss record 14,834 of 14,834
==72889==    at 0x483E77F: malloc (vg_replace_malloc.c:307)
==72889==    by 0x6DF7611: array_alloc (in /usr/lib/python3.7/site-packages/numpy/core/_multiarray_umath.cpython-37m-x86_64-linux-gnu.so)
==72889==    by 0x6E4467C: PyArray_NewFromDescr_int (in  /usr/lib/python3.7/site-packages/numpy/core/_multiarray_umath.cpython-37m-x86_64-linux-gnu.so)
==72889==    by 0x6E3E52F: PyArray_View (in  /usr/lib/python3.7/site-packages/numpy/core/_multiarray_umath.cpython-37m-x86_64-linux-gnu.so)
==72889==    by 0x6EDBF82: array_view (in  /usr/lib/python3.7/site-packages/numpy/core/_multiarray_umath.cpython-37m-x86_64-linux-gnu.so)
==72889==    by 0x4993ABF: _PyMethodDef_RawFastCallKeywords (in /usr/lib/libpython3.7m.so.1.0)
==72889==    by 0x499B37E: _PyMethodDescr_FastCallKeywords (in /usr/lib/libpython3.7m.so.1.0)
==72889==    by 0x49935F4: ??? (in /usr/lib/libpython3.7m.so.1.0)
==72889==    by 0x498D3C6: _PyEval_EvalFrameDefault (in /usr/lib/libpython3.7m.so.1.0)
==72889==    by 0x499A88A: ??? (in /usr/lib/libpython3.7m.so.1.0)
==72889==    by 0x49A18CB: ??? (in /usr/lib/libpython3.7m.so.1.0)
==72889==    by 0x4999A5C: _PyObject_GenericGetAttrWithDict (in /usr/lib/libpython3.7m.so.1.0)
==72889== 
==72889== LEAK SUMMARY:
==72889==    definitely lost: 159,993,568 bytes in 1,999,916 blocks
==72889==    indirectly lost: 31,998,729 bytes in 1,999,887 blocks
...

So it is clear that these two issues do not have the same cause.

@mikicz

mikicz commented Mar 17, 2021

Thanks for doing that! Do you think I should report this to numpy directly, or make a new issue here? I don't really understand pandas internals, so I'm not sure whether the issue is in numpy itself or in how pandas uses it.

lithomas1 added the "Performance" (memory or execution speed) label on Mar 22, 2021
@jmg-duarte
Contributor

I know I'm coming late to the party, but I would like to recommend memray; I would never have been able to detect python-lz4/python-lz4#247 without it.
If no one else is working on this, I can try to pick up where @Dollars left off.
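For anyone who wants to try it on the reproduction scripts above, a typical memray session (commands per memray's documentation; test.py stands in for whichever repro you saved) looks like:

$ python -m memray run -o output.bin test.py
$ python -m memray flamegraph output.bin

The flamegraph command renders an HTML report showing which call paths are holding the allocations.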

@Dollars
Author

Dollars commented Sep 12, 2022

Hello, indeed I have not done much on this issue so far; it is not always easy to find the time 😅. The memray profiler looks very nice.
As far as I'm concerned, you are welcome to push this forward. I can review it if you ask me to.
