Memory leak in df.to_json #24889

Closed
skatenerd opened this issue Jan 23, 2019 · 16 comments · Fixed by #26239
Labels: IO JSON (read_json, to_json, json_normalize) · Performance (memory or execution speed)
Milestone: Contributions Welcome

Comments

@skatenerd

Code Sample, a copy-pastable example if possible

import pandas as pd                                                                                                                              
import numpy as np


df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))

while True:
    body = df.T.to_json()
    print("HI")

Problem description

If we repeatedly call to_json() on a dataframe, memory usage grows continuously:

[screenshot: memory usage plot showing continuous growth over time]

Expected Output

I would expect memory usage to stay constant.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.8.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-43-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: None
pip: 18.1
setuptools: 40.6.3
Cython: 0.25.2
numpy: 1.15.3
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.1.1
sphinx: None
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: 2.7.5 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@TomAugspurger
Contributor

The source is at https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/src/ujson/python/objToJSON.c if you're interested in debugging further.

@chris-b1
Contributor

FWIW this seems to take a ton of iterations and doesn't really leak much memory, but investigations are welcome.
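
To put rough numbers on that, here is a minimal measurement sketch (assuming psutil is installed; the iteration and reporting counts are arbitrary) that prints the process RSS every 100,000 calls:

import psutil  # third-party; pip install psutil
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))
proc = psutil.Process()  # the current process

for i in range(1_000_000):
    df.to_json()
    if i % 100_000 == 0:
        # resident set size in MiB; on affected versions this creeps upward
        print(f"iteration {i}: RSS = {proc.memory_info().rss / 2**20:.1f} MiB")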

@chris-b1 chris-b1 added Performance Memory or execution speed performance IO JSON read_json, to_json, json_normalize labels Jan 24, 2019
@chris-b1 chris-b1 added this to the Contributions Welcome milestone Jan 24, 2019
@TomAugspurger
Contributor

TomAugspurger commented Jan 24, 2019 via email

@skatenerd
Author

@TomAugspurger sorry, that was left over from some other original code. The leak happens with or without the transpose.

@jbrockmendel
Member

Does it also happen with other dtypes? How about an empty DataFrame?

@skatenerd
Author

Memory use is stable for an empty DataFrame.

@skatenerd
Author

skatenerd commented Jan 25, 2019

In simple cases, there's an easy workaround: use df.to_dict() and pass the result to Python's json.dumps. You'll have to make some manual adjustments, such as serializing pandas timestamps yourself.
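
A rough sketch of that workaround (assuming default=str is an acceptable fallback for timestamps; the output formatting will differ slightly from to_json):

import json
import pandas as pd

df = pd.DataFrame({'ts': pd.date_range('2019-01-01', periods=3), 'x': [1, 2, 3]})

# to_dict() sidesteps the leaking C serializer; default=str converts
# pd.Timestamp (and anything else json can't handle natively) via str()
body = json.dumps(df.to_dict(), default=str)
print(body)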

@WillAyd
Member

WillAyd commented Apr 27, 2019

I looked at this in more detail while reviewing the JSON integration. As mentioned above, this issue does not affect empty DataFrames, nor does it affect non-numeric frames.

This leaks:

pd.DataFrame([[1]]).to_json()

But this doesn't:

pd.DataFrame([['a']]).to_json()

I believe I've isolated the issue to one line in objToJSON.c: [embedded source permalink not preserved]

Removing that prevented the example from the OP from leaking. Unfortunately, it caused segfaults elsewhere in the code base, so I have to debug further, but I will push a PR if I can figure it out.

@jbrockmendel you might have some thoughts here as well. You called this condition out as looking wonky before, and it does seem to cause the leak here; I just need to figure out what the actual intent is.
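
For anyone who wants to check the numeric/non-numeric split themselves, a quick comparison sketch (Unix-only, using the stdlib resource module; exact numbers will vary by platform and version):

import resource
import pandas as pd

def rss_growth_mib(df, n=200_000):
    # ru_maxrss is the RSS high-water mark (KiB on Linux), so measure
    # the non-leaking case first to avoid masking its reading
    before = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    for _ in range(n):
        df.to_json()
    after = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return (after - before) / 1024

print('string: ', rss_growth_mib(pd.DataFrame([['a']])))  # roughly flat
print('numeric:', rss_growth_mib(pd.DataFrame([[1]])))    # grows on affected versions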

@LuisLuii

LuisLuii commented Oct 6, 2020

Hi, I am facing the same memory-leak issue with df.to_json().

When I use df.to_dict() and pass the result to Python's json.dumps, memory use is stable:

[screenshot: flat memory usage with the to_dict workaround]

But when I use df.to_json(), memory grows:

[screenshot: climbing memory usage with to_json]

Code Sample

import json

import pandas as pd


def list_to_df_json(data):
    # group point values by point_id, keyed by timestamp
    point_classified = {}
    for i in data:
        if i['point_id'] not in point_classified:
            point_classified[i['point_id']] = {}
        point_classified[i['point_id']][i['timestamp']] = i['point_value']
    return point_classified


def boo(a):
    data = list_to_df_json(a)

    for point_id, point_value_of_that_id in data.items():
        # logging.info(f"pushing data from pointid : {point_id} ")
        df = pd.DataFrame.from_dict(point_value_of_that_id, orient='index', columns=[point_id])

        # workaround: no leak
        # dict_df = df.to_dict(orient='index')
        # json_df = json.dumps(dict_df)

        # memory leak
        json_df = df.to_json(orient='index')
    return json_df


while 1:
    a = [{'point_id': 'a', 'point_value': 346.9, 'timestamp': '2019-12-01 08:15:00'},
         {'point_id': 'a', 'point_value': 247.2, 'timestamp': '2019-12-01 08:30:00'},
         {'point_id': 'a', 'point_value': 237.9, 'timestamp': '2019-12-01 08:45:00'},
         {'point_id': 'a', 'point_value': 215.2, 'timestamp': '2019-12-01 09:00:00'},
         {'point_id': 'b', 'point_value': 276.8, 'timestamp': '2019-12-01 09:15:00'},
         {'point_id': 'b', 'point_value': 296.1, 'timestamp': '2019-12-01 09:30:00'},
         {'point_id': 'b', 'point_value': 328.0, 'timestamp': '2019-12-01 09:45:00'}]

    print(boo(a))
    # pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.10.final.0
python-bits: 64
OS: Darwin
OS-release: 19.0.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: en_GB.UTF-8

pandas: 0.24.2
pytest: 5.4.3
pip: 19.3.1
setuptools: 44.0.0.post20200106
Cython: None
numpy: 1.19.1
scipy: 1.5.2
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.8.1
pytz: 2019.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.3.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: 4.5.0
bs4: 4.8.2
html5lib: None
sqlalchemy: 1.3.18
pymysql: 0.9.3
psycopg2: 2.7.7 (dt dec pq3 ext lo64)
jinja2: 2.11.2
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@ohinds

ohinds commented Nov 17, 2020

I'm still seeing this memory leak in to_json() as of pandas 1.1.4.

A slightly modified version of the original reproduction script:

import pandas as pd
import numpy as np
import memory_utils  # local helper that prints the current process memory usage (stand-in sketch below)

df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))

while True:
    # leak
    body = df.to_json()

    # no leak
    # body = df.to_dict()

    memory_utils.print_memory('to_json leak test')
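
memory_utils above is the commenter's local helper, not a published package; a minimal stand-in, assuming psutil is available, could look like this:

# memory_utils.py -- hypothetical stand-in for the local helper used above
import psutil

def print_memory(tag):
    # print the current process resident set size, labelled with a tag
    rss_mib = psutil.Process().memory_info().rss / 2**20
    print(f'{tag}: RSS = {rss_mib:.1f} MiB')
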
INSTALLED VERSIONS
------------------
commit           : 67a3d4241ab84419856b84fc3ebc9abcbe66c6b3
python           : 3.8.6.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.9.8-arch1-1
Version          : #1 SMP PREEMPT Tue, 10 Nov 2020 22:44:11 +0000
machine          : x86_64
processor        :
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.1.4
numpy            : 1.19.4
pytz             : 2020.4
dateutil         : 2.8.1
pip              : 20.2.4
setuptools       : 50.3.2
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : None
IPython          : None
pandas_datareader: None
bs4              : None
bottleneck       : None
fsspec           : None
fastparquet      : None
gcsfs            : None
matplotlib       : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pytables         : None
pyxlsb           : None
s3fs             : None
scipy            : None
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
numba            : None

@mikicz

mikicz commented Mar 15, 2021

I'm also seeing this, using a modified version of @ohinds's code:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))

for i in range(1000000):
    # leak
    # body = df.to_json()

    # no leak
    body = df.to_dict()

With to_json():

[screenshot: memory usage climbing steadily over iterations]

With to_dict():

[screenshot: flat memory usage]

INSTALLED VERSIONS
------------------
commit           : f2c8480af2f25efdbd803218b9d87980f416563e
python           : 3.9.2.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.10.21-200.fc33.x86_64
Version          : #1 SMP Mon Mar 8 00:24:40 UTC 2021
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_GB.UTF-8
LOCALE           : en_GB.UTF-8

pandas           : 1.2.3
numpy            : 1.20.1
pytz             : 2021.1
dateutil         : 2.8.1
pip              : 20.2.2
setuptools       : 49.1.3
Cython           : 0.29.22
pytest           : 6.2.2
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : 4.6.2
html5lib         : 1.1
pymysql          : None
psycopg2         : 2.8.6 (dt dec pq3 ext lo64)
jinja2           : 2.11.3
IPython          : 7.21.0
pandas_datareader: None
bs4              : None
bottleneck       : None
fsspec           : None
fastparquet      : None
gcsfs            : None
matplotlib       : 3.3.4
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pyxlsb           : None
s3fs             : None
scipy            : 1.6.1
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
numba            : None

@klichukb

Can confirm this as well on pandas 1.2.3 with numeric data: dumping the DataFrame is fine, but dumping a Series of numerics leaks.
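
A minimal sketch of the Series variant being described (watch the process RSS externally, e.g. with top):

import pandas as pd

s = pd.Series([1.0, 2.0, 3.0])

while True:
    # on pandas 1.2.3 this reportedly leaks, while the DataFrame
    # equivalent did not for this commenter
    s.to_json()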

@fcs69

fcs69 commented Nov 4, 2021

Can confirm the same behaviour with Python 3.8.7 / pandas 1.3.4 on Windows 10 (64-bit).

@arturzangiev

Still a problem on macOS, Python 3.8, pandas 1.0.5.

@saeedghadiri

Still a problem on Windows, Python 3.9 / pandas 1.2.3.

@saeedghadiri

Upgrading to pandas 1.4.3 fixes the problem.
