pd.read_json() leads to a segmentation fault #32383

Closed
gambingo opened this issue Mar 1, 2020 · 7 comments · Fixed by #33345
Labels
good first issue Needs Tests Unit test(s) needed to prevent regressions
Comments

@gambingo

gambingo commented Mar 1, 2020

Code Sample

The following code leads to either a segmentation fault or a bus error:

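import pandas as pd

# `db` is the MongoDB database handle; its setup is not shown in the report.
def load_dataframe(name):
    ...
    df_json = db.dataframes.find_one({"_id": name})["df"]
    df = pd.read_json(df_json)  # crashes here with a segfault or bus error
    return df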

But the following workaround fixes it:

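import json
import pandas as pd

# `db` is the MongoDB database handle; its setup is not shown in the report.
def load_dataframe(name):
    ...
    df_json = db.dataframes.find_one({"_id": name})["df"]
    # Parse with the standard-library json module instead of pd.read_json
    df = pd.DataFrame.from_dict(json.loads(df_json))
    return df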

Problem description

I am storing several jsonified dataframes in MongoDB. Trying to go straight from the JSON to the dataframe leads to either a segmentation fault or a bus error. Confusingly, it only happens with one of the stored dataframes and not the others. The problem dataframe is identical in nature to the other dataframes (same column names and data types). Unfortunately, it's not shareable, but I'm happy to answer any questions I can.

Output of pd.show_versions()

Note: I am using the latest pandas, 1.0.1. I downgraded to pandas 0.25.3 and had the same issue.

INSTALLED VERSIONS
------------------
commit : None
python : 3.7.0.final.0
python-bits : 64
OS : Darwin
OS-release : 17.7.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.1
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.1.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : 7.11.1
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.3
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None

@gambingo gambingo changed the title pd.read_json() create segementation fault pd.read_json() creates segementation fault Mar 1, 2020
@gambingo gambingo changed the title pd.read_json() creates segementation fault pd.read_json() leads to a segmentation fault Mar 1, 2020
@gambingo gambingo changed the title pd.read_json() leads to a segmentation fault pd.read_json() leads to a segmentation fault Mar 1, 2020
@jbrockmendel jbrockmendel added Segfault Non-Recoverable Error IO JSON read_json, to_json, json_normalize labels Mar 1, 2020
@jbrockmendel
Member

Can you post a copy/paste-able example (https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports)?

@mroeschke mroeschke added the Needs Info Clarification about behavior needed to assess issue label Mar 1, 2020
@gambingo
Author

gambingo commented Mar 2, 2020

Yes I can! Apologies for not doing so sooner. Thanks for the helpful MCVE post.

This leads to a segmentation fault on my machine:

import pandas as pd

df = pd.DataFrame(
        {'_id': {'0': '0'},
        'category': {'0': 'Goods'},
        'recommender_id': {'0': '3'},
        'recommender_name_jp': {'0': '浦田'},
        'recommender_name_en': {'0': 'Urata'},
        'name_jp': {'0': '博多人形(松尾吉将まつお よしまさ)'},
        'name_en': {'0': 'Hakata Dolls Matsuo'}
        })

df_json = df.to_json()
pd.read_json(df_json)

But this does not:

import json
import pandas as pd

df = pd.DataFrame(
        {'_id': {'0': '0'},
        'category': {'0': 'Goods'},
        'recommender_id': {'0': '3'},
        'recommender_name_jp': {'0': '浦田'},
        'recommender_name_en': {'0': 'Urata'},
        'name_jp': {'0': '博多人形(松尾吉将まつお よしまさ)'},
        'name_en': {'0': 'Hakata Dolls Matsuo'}
        })

df_json = df.to_json()
df_dict = json.loads(df_json)
pd.DataFrame.from_dict(df_dict)

And, unfortunately, there is no helpful traceback. All I get is a Segmentation fault: 11 or sometimes a Bus error: 10.
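Since the crash leaves no Python traceback, one option (not something tried in this thread) is the standard-library faulthandler module, which dumps the Python stack when the interpreter receives a fatal signal such as SIGSEGV or SIGBUS. A minimal sketch follows; the helper name `call_with_crash_log` and the default log file name are hypothetical, chosen only for illustration.

```python
import faulthandler


def call_with_crash_log(func, *args, log_path="crash_traceback.log", **kwargs):
    """Call func(*args, **kwargs) with faulthandler writing to log_path.

    Hypothetical helper: if the call dies with a fatal signal, the Python
    stack at the time of the crash is written to log_path before the
    process exits.
    """
    # The file must stay open for as long as faulthandler is enabled.
    with open(log_path, "w") as log_file:
        faulthandler.enable(file=log_file, all_threads=True)
        try:
            return func(*args, **kwargs)
        finally:
            # Note: this also turns off any faulthandler setup that was
            # active before the call.
            faulthandler.disable()
```

With the example above, this would be used as `call_with_crash_log(pd.read_json, df_json)`. Running the script with `python -X faulthandler script.py` enables the same handler without code changes and writes to stderr instead.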

@gambingo gambingo closed this as completed Mar 2, 2020
@gambingo gambingo reopened this Mar 2, 2020
@jbrockmendel
Member

Hmm, I can't reproduce this locally, so let's try to narrow down the example. Do you need all of the entries in that dict to get the segfault, or can you remove some? (I'd speculate that the ASCII entries can be removed.)
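One way to narrow the example down without killing the main interpreter on every attempt is to run each candidate snippet in a child process and inspect how it exited. This is a rough sketch rather than anything used in this thread; `killed_by_signal` is a hypothetical helper name, and the negative-return-code convention it relies on applies to POSIX systems such as the reporter's macOS machine.

```python
import subprocess
import sys


def killed_by_signal(code, timeout=60):
    """Run `code` in a fresh Python process.

    Returns the number of the signal that terminated the child, or None
    if the child exited on its own (successfully or with an error).
    """
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        timeout=timeout,
    )
    # On POSIX, subprocess reports death by signal N as return code -N.
    if proc.returncode < 0:
        return -proc.returncode
    return None
```

Each candidate (for instance, the script above with one dict entry removed) can be passed in as a string; a non-None result means that variant still crashes, so the removed entry was not needed to trigger the fault.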

@TomAugspurger
Contributor

I also can't reproduce using the code in #32383 (comment) (using pandas master).

@TomAugspurger
Contributor

@gambingo can you try on pandas master?

@mroeschke
Member

I can't reproduce this either on master with OSX. Something might have been fixed on master in the meantime. I suppose it could use a test.
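A regression test along these lines could round-trip the reported frame through to_json and read_json and check that the parsed values match what the standard-library json module produces. The sketch below is an illustration only and is not the test that was eventually added to pandas; the function name `roundtrip_reported_frame` is hypothetical. It passes dtype=False and convert_axes=False so that values and index labels stay strings, and wraps the JSON in io.StringIO because newer pandas releases deprecate passing a literal JSON string to read_json.

```python
import io
import json

import pandas as pd


def roundtrip_reported_frame():
    """Round-trip the frame from the report; return (parsed, expected) dicts."""
    df = pd.DataFrame(
        {
            "_id": {"0": "0"},
            "category": {"0": "Goods"},
            "recommender_id": {"0": "3"},
            "recommender_name_jp": {"0": "浦田"},
            "recommender_name_en": {"0": "Urata"},
            "name_jp": {"0": "博多人形(松尾吉将まつお よしまさ)"},
            "name_en": {"0": "Hakata Dolls Matsuo"},
        }
    )
    df_json = df.to_json()
    # This is the call that crashed for the reporter.
    result = pd.read_json(io.StringIO(df_json), dtype=False, convert_axes=False)
    # Reference parse using the standard-library parser.
    expected = json.loads(df_json)
    return result.to_dict(), expected
```

In a test suite the two returned dicts would simply be compared for equality; the test passing at all also shows that the read_json call did not crash the interpreter.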

@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed IO JSON read_json, to_json, json_normalize Needs Info Clarification about behavior needed to assess issue Segfault Non-Recoverable Error labels Apr 4, 2020
@BenjaminLiuPenrose
Contributor

I cannot reproduce this on pandas master, v1.0.1, or v0.25.3 in a Windows environment.

6 participants