
BUG: Segfault when calling df.to_json(orient='records') and numpy.datetime64 being serialized #58160


Closed
2 of 3 tasks
paddymul opened this issue Apr 5, 2024 · 5 comments
Labels: Bug, Needs Triage (Issue that has not been reviewed by a pandas team member)

Comments

@paddymul
Contributor

paddymul commented Apr 5, 2024

### Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

### Reproducible Example

Reproducing this bug has been very difficult. I have a test script that runs against my library repo and triggers the failure, but when I try to make a reduced test case it doesn't fail. Given that I'm filing this against an old version of pandas, I'll just describe the bug to start. I'm happy to work with you to produce a reduced test case.

This is the code that should cause a segfault:

```python
import pandas as pd

ts = pd.Timestamp('2020-02-21T00:00:00.000000000')
np_dt = pd.Series([ts]).mode().values[0]
sdf = pd.DataFrame({'a': [ts], 'b': [np_dt]}, index=['mode'])
sdf.to_json(orient='table', indent=2, default_handler=str)
```


My actual code in the library (with print instrumentation)
```python
def pd_to_obj(df:pd.DataFrame):
    print("--------------")
    print("here ser 86")

    if len(df) > 4:
        df = df[20:21]
        print(df)
        val = df.iloc[0][0]
        print("val", val)
        print("type(val)", type(val))

    print("df.index", df.index)
    print("len df", len(df))
    serialized = df.to_json(orient='table')
    print("here ser 88")
```

This fails at the `df.to_json` call. Here is the debugging output:

```
--------------
here ser 86
                          EventDate
mode  2020-02-21T00:00:00.000000000
val 2020-02-21T00:00:00.000000000
type(val) <class 'numpy.datetime64'>
df.index Index(['mode'], dtype='object')
len df 1
zsh: segmentation fault  python crash_buckaroo.py
```
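For what it's worth, the crash produces no Python traceback. A small debugging aid (just a sketch using the standard-library `faulthandler` module, not part of the repro) can at least dump the Python stack when the fatal signal arrives:

```python
import faulthandler

# Dump the Python traceback if the interpreter receives a fatal signal
# (e.g. SIGSEGV), so the crashing call site is visible before the process dies.
faulthandler.enable()

import pandas as pd

# ... build the DataFrame exactly as above, then call the serializer:
# df.to_json(orient='table')
```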


### Issue Description

Pandas 2.0.3 segfaults when executing `to_json(orient='records')`

### Expected Behavior

There shouldn't be a segfault

### Installed Versions

<details>

INSTALLED VERSIONS
------------------
commit           : 0f437949513225922d851e9581723d82120684a6
python           : 3.10.14.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 22.6.0
Version          : Darwin Kernel Version 22.6.0: Wed Oct  4 21:25:40 PDT 2023; root:xnu-8796.141.3.701.17~4/RELEASE_ARM64_T8103
machine          : arm64
processor        : arm
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 2.0.3
numpy            : 1.26.4
pytz             : 2024.1
dateutil         : 2.9.0.post0
setuptools       : 68.2.2
pip              : 23.3.1
Cython           : None
pytest           : 8.1.1
hypothesis       : None
sphinx           : 7.2.6
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 3.1.3
IPython          : 8.23.0
pandas_datareader: None
bs4              : 4.12.3
bottleneck       : None
brotli           : None
fastparquet      : None
fsspec           : None
gcsfs            : None
matplotlib       : 3.8.4
numba            : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pyreadstat       : None
pyxlsb           : None
s3fs             : None
scipy            : 1.13.0
snappy           : None
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
zstandard        : None
tzdata           : 2024.1
qtpy             : None
pyqt5            : None

</details>
@aabhinavg

aabhinavg commented Apr 8, 2024

System information:

```
OS: macOS-13.5-arm64-arm-64bit
Architecture: ('64bit', '')
Machine: arm64
Processor: arm
Byte order: little

Locale information:
LC_ALL: (None, 'UTF-8')
LANG: (None, 'UTF-8')

Package versions:
pandas: 3.0.0.dev0+680.gb8a4691647
numpy: 1.24.2
pytz: 2023.3
dateutil: 2.8.2
setuptools: 58.0.4
pip: 24.0
pytest: 8.1.1
IPython: 8.18.1
matplotlib: 3.6.3
scipy: 1.10.1
```

Test case I used:

```python
import pandas as pd

# Create a Pandas DataFrame
ts = pd.Timestamp('2020-02-21T00:00:00.000000000')
np_dt = pd.Series([ts]).mode().values[0]
sdf = pd.DataFrame({'a': [ts], 'b': [np_dt]}, index=['mode'])

# Print the DataFrame
print("DataFrame:")
print(sdf)

# Save the DataFrame as a JSON file
json_filename = "data.json"
sdf.to_json(json_filename, orient='table', indent=2, default_handler=str)
print(f"DataFrame saved as JSON file: {json_filename}")
```

Output of `python3 58160.py`:

```
DataFrame:
              a          b
mode 2020-02-21 2020-02-21
DataFrame saved as JSON file: data.json
```

data.json content:

```json
{
  "schema": {
    "fields": [
      { "name": "index", "type": "string" },
      { "name": "a", "type": "datetime" },
      { "name": "b", "type": "datetime" }
    ],
    "primaryKey": ["index"],
    "pandas_version": "1.4.0"
  },
  "data": [
    {
      "index": "mode",
      "a": "2020-02-21T00:00:00.000",
      "b": "2020-02-21T00:00:00.000"
    }
  ]
}
```
I am not able to see any problem with the above-mentioned configuration. Can you try it?

@paddymul
Contributor Author

paddymul commented Apr 9, 2024

As I said in the bug report, the approximate reproduction code doesn't cause a segfault. I also included the code from my application next to where the segfault happens, along with debugging print statements. I think I'm constructing a DataFrame with the same values as the failing one, but the failing DataFrame is the result of a series of summary functions being applied to a DataFrame, with all of the summary values assembled into a single DataFrame. The memory layout of the failing DataFrame is probably different from the value-equivalent one. I don't know how to succinctly reproduce the failing DataFrame.

I'm happy to dig into this problem with whoever is interested. You might get an interesting test case out of it.
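For context, the failing DataFrame is built roughly like the sketch below: summary-function results are stacked into a single object-dtype frame, so one column ends up holding ints, a nested `value_counts` Series, and a raw `numpy.datetime64`. This is illustrative only (the column name and the exact set of summary functions are made up), and it may well not trigger the crash on its own, which is exactly the reduction problem:

```python
import pandas as pd

# Illustrative sketch, not the exact library code.
dt_series = pd.Series(pd.to_datetime(['2020-02-21']), name='EventDate')

summary_col = pd.Series(
    {
        'length': len(dt_series),                  # int
        'nan_count': int(dt_series.isna().sum()),  # int
        'value_counts': dt_series.value_counts(),  # an entire Series stored in one cell
        'mode': dt_series.mode().values[0],        # numpy.datetime64
    },
    dtype=object,
)
summary = summary_col.to_frame(name='EventDate')

print(summary.dtypes)            # object
summary.to_json(orient='table')  # the call that segfaults in my app on 2.0.x
```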

@paddymul
Contributor Author

I created a pickle file with a simplified dataframe that can reproduce the segmentation fault: simplified_df.pckl

The `value_counts` row is the problem. The value for `EventDate` is the result of calling `datetime_series.value_counts()`, so I think it's a dataframe.

Here is the test script

```python
import pandas as pd

simplified_df = pd.read_pickle('simplified_df.pckl')
pd.show_versions()
print("simplified_df")
print(simplified_df)
print("-" * 80)
print("simplified_df.dtypes")
print(simplified_df.dtypes)
print("-" * 80)
vc_bad_row = simplified_df['EventDate'].loc['value_counts']
print("vc_bad_row")
print(vc_bad_row)
print("-" * 80)
print("type(vc_bad_row)")
print(type(vc_bad_row))
print("-" * 80)
print("vc_bad_row.dtype")
print(vc_bad_row.dtype)
print("-" * 80)
print("before to_json")
simplified_df.to_json(orient='table')
print("after  to_json")
```

which outputs the following on my machine

```
python crash_buckaroo_simple.py
/Users/paddy/anaconda3/envs/buckaroo-dev-6/lib/python3.8/site-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")

INSTALLED VERSIONS
------------------
commit           : 965ceca9fd796940050d6fc817707bba1c4f9bff
python           : 3.8.18.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 22.6.0
Version          : Darwin Kernel Version 22.6.0: Wed Oct  4 21:25:40 PDT 2023; root:xnu-8796.141.3.701.17~4/RELEASE_ARM64_T8103
machine          : arm64
processor        : arm
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 2.0.2
numpy            : 1.24.3
pytz             : 2023.3
dateutil         : 2.8.2
setuptools       : 68.2.2
pip              : 23.3.1
Cython           : None
pytest           : 7.3.2
hypothesis       : None
sphinx           : 7.1.2
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 3.1.2
IPython          : 8.12.2
pandas_datareader: None
bs4              : 4.12.2
bottleneck       : 1.3.5
brotli           : None
fastparquet      : None
fsspec           : None
gcsfs            : None
matplotlib       : 3.7.5
numba            : None
numexpr          : 2.8.4
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : 14.0.2
pyreadstat       : None
pyxlsb           : None
s3fs             : None
scipy            : 1.10.1
snappy           : None
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
zstandard        : None
tzdata           : 2023.3
qtpy             : None
pyqt5            : None
simplified_df
                                         index                                          EventDate
length                                       1                                                  1
min                                          0                                                NaN
max                                          0                                                NaN
mean                                       0.0                                                  0
nan_count                                    0                                                  0
value_counts  0    1
Name: count, dtype: int64  EventDate
2020-02-21    1
Name: count, dtype: ...
mode                                         0                      2020-02-21T00:00:00.000000000
--------------------------------------------------------------------------------
simplified_df.dtypes
index        object
EventDate    object
dtype: object
--------------------------------------------------------------------------------
vc_bad_row
EventDate
2020-02-21    1
Name: count, dtype: int64
--------------------------------------------------------------------------------
type(vc_bad_row)
<class 'pandas.core.series.Series'>
--------------------------------------------------------------------------------
vc_bad_row.dtype
int64
--------------------------------------------------------------------------------
before to_json
zsh: segmentation fault  python crash_buckaroo_simple.py
```
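One way to narrow down which row kills the serializer (a hedged debugging sketch, not part of the original script) is to serialize one row at a time; the last label printed before the segfault points at the offending row:

```python
import pandas as pd

simplified_df = pd.read_pickle('simplified_df.pckl')

# Serialize row by row; flush so the label is visible even though the
# process dies without unwinding when the segfault hits.
for label in simplified_df.index:
    print("serializing row:", label, flush=True)
    simplified_df.loc[[label]].to_json(orient='table')
```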

@paddymul
Contributor Author

I now have a minimal reproducible example:

```python
import pandas as pd
import numpy as np

bad_df = pd.DataFrame({'column': {'foo': 8, 'bad_val': np.datetime64('2005-02-25')}})
bad_df.to_json(orient='table')
```
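For anyone stuck on an affected 2.0.x release, one possible workaround (a hedged sketch, not verified against the exact failing DataFrame) is to stringify the mixed object column before serializing, so the JSON writer never sees a raw `numpy.datetime64` inside an object column:

```python
import pandas as pd
import numpy as np

bad_df = pd.DataFrame({'column': {'foo': 8, 'bad_val': np.datetime64('2005-02-25')}})

# Convert every cell to a plain Python string first; to_json then serializes
# ordinary strings instead of a numpy.datetime64 inside an object column.
safe_df = bad_df.astype(str)
print(safe_df.to_json(orient='table'))
```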

@mroeschke
Member

mroeschke commented Jul 8, 2024

This looks to be fixed on main, so it appears to have been addressed since 2.0.3. Closing.

```python
In [1]: import pandas as pd
   ...: import numpy as np
   ...: bad_df = pd.DataFrame({'column': { 'foo':8, 'bad_val': np.datetime64('2005-02-25')}})
   ...: bad_df.to_json(orient='table')
Out[1]: '{"schema":{"fields":[{"name":"index","type":"string"},{"name":"column","type":"string"}],"primaryKey":["index"],"pandas_version":"1.4.0"},"data":[{"index":"foo","column":8},{"index":"bad_val","column":"2005-02-25T00:00:00.000"}]}'
```
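A small regression-test sketch (hedged; the test name and placement are only suggestions, not an existing pandas test) for keeping this covered:

```python
import numpy as np
import pandas as pd


def test_to_json_table_object_column_with_datetime64():
    # A mixed object column holding an int and a raw numpy.datetime64 used to
    # crash the JSON serializer with orient='table' on pandas 2.0.x (this issue).
    df = pd.DataFrame({'column': {'foo': 8, 'bad_val': np.datetime64('2005-02-25')}})
    result = df.to_json(orient='table')
    assert '"bad_val"' in result
```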
