
BUG: Segfault when calling df.to_json(orient='records') and numpy.datetime64 being serialized #58160


Closed
2 of 3 tasks
paddymul opened this issue Apr 5, 2024 · 5 comments
Labels: Bug, Needs Triage (Issue that has not been reviewed by a pandas team member)

Comments

@paddymul
Contributor

paddymul commented Apr 5, 2024

### Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

### Reproducible Example

Reproducing this bug has been very difficult. I have a test script that runs against my library repo and triggers the failure, but when I try to make a reduced test case it doesn't fail. Given that I'm filing this against an old version of pandas, I'll just describe the bug to start. I'm happy to work with you to produce a reduced test case.

This is the code that should cause a segfault:

```python
import pandas as pd

ts = pd.Timestamp('2020-02-21T00:00:00.000000000')
np_dt = pd.Series([ts]).mode().values[0]
sdf = pd.DataFrame({'a': [ts], 'b': [np_dt]}, index=['mode'])
sdf.to_json(orient='table', indent=2, default_handler=str)
```


My actual code in the library (with print instrumentation)
```python
def pd_to_obj(df:pd.DataFrame):
    print("--------------")
    print("here ser 86")

    if len(df) > 4:
        df = df[20:21]
        print(df)
        val = df.iloc[0][0]
        print("val", val)
        print("type(val)", type(val))

    print("df.index", df.index)
    print("len df", len(df))
    serialized = df.to_json(orient='table')
    print("here ser 88")
```

This fails at the `df.to_json` call. Here is the debugging output:

```
--------------
here ser 86
                          EventDate
mode  2020-02-21T00:00:00.000000000
val 2020-02-21T00:00:00.000000000
type(val) <class 'numpy.datetime64'>
df.index Index(['mode'], dtype='object')
len df 1
zsh: segmentation fault  python crash_buckaroo.py
```
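For what it's worth, the crash produces no Python traceback. A small debugging aid (just a sketch using the standard-library `faulthandler` module, not part of the repro) can at least dump the Python stack when the fatal signal arrives:

```python
import faulthandler

# Dump the Python traceback if the interpreter receives a fatal signal
# (e.g. SIGSEGV), so the crashing call site is visible before the process dies.
faulthandler.enable()

import pandas as pd

# ... build the DataFrame exactly as above, then call the serializer:
# df.to_json(orient='table')
```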


### Issue Description

Pandas 2.0.3 segfaults when executing `to_json(orient='records')`

### Expected Behavior

There shouldn't be a segfault

### Installed Versions

<details>

INSTALLED VERSIONS
------------------
commit           : 0f437949513225922d851e9581723d82120684a6
python           : 3.10.14.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 22.6.0
Version          : Darwin Kernel Version 22.6.0: Wed Oct  4 21:25:40 PDT 2023; root:xnu-8796.141.3.701.17~4/RELEASE_ARM64_T8103
machine          : arm64
processor        : arm
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 2.0.3
numpy            : 1.26.4
pytz             : 2024.1
dateutil         : 2.9.0.post0
setuptools       : 68.2.2
pip              : 23.3.1
Cython           : None
pytest           : 8.1.1
hypothesis       : None
sphinx           : 7.2.6
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 3.1.3
IPython          : 8.23.0
pandas_datareader: None
bs4              : 4.12.3
bottleneck       : None
brotli           : None
fastparquet      : None
fsspec           : None
gcsfs            : None
matplotlib       : 3.8.4
numba            : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pyreadstat       : None
pyxlsb           : None
s3fs             : None
scipy            : 1.13.0
snappy           : None
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
zstandard        : None
tzdata           : 2024.1
qtpy             : None
pyqt5            : None

</details>
@aabhinavg

aabhinavg commented Apr 8, 2024

System information:

```
OS: macOS-13.5-arm64-arm-64bit
Architecture: ('64bit', '')
Machine: arm64
Processor: arm
Byte order: little

Locale information:
LC_ALL: (None, 'UTF-8')
LANG: (None, 'UTF-8')

Package versions:
pandas: 3.0.0.dev0+680.gb8a4691647
numpy: 1.24.2
pytz: 2023.3
dateutil: 2.8.2
setuptools: 58.0.4
pip: 24.0
pytest: 8.1.1
IPython: 8.18.1
matplotlib: 3.6.3
scipy: 1.10.1
```

Test case I used:

```python
import pandas as pd

# Create a Pandas DataFrame
ts = pd.Timestamp('2020-02-21T00:00:00.000000000')
np_dt = pd.Series([ts]).mode().values[0]
sdf = pd.DataFrame({'a': [ts], 'b': [np_dt]}, index=['mode'])

# Print the DataFrame
print("DataFrame:")
print(sdf)

# Save the DataFrame as a JSON file
json_filename = "data.json"
sdf.to_json(json_filename, orient='table', indent=2, default_handler=str)
print(f"DataFrame saved as JSON file: {json_filename}")
```

Output of `python3 58160.py`:

```
DataFrame:
              a          b
mode 2020-02-21 2020-02-21
DataFrame saved as JSON file: data.json
```

data.json content:

```json
{
  "schema": {
    "fields": [
      { "name": "index", "type": "string" },
      { "name": "a", "type": "datetime" },
      { "name": "b", "type": "datetime" }
    ],
    "primaryKey": ["index"],
    "pandas_version": "1.4.0"
  },
  "data": [
    {
      "index": "mode",
      "a": "2020-02-21T00:00:00.000",
      "b": "2020-02-21T00:00:00.000"
    }
  ]
}
```
I am not able to see any problem with the above-mentioned configuration. Can you try it?

@paddymul
Contributor Author

paddymul commented Apr 9, 2024

As I said in the bug report, the approximate reproduction code doesn't cause a segfault. I also included the code from my application next to where the segfault happens, along with debugging print statements. I think I'm constructing a DataFrame with the same values as the failing one, but the failing DataFrame is the result of a series of summary functions being applied to a DataFrame, with all of the summary values assembled into a single DataFrame. The memory layout of the failing DataFrame is probably different from the value-equivalent one. I don't know how to succinctly reproduce the failing DataFrame.

I'm happy to dig into this problem with whoever is interested. You might get an interesting test case out of it.
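For context, the failing DataFrame is built roughly like the sketch below: summary-function results are stacked into a single object-dtype frame, so one column ends up holding ints, a nested `value_counts` Series, and a raw `numpy.datetime64`. This is illustrative only (the column name and the exact set of summary functions are made up), and it may well not trigger the crash on its own, which is exactly the reduction problem:

```python
import pandas as pd

# Illustrative sketch, not the exact library code.
dt_series = pd.Series(pd.to_datetime(['2020-02-21']), name='EventDate')

summary_col = pd.Series(
    {
        'length': len(dt_series),                  # int
        'nan_count': int(dt_series.isna().sum()),  # int
        'value_counts': dt_series.value_counts(),  # an entire Series stored in one cell
        'mode': dt_series.mode().values[0],        # numpy.datetime64
    },
    dtype=object,
)
summary = summary_col.to_frame(name='EventDate')

print(summary.dtypes)            # object
summary.to_json(orient='table')  # the call that segfaults in my app on 2.0.x
```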

@paddymul
Contributor Author

I created a pickle file with a simplified dataframe that can reproduce the segmentation fault: simplified_df.pckl

The `value_counts` row is the problem. The value for `EventDate` is the result of calling `datetime_series.value_counts()`, so I think it's a dataframe.

Here is the test script

```python
import pandas as pd

simplified_df = pd.read_pickle('simplified_df.pckl')
pd.show_versions()
print("simplified_df")
print(simplified_df)
print("-" * 80)
print("simplified_df.dtypes")
print(simplified_df.dtypes)
print("-" * 80)
vc_bad_row = simplified_df['EventDate'].loc['value_counts']
print("vc_bad_row")
print(vc_bad_row)
print("-" * 80)
print("type(vc_bad_row)")
print(type(vc_bad_row))
print("-" * 80)
print("vc_bad_row.dtype")
print(vc_bad_row.dtype)
print("-" * 80)
print("before to_json")
simplified_df.to_json(orient='table')
print("after  to_json")
```

which outputs the following on my machine

```
python crash_buckaroo_simple.py
/Users/paddy/anaconda3/envs/buckaroo-dev-6/lib/python3.8/site-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")

INSTALLED VERSIONS
------------------
commit           : 965ceca9fd796940050d6fc817707bba1c4f9bff
python           : 3.8.18.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 22.6.0
Version          : Darwin Kernel Version 22.6.0: Wed Oct  4 21:25:40 PDT 2023; root:xnu-8796.141.3.701.17~4/RELEASE_ARM64_T8103
machine          : arm64
processor        : arm
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 2.0.2
numpy            : 1.24.3
pytz             : 2023.3
dateutil         : 2.8.2
setuptools       : 68.2.2
pip              : 23.3.1
Cython           : None
pytest           : 7.3.2
hypothesis       : None
sphinx           : 7.1.2
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 3.1.2
IPython          : 8.12.2
pandas_datareader: None
bs4              : 4.12.2
bottleneck       : 1.3.5
brotli           : None
fastparquet      : None
fsspec           : None
gcsfs            : None
matplotlib       : 3.7.5
numba            : None
numexpr          : 2.8.4
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : 14.0.2
pyreadstat       : None
pyxlsb           : None
s3fs             : None
scipy            : 1.10.1
snappy           : None
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
zstandard        : None
tzdata           : 2023.3
qtpy             : None
pyqt5            : None
simplified_df
                                         index                                          EventDate
length                                       1                                                  1
min                                          0                                                NaN
max                                          0                                                NaN
mean                                       0.0                                                  0
nan_count                                    0                                                  0
value_counts  0    1
Name: count, dtype: int64  EventDate
2020-02-21    1
Name: count, dtype: ...
mode                                         0                      2020-02-21T00:00:00.000000000
--------------------------------------------------------------------------------
simplified_df.dtypes
index        object
EventDate    object
dtype: object
--------------------------------------------------------------------------------
vc_bad_row
EventDate
2020-02-21    1
Name: count, dtype: int64
--------------------------------------------------------------------------------
type(vc_bad_row)
<class 'pandas.core.series.Series'>
--------------------------------------------------------------------------------
vc_bad_row.dtype
int64
--------------------------------------------------------------------------------
before to_json
zsh: segmentation fault  python crash_buckaroo_simple.py
```
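One way to narrow down which row kills the serializer (a hedged debugging sketch, not part of the original script) is to serialize one row at a time; the last label printed before the segfault points at the offending row:

```python
import pandas as pd

simplified_df = pd.read_pickle('simplified_df.pckl')

# Serialize row by row; flush so the label is visible even though the
# process dies without unwinding when the segfault hits.
for label in simplified_df.index:
    print("serializing row:", label, flush=True)
    simplified_df.loc[[label]].to_json(orient='table')
```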

@paddymul
Contributor Author

I now have a minimal reproducible example:

```python
import pandas as pd
import numpy as np

bad_df = pd.DataFrame({'column': {'foo': 8, 'bad_val': np.datetime64('2005-02-25')}})
bad_df.to_json(orient='table')
```
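For anyone stuck on an affected 2.0.x release, one possible workaround (a hedged sketch, not verified against the exact failing DataFrame) is to stringify the mixed object column before serializing, so the JSON writer never sees a raw `numpy.datetime64` inside an object column:

```python
import pandas as pd
import numpy as np

bad_df = pd.DataFrame({'column': {'foo': 8, 'bad_val': np.datetime64('2005-02-25')}})

# Convert every cell to a plain Python string first; to_json then serializes
# ordinary strings instead of a numpy.datetime64 inside an object column.
safe_df = bad_df.astype(str)
print(safe_df.to_json(orient='table'))
```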

@mroeschke
Member

mroeschke commented Jul 8, 2024

This looks to be fixed on main, so it appears to have been addressed since 2.0.3. Closing.

```python
In [1]: import pandas as pd
   ...: import numpy as np
   ...: bad_df = pd.DataFrame({'column': { 'foo':8, 'bad_val': np.datetime64('2005-02-25')}})
   ...: bad_df.to_json(orient='table')
Out[1]: '{"schema":{"fields":[{"name":"index","type":"string"},{"name":"column","type":"string"}],"primaryKey":["index"],"pandas_version":"1.4.0"},"data":[{"index":"foo","column":8},{"index":"bad_val","column":"2005-02-25T00:00:00.000"}]}'
```
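A small regression-test sketch (hedged; the test name and placement are only suggestions, not an existing pandas test) for keeping this covered:

```python
import numpy as np
import pandas as pd


def test_to_json_table_object_column_with_datetime64():
    # A mixed object column holding an int and a raw numpy.datetime64 used to
    # crash the JSON serializer with orient='table' on pandas 2.0.x (this issue).
    df = pd.DataFrame({'column': {'foo': 8, 'bad_val': np.datetime64('2005-02-25')}})
    result = df.to_json(orient='table')
    assert '"bad_val"' in result
```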
