BUG: dataframe concatenation segmentation fault (core dump) #39144


Closed

stokhos opened this issue Jan 13, 2021 · 13 comments
Labels
Bug · Needs Info (Clarification about behavior needed to assess issue) · Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) · Segfault (Non-Recoverable Error)

Comments

@stokhos

stokhos commented Jan 13, 2021

  • [ X ] I have checked that this issue has not already been reported.

  • [ X ] I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

    import os

    import pandas as pd

    # file_generator yields (dirpath, filename) pairs for the CSV files to load
    df = pd.concat(
        (
            pd.read_csv(os.path.join(dirpath, filename))
            for dirpath, filename in file_generator
        ),
        ignore_index=True,
    )

Problem description

A subset of my files causes the code above to crash, and it happens 100% of the time with those files on pandas 1.2.0; on 1.1.5 there is no problem.


Expected Output

Output of pd.show_versions()


INSTALLED VERSIONS
------------------
commit : 3e89b4c
python : 3.9.1.final.0
python-bits : 64
OS : Linux
OS-release : 5.9.16-200.fc33.x86_64
Version : #1 SMP Mon Dec 21 14:08:22 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.2.0
numpy : 1.19.5
pytz : 2020.5
dateutil : 2.8.1
pip : 20.2.2
setuptools : 49.1.3
Cython : 0.29.21
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.5.1
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.18.1
pandas_datareader: 0.9.0
bs4 : 4.9.3
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.3.3
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.5.2
sqlalchemy : None
tables : None
tabulate : 0.8.7
xarray : None
xlrd : None
xlwt : None
numba : None

stokhos added the Bug and Needs Triage (Issue that has not been reviewed by a pandas team member) labels on Jan 13, 2021
@phofl
Member

phofl commented Jan 13, 2021

Please provide a reproducible example https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

phofl added the Needs Info (Clarification about behavior needed to assess issue) and Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) labels and removed the Needs Triage (Issue that has not been reviewed by a pandas team member) label on Jan 13, 2021
@stokhos
Author

stokhos commented Jan 13, 2021

@phofl Is there a way for me to pin down the bug on my side? It only happens with files in two folders. I can't produce a minimal bug report because more than 1000 files are involved.

@phofl
Member

phofl commented Jan 13, 2021

You could do a git bisect to narrow down when this was introduced.

Alternatively, it should be sufficient to use only these files? Maybe you could create the DataFrames causing the segfaults in code?

@jorisvandenbossche
Member

@stokhos a good start would probably be to check whether the segfault happens in the concat function or actually in read_csv (my guess would be the latter). If you rewrite your code a little bit over multiple lines, you should be able to see on which line it fails.

In case it's caused by read_csv, you could then try to find out which of the 1000 files is failing to read (e.g. read them in a for loop and print the file path before reading, to see which one is failing).
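A minimal sketch of that split, assuming file_generator yields (dirpath, filename) pairs as in the original snippet:

    import os

    import pandas as pd

    frames = []
    for dirpath, filename in file_generator:
        path = os.path.join(dirpath, filename)
        print(path, flush=True)  # the last path printed before a crash points at the suspect file
        frames.append(pd.read_csv(path))

    # if every file reads cleanly, the failure is in the concat step instead
    df = pd.concat(frames, ignore_index=True)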

jbrockmendel added the Segfault (Non-Recoverable Error) label on Jan 20, 2021
@MarcoGorelli
Member

Closing for now, but please do let us know if you have a reproducible example, as per:

Please provide a reproducible example https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

@djherbis

djherbis commented Mar 3, 2021

I was able to reliably reproduce a segfault with a single-row CSV passed to read_csv on pandas 1.2.0; however, upgrading to 1.2.3 fixed it.

@MarcoGorelli
Member

@djherbis could you share that csv please?

@djherbis

djherbis commented Mar 3, 2021

Unfortunately it's got some private data in it. I'll try to narrow it down to a single row and single column, and scrub the data so that it repros without private info.

@djherbis

djherbis commented Mar 3, 2021

Looks like this may have been the same issue as #38753 (comment), which has since been fixed; using float_precision="legacy" worked even on 1.2.0.

@djherbis

djherbis commented Mar 3, 2021

Yeah, I think it's that issue. I narrowed it down to just this:

echo "id" > crash.csv
echo "0E59181149492B9CCD27179CDB27EBEC" >> crash.csv
import pandas as pd
pd.read_csv('crash.csv')

Looks like pandas treats that string as a number in exponent notation.
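As an illustration on top of the reproducer above (a sketch, not part of the original report): the value starts with a digit, then "E", then more digits, which is the shape of scientific notation, so the parser tries to read the prefix as a float with a very long exponent. The float_precision="legacy" workaround mentioned earlier avoids the crash on 1.2.0, presumably because it selects a different string-to-double converter:

    # plain Python shows why the value looks numeric at first glance
    float("0E59")  # 0.0 -- digit, "E", digits parses as scientific notation

    import pandas as pd

    # workaround reported above for affected 1.2.0 installs
    pd.read_csv("crash.csv", float_precision="legacy")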

MarcoGorelli reopened this on Mar 3, 2021
@MarcoGorelli
Member

Thanks @djherbis - I'll do a bisect later; we may need to add a test for this

@djherbis

djherbis commented Mar 3, 2021

You can check the other bug I linked too; I believe they may have already reported and fixed this particular issue.

@MarcoGorelli
Member

This was fixed in #38789

b438f747d436f7f25ba3e2e04ecf5d70c4121ab4 is the first bad commit
commit b438f747d436f7f25ba3e2e04ecf5d70c4121ab4
Author: mzeitlin11 <[email protected]>
Date:   Wed Dec 30 13:40:59 2020 -0500

    BUG: Fix precise_xstrtod segfault on long exponent (#38789)

so I think we can close - thanks anyway for the useful reproducer!
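For anyone verifying on a release that contains the fix, a quick check (that the value simply comes through as an object/string column is an assumption here, not something stated in the linked PR):

    import io

    import pandas as pd

    data = "id\n0E59181149492B9CCD27179CDB27EBEC\n"
    df = pd.read_csv(io.StringIO(data))
    print(df["id"].dtype, df["id"].iloc[0])  # expect: object dtype with the raw string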
