BUG: dataframe concatenation segmentation fault (core dump) #39144


Closed

stokhos opened this issue Jan 13, 2021 · 13 comments
Labels
Bug · Needs Info (Clarification about behavior needed to assess issue) · Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) · Segfault (Non-Recoverable Error)

Comments

@stokhos

stokhos commented Jan 13, 2021

  • [ X ] I have checked that this issue has not already been reported.

  • [ X ] I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

    import os

    import pandas as pd

    # file_generator yields (dirpath, filename) pairs for the CSV files to load
    df = pd.concat(
        (
            pd.read_csv(os.path.join(dirpath, filename))
            for dirpath, filename in file_generator
        ),
        ignore_index=True,
    )

Problem description

A subset of my files causes the code above to crash, and it happens 100% of the time with those files on pandas 1.2.0; on 1.1.5 there is no problem.


Expected Output

Output of pd.show_versions()


INSTALLED VERSIONS
------------------
commit : 3e89b4c
python : 3.9.1.final.0
python-bits : 64
OS : Linux
OS-release : 5.9.16-200.fc33.x86_64
Version : #1 SMP Mon Dec 21 14:08:22 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.2.0
numpy : 1.19.5
pytz : 2020.5
dateutil : 2.8.1
pip : 20.2.2
setuptools : 49.1.3
Cython : 0.29.21
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.5.1
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.18.1
pandas_datareader: 0.9.0
bs4 : 4.9.3
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.3.3
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.5.2
sqlalchemy : None
tables : None
tabulate : 0.8.7
xarray : None
xlrd : None
xlwt : None
numba : None

stokhos added the Bug and Needs Triage (Issue that has not been reviewed by a pandas team member) labels on Jan 13, 2021
@phofl
Member

phofl commented Jan 13, 2021

Please provide a reproducible example https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

phofl added the Needs Info (Clarification about behavior needed to assess issue) and Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) labels and removed the Needs Triage (Issue that has not been reviewed by a pandas team member) label on Jan 13, 2021
@stokhos
Author

stokhos commented Jan 13, 2021

@phofl Is there a way for me to pin down the bug on my side? It only happens with files in two folders. I can't produce a minimal bug report because more than 1000 files are involved.

@phofl
Member

phofl commented Jan 13, 2021

You could do a git bisect to narrow down when this was introduced.

Alternatively, it should be sufficient to use only these files? Maybe you could create the DataFrames causing the segfaults in code?

@jorisvandenbossche
Member

@stokhos a good start would probably be to check whether the segfault happens in the concat function or actually in read_csv (my guess would be the latter). If you rewrite your code a little bit over multiple lines, you should be able to see on which line it fails.

In case it's caused by read_csv, you could then try to find out which of the 1000 files is failing to read (e.g. read them in a for loop and print the file path before reading, to see which one is failing).
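A minimal sketch of that split, assuming file_generator yields (dirpath, filename) pairs as in the original snippet:

    import os

    import pandas as pd

    frames = []
    for dirpath, filename in file_generator:
        path = os.path.join(dirpath, filename)
        print(path, flush=True)  # the last path printed before a crash points at the suspect file
        frames.append(pd.read_csv(path))

    # if every file reads cleanly, the failure is in the concat step instead
    df = pd.concat(frames, ignore_index=True)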

jbrockmendel added the Segfault (Non-Recoverable Error) label on Jan 20, 2021
@MarcoGorelli
Member

Closing for now, but please do let us know if you have a reproducible example, as per:

Please provide a reproducible example https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

@djherbis

djherbis commented Mar 3, 2021

I was able to reliably reproduce a segfault with a single-row CSV passed to read_csv on pandas 1.2.0; however, upgrading to 1.2.3 fixed it.

@MarcoGorelli
Member

@djherbis could you share that csv please?

@djherbis

djherbis commented Mar 3, 2021

Unfortunately it's got some private data in it. I'll try to narrow it down to a single row and single column, and scrub the data so that it repros without private info.

@djherbis

djherbis commented Mar 3, 2021

Looks like this may have been the same issue as #38753 (comment), which has since been fixed; using float_precision="legacy" worked even on 1.2.0.

@djherbis

djherbis commented Mar 3, 2021

Yeah, I think it's that issue. I narrowed it down to just this:

echo "id" > crash.csv
echo "0E59181149492B9CCD27179CDB27EBEC" >> crash.csv
import pandas as pd
pd.read_csv('crash.csv')

Looks like pandas treats that string as a number in exponent notation.
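As an illustration on top of the reproducer above (a sketch, not part of the original report): the value starts with a digit, then "E", then more digits, which is the shape of scientific notation, so the parser tries to read the prefix as a float with a very long exponent. The float_precision="legacy" workaround mentioned earlier avoids the crash on 1.2.0, presumably because it selects a different string-to-double converter:

    # plain Python shows why the value looks numeric at first glance
    float("0E59")  # 0.0 -- digit, "E", digits parses as scientific notation

    import pandas as pd

    # workaround reported above for affected 1.2.0 installs
    pd.read_csv("crash.csv", float_precision="legacy")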

MarcoGorelli reopened this on Mar 3, 2021
@MarcoGorelli
Member

Thanks @djherbis - I'll do a bisect later; we may need to add a test for this

@djherbis

djherbis commented Mar 3, 2021

You can check the other bug I linked too; I believe they may have already reported and fixed this particular issue.

@MarcoGorelli
Member

This was fixed in #38789

b438f747d436f7f25ba3e2e04ecf5d70c4121ab4 is the first bad commit
commit b438f747d436f7f25ba3e2e04ecf5d70c4121ab4
Author: mzeitlin11 <[email protected]>
Date:   Wed Dec 30 13:40:59 2020 -0500

    BUG: Fix precise_xstrtod segfault on long exponent (#38789)

so I think we can close - thanks anyway for the useful reproducer!
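For anyone verifying on a release that contains the fix, a quick check (that the value simply comes through as an object/string column is an assumption here, not something stated in the linked PR):

    import io

    import pandas as pd

    data = "id\n0E59181149492B9CCD27179CDB27EBEC\n"
    df = pd.read_csv(io.StringIO(data))
    print(df["id"].dtype, df["id"].iloc[0])  # expect: object dtype with the raw string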
