Segfault in pd.read_csv() using chunksize parameter #11793
Comments
Were you able to reproduce this bug in 0.17.0 or 0.16.2?
@jdeschenes Yes, looks like I can reproduce it for OSX Yosemite on 0.17.0 and 0.16.2. It works fine for both versions on Ubuntu 14.04.
this is prob a dupe of #9726 (though your examples are better!)
I tried to look into this a little more. I think the segfault is occurring in ...
    if na_filter:
        for i in range(lines):
            COLITER_NEXT(it, word)
            k = kh_get_str(na_hashset, word)
            # in the hash table
... I thought that, given the bug is OSX only, maybe we ran into a compiler quirk with clang. I can't repro with clang on Linux on a recent checkout, though.
This one is resolved by PR #13788. I was able to reproduce the crash on 0.18.1 (same crash report as in issue #13703). Running on the build with the mentioned PR did not crash. edit: The symptoms described by @OEP are exactly the same as in the mentioned issue: same invalid data returned by
As mentioned in #11793 (comment), this has been fixed.
Here is my repro script:
and I am attaching small.csv.gz as the smallest data set I know of that reproduces this segfault. Running
python repro.py small.csv.gz
reproduces the segfault in 0.17.1 on OSX Yosemite. I can't reproduce it with 0.13.1 or 0.17.1 on Ubuntu 14.04. Removing chunksize works normally with that file.

I tried my best to narrow it down. You can edit this file down to under 2000 lines and the segfault does not occur. Once it goes over 2000 lines I start to see the segfault. I can add lines 1000 at a time and notice the segfault is intermittent (I see it again at 6002 lines). It seems to me that if the number of items in the file is a multiple of chunksize, the segfault does not occur.

I installed via pip install pandas. I also repro'd this on latest master (43edd83) on OSX Yosemite.
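For anyone wanting to probe the line-count dependence described in this report, here is a minimal sketch. It is not the original repro script, which did not survive on this page. The helper names write_csv_gz and count_rows_chunked, the three-column layout, and the default chunksize of 1000 are all assumptions made for illustration; the actual chunksize and data in small.csv.gz may differ.

```python
import csv
import gzip


def write_csv_gz(path, n_rows, n_cols=3):
    """Write a gzip-compressed CSV with one header line and n_rows data lines.

    Hypothetical helper: the column names and integer values are made up,
    they do not mirror the contents of small.csv.gz.
    """
    with gzip.open(path, "wt", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["col%d" % j for j in range(n_cols)])
        for i in range(n_rows):
            writer.writerow([i * n_cols + j for j in range(n_cols)])


def count_rows_chunked(path, chunksize=1000):
    """Iterate over the file with pd.read_csv(..., chunksize=...) and count rows.

    chunksize=1000 is an assumed value, not taken from the original report.
    """
    import pandas as pd  # imported lazily so the generator works without pandas

    total = 0
    for chunk in pd.read_csv(path, compression="gzip", chunksize=chunksize):
        total += len(chunk)
    return total
```

One way to use it: generate files with, say, 1999, 2000, 2001 and 6002 data rows, then call count_rows_chunked on each and note which sizes crash and which return the expected row count. Synthetic data like this may well not trigger the segfault at all, since the crash could depend on the actual contents of the attached file.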