Segmentation fault or UnicodeDecodeError when reading csv-file depending on chunksize. #5291
Comments
I'm using Python version 2.7.5 and pandas version 0.12.0. |
@Hedendahl I'm not getting this error in master.
Can you confirm? |
May have been this: #5156 |
When I execute the code in question and let the chunksize be 999 rows, I get the following output,
If I change the chunksize to 998 rows, I instead get
|
I tried the same test (switching to 998) on master and it worked. @Hedendahl, could you try this on the latest master rather than version 0.12? You can download it here: |
Can you open a PR with a test based on this issue? |
Yes, but is it necessary? On 0.18.1 this fault occurs only for engine="c", even without the encoding='utf-8' parameter, because the data is ASCII anyway. This code reproduces the issue:

import pandas as pd
from cStringIO import StringIO
import pandas.util.testing as tm

def test_issue5291(self):
    # This test recreates the segfault in issue #5291.
    n_rows, n_columns, chunksize = 10000, 20, 999
    expected = pd.DataFrame([[float(row)] * n_columns
                             for row in range(n_rows)],
                            dtype=float, columns=None, index=None)
    # Now read the same dataframe in chunks from a CSV with \r\n endings.
    data = "\r\n".join(",".join(["%.1f" % (r,)] * n_columns)
                       for r in range(n_rows)) + "\r\n"
    chunks_ = self.read_csv(StringIO(data), header=None,
                            chunksize=chunksize)
    result = pd.concat(chunks_, axis=0, ignore_index=True)
    # Check for data corruption if there was no segfault.
    tm.assert_frame_equal(result, expected)

# Passing the pandas module as `self` makes self.read_csv resolve
# to pd.read_csv, so the function also runs outside the test suite.
test_issue5291(pd)

It is identical to

edit: The fault in the original example by @Hedendahl and in this one does not occur on the current master build. |
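For anyone re-running this snippet today: it is Python 2 code, so on Python 3 the cStringIO import becomes io.StringIO, and pandas.util.testing was later renamed to pandas.testing.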
Why not add a test? |
Actually, instead just take the original test and run it with and without encoding; see the sketch below. |
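A minimal sketch of what that parametrization might look like, written for Python 3; the helper name and file-free setup are illustrative, not the actual pandas test-suite code:

import io
import pandas as pd
import pandas.util.testing as tm

def roundtrip(encoding):
    # Same shape as the repro above: 10000 x 20 floats, \r\n line endings.
    n_rows, n_columns, chunksize = 10000, 20, 999
    expected = pd.DataFrame([[float(r)] * n_columns for r in range(n_rows)])
    data = "\r\n".join(",".join(["%.1f" % (r,)] * n_columns)
                       for r in range(n_rows)) + "\r\n"
    chunks = pd.read_csv(io.StringIO(data), header=None,
                         chunksize=chunksize, encoding=encoding)
    result = pd.concat(chunks, axis=0, ignore_index=True)
    tm.assert_frame_equal(result, expected)

# Exercise the parser both with and without an explicit encoding.
for encoding in (None, "utf-8"):
    roundtrip(encoding)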
No, separate things go in separate PRs. |
xref pandas-devgh-13833. Closes pandas-devgh-5291. (cherry picked from commit 6f0ff1a)
I have encountered an issue with the csv parser, pandas.io.parsers.read_csv. I get a segmentation fault or a UnicodeDecodeError when reading a csv-file in chunks, and the problem seems to depend on the size of the chunks.
Consider the following code:
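(The snippet did not survive this copy of the thread. Based on the test posted above, which its author describes as identical to this example, the original was presumably something along these lines; the file name and exact dimensions are assumptions.)

# -*- coding: utf-8 -*-
import pandas as pd

n_rows, n_columns, chunksize = 10000, 20, 999

# Write a CSV of floats with \r\n line endings (file name is hypothetical).
frame = pd.DataFrame([[float(r)] * n_columns for r in range(n_rows)])
frame.to_csv('data.csv', header=False, index=False,
             line_terminator='\r\n', encoding='utf-8')

# Read it back in chunks: on pandas 0.12 / Python 2.7, chunksize=999
# segfaults, 998 raises UnicodeDecodeError, and 1000 works.
chunks = pd.read_csv('data.csv', header=None,
                     chunksize=chunksize, encoding='utf-8')
result = pd.concat(chunks, ignore_index=True)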
I get a segmentation fault from the attached code when the chunksize is 999 rows. If the chunksize is decreased to 998 rows, I instead get a UnicodeDecodeError. If the chunksize is increased to 1000 rows, the csv-file is read without problems. My first guess was that the problem appears when the last chunk includes too few rows, but I was surprised when reading the csv-file with the following setting,
worked properly.
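For what it's worth, a quick check of the last-chunk sizes involved, assuming the 10,000-row file from the reproduction above:

# Rows left over in the final chunk for each chunksize (10,000 rows total).
for chunksize in (999, 998, 1000):
    full_chunks, remainder = divmod(10000, chunksize)
    print(chunksize, "->", remainder or chunksize, "rows in the last chunk")
# 999 -> 10, 998 -> 20, 1000 -> 1000 (divides evenly)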