BUG: Occasional "tokenizing data error" when reading in large files with read_csv() #40587
Comments
@normanius When I read the data using any of the methods that "fixed" the problem, I ended up with data in the wrong columns, e.g. the datetime columns were shifted a couple of columns to the left. I can understand the qualms about the inconsistent behavior with slightly different files, but I would think that inconsistent data would be more of an issue. In any case, I defer to those more knowledgeable about pandas.
Correct, the data is inconsistent. Unfortunately, I can fix this only in retrospect, and pandas is my tool of choice here. The problem is actually relatively easy to fix, given that pandas is able to read the file. I created this report because the observed behavior of read_csv() is inconsistent.
Do we know which version introduced this bug? It's really annoying that we can't get around this.
As an explanation of what is going on here: if low_memory=True, the file is read in chunks. Unfortunately, every chunk determines the number of columns for itself. One of your chunks starts with a line that has only 15 columns, hence the error when it later encounters 23. One workaround (if you know the number of columns) is to set the names argument. This forces 23 expected columns and does not raise an error.
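A minimal sketch of this workaround, assuming the attached file has been extracted as sample.csv and using placeholder column names (substitute the 23 names from the actual header):

```python
import pandas as pd

# Hypothetical column names; replace with the 23 names from the real file header.
col_names = [f"col{i}" for i in range(23)]

# Passing names= forces the C parser to expect 23 columns in every chunk,
# so a chunk that happens to start with a corrupted 15-column line no longer
# mis-detects the column count.
df = pd.read_csv(
    "sample.csv",
    names=col_names,
    header=0,          # the file has its own header row; skip it since names are given
    low_memory=True,   # default chunked C-parser behavior
)
```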
I think the problem is obvious - you're running out of memory because you're trying to load so much data into memory at once and then process it. You need to either get a machine with more memory, or process the file in chunks.
The problem with the first approach is that it won't scale indefinitely and is expensive. The second approach is the right way to do it, but needs more coding. Also, if pandas.parser.CParserError: Error tokenizing data is raised when reading a file written by pandas.to_csv(), it might be because there is a carriage return ('\r') in a column name. In that case, to_csv() writes the subsequent column names into the first column of the data frame, which causes a mismatch in the number of columns between the first rows and the rest of the file. This mismatch is one cause of the CParserError.
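A minimal sketch of the second approach (chunked processing), assuming the file path sample.csv and a placeholder per-chunk step; the real processing depends on the data:

```python
import pandas as pd

# Read the file in chunks of 100,000 rows instead of all at once.
results = []
for chunk in pd.read_csv("sample.csv", chunksize=100_000):
    # Placeholder processing step; replace with the actual per-chunk work.
    results.append(chunk.dropna())

# Combine the processed chunks back into one frame if needed.
df = pd.concat(results, ignore_index=True)
```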
This has nothing to do with memory issues. This is an implementation bug occurring when reading in chunks.
I sometimes receive a

Error tokenizing data. C error: ...

for tables that can normally be read with read_csv() without any problems. Find attached the .csv file sample.tar.gz for which I can reproduce the problem. Reading this file with default settings raises the exception shown above.
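For completeness, a minimal sketch of the call that triggers this, assuming the archive has been extracted to sample.csv:

```python
import pandas as pd

# Default settings: C engine, low_memory=True.
# Raises pandas.errors.ParserError ("Error tokenizing data. C error: ...")
# for the attached sample, but not for most of the other, similarly formatted files.
df = pd.read_csv("sample.csv")
```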
The tables I try to read have 23 columns, as declared correctly in the file header. However, the files contain corrupted lines (very few, <0.01% of all lines) where the data for 8 columns is omitted; for those lines, 8 delimiters are missing.
I'm working with about 100 different files containing 1M to 20M lines. All files suffer from the same kind of ill-formatted lines.
read_csv() gracefully handles those lines most of the time. Only for the file provided above does it raise an exception.

I can avoid the exception as follows (see the sketch below):

- engine="python" (slow)
- low_memory=False
- error_bad_lines=False (drops a couple of lines)

In summary, I think read_csv() behaves inconsistently when running with low_memory=True and the C engine. I first thought that the problem is related to issue #11166, but I'm not 100% sure.
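A sketch of the three workarounds listed above, assuming the extracted file is sample.csv; error_bad_lines is the pandas 1.2.x spelling and is superseded by on_bad_lines in later versions:

```python
import pandas as pd

# 1. Python engine: slower, but avoids the chunked C parser.
df = pd.read_csv("sample.csv", engine="python")

# 2. Keep the C engine but read the file in a single pass instead of in chunks.
df = pd.read_csv("sample.csv", low_memory=False)

# 3. Skip malformed lines entirely (drops the corrupted rows).
#    In pandas >= 1.3 use on_bad_lines="skip" instead.
df = pd.read_csv("sample.csv", error_bad_lines=False)
```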
sample.tar.gz
I'm running Python 3.8 and pandas 1.2.3. See details below.
Expected Output

No exception for file sample.csv, regardless of the settings for engine and low_memory.

System