BUG: Occasional "tokenizing data error" when reading in large files with read_csv() #40587
Comments
@normanius When I read the data using any of the methods that "fixed" the problem, I ended up with data in the wrong columns, e.g. the datetime columns were shifted a couple of columns to the left. I can understand the qualms about the inconsistent behavior with slightly different files, but I would think that inconsistent data would be more of an issue. In any case, I defer to those more knowledgeable about pandas.
Correct, the data is inconsistent. Unfortunately, I can fix this only in retrospect, and pandas is my tool of choice here. The problem is actually relatively easy to fix, given that pandas is able to read the file. I created this report because the observed behavior of read_csv() is inconsistent.
Do we know which version introduced this bug? It's really annoying that we can't get around this.
As an explanation of what is going on here: if low_memory=True, the file is read in chunks. Unfortunately, every chunk determines the number of columns for itself. One of your chunks starts with a line that has only 15 columns, hence the error when it later encounters 23. One workaround (if you know the number of columns) is to set the names argument. This forces 23 expected columns and does not raise an error.
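A minimal sketch of this workaround, assuming the attached file has been extracted as sample.csv and using placeholder column names (substitute the 23 names from the actual header):

```python
import pandas as pd

# Hypothetical column names; replace with the 23 names from the real file header.
col_names = [f"col{i}" for i in range(23)]

# Passing names= forces the C parser to expect 23 columns in every chunk,
# so a chunk that happens to start with a corrupted 15-column line no longer
# mis-detects the column count.
df = pd.read_csv(
    "sample.csv",
    names=col_names,
    header=0,          # the file has its own header row; skip it since names are given
    low_memory=True,   # default chunked C-parser behavior
)
```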
I think the problem is obvious - you're running out of memory because you're trying to load so much data into memory at once and then process it. You need to either get a machine with more memory, or process the file in chunks.
The problem with the first approach is that it won't scale indefinitely and is expensive. The second approach is the right way to do it, but needs more coding. Also, if pandas.parser.CParserError: Error tokenizing data is raised when reading a file written by pandas.to_csv(), it might be because there is a carriage return ('\r') in a column name. In that case, to_csv() writes the subsequent column names into the first column of the data frame, which causes a mismatch in the number of columns between the first rows and the rest of the file. This mismatch is one cause of the CParserError.
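A minimal sketch of the second approach (chunked processing), assuming the file path sample.csv and a placeholder per-chunk step; the real processing depends on the data:

```python
import pandas as pd

# Read the file in chunks of 100,000 rows instead of all at once.
results = []
for chunk in pd.read_csv("sample.csv", chunksize=100_000):
    # Placeholder processing step; replace with the actual per-chunk work.
    results.append(chunk.dropna())

# Combine the processed chunks back into one frame if needed.
df = pd.concat(results, ignore_index=True)
```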
This has nothing to do with memory issues. This is an implementation bug occurring when reading in chunks.
I sometimes receive a

Error tokenizing data. C error: ...

for tables that can normally be read with read_csv() without any problems. Find attached the .csv file sample.tar.gz for which I can reproduce the problem. Reading this file with default settings raises the exception shown above.
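For completeness, a minimal sketch of the call that triggers this, assuming the archive has been extracted to sample.csv:

```python
import pandas as pd

# Default settings: C engine, low_memory=True.
# Raises pandas.errors.ParserError ("Error tokenizing data. C error: ...")
# for the attached sample, but not for most of the other, similarly formatted files.
df = pd.read_csv("sample.csv")
```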
The tables I try to read have 23 columns, as declared correctly in the file header. However, the files contain corrupted lines (very few, <0.01% of all lines) where the data for 8 columns is omitted; for those lines, 8 delimiters are missing.
I'm working with about 100 different files containing 1M to 20M lines. All files suffer from the same kind of ill-formatted lines.
read_csv() gracefully handles those lines most of the time. Only for the file provided above does it raise an exception.

I can avoid the exception as follows (see the sketch below):

- engine="python" (slow)
- low_memory=False
- error_bad_lines=False (drops a couple of lines)

In summary, I think read_csv() behaves inconsistently when running with low_memory=True and the C engine. I first thought that the problem is related to issue #11166, but I'm not 100% sure.
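A sketch of the three workarounds listed above, assuming the extracted file is sample.csv; error_bad_lines is the pandas 1.2.x spelling and is superseded by on_bad_lines in later versions:

```python
import pandas as pd

# 1. Python engine: slower, but avoids the chunked C parser.
df = pd.read_csv("sample.csv", engine="python")

# 2. Keep the C engine but read the file in a single pass instead of in chunks.
df = pd.read_csv("sample.csv", low_memory=False)

# 3. Skip malformed lines entirely (drops the corrupted rows).
#    In pandas >= 1.3 use on_bad_lines="skip" instead.
df = pd.read_csv("sample.csv", error_bad_lines=False)
```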
sample.tar.gz
I'm running Python 3.8 and pandas 1.2.3. See details below.
Expected Output

No exception for file sample.csv, regardless of the settings for engine and low_memory.

System