
low_memory=True in read_csv leads to non documented, silent errors #22194

diegoquintanav opened this issue Aug 3, 2018 · 5 comments
Labels: Docs · IO CSV (read_csv, to_csv)


diegoquintanav commented Aug 3, 2018

Problem description

Say I have a CSV file with a "considerable" number of rows, and a column in which some rows match a given value, e.g.

Note: I've marked the matched values with runs of asterisks (*)

$ grep -a --text -e "11815057" somefile.csv
9999;40043;2;DUMMY1;1;1101;DUMMY2;3;0;110;1;B;****11815057*****;1;20051105;0;1101;DUMMY3;0;0;6,8;97;P;P
9999;40043;2;DUMMY1;1;1101;DUMMY2;3;0;110;1;F;*****11815057******;1;20051105;0;1101;DUMMY3;0;0;0;0;Y;T

Number of rows (including header)

$ wc -l somefile.csv 
3308478 somefile.csv

Number of columns

$ head -1 somefile.csv | column -t -s ";" | wc -w
24

Using pandas, I can read this file with read_csv():

In [81]: df = pd.read_csv("somefile.csv", sep=";", encoding="latin1", low_memory=True)

In [82]: df[df.col_12 == "11815057"]
Out[82]: 
         col_0  col_1  col_2   col_3  col_4  col_5    col_6  col_7  col_8  col_9  col_10 col_11    col_12  col_13    col_14  col_15  col_16   col_17  col_18  col_19 col_20  col_21 col_22 col_23
3276678   9999  40043      2  DUMMY1      1   1101   DUMMY2      3      0    110       1      B  11815057       1  20051105       0    1101   DUMMY3       0       0    6,8      97      P      P


[1 rows x 24 columns]

In [83]: df[df.col_12 == 11815057]
Out[83]: 
         col_0  col_1  col_2   col_3  col_4  col_5    col_6  col_7  col_8  col_9  col_10 col_11    col_12  col_13    col_14  col_15  col_16   col_17  col_18  col_19 col_20  col_21 col_22 col_23
3276879   9999  40043      2  DUMMY1      1   1101   DUMMY2      3      0    110       1      F  11815057       1  20051105       0    1101   DUMMY3       0       0      0       0      Y      T

In [84]: df.shape 
Out[84]: (3308477, 24)   

Note that neither comparison returns both rows: the string check matches one row and the integer check matches the other. In other words, low_memory=True silently breaks any further operation that relies on comparison checks, such as slicing a DataFrame. In my case, drop_duplicates(subset="col_12") was silently failing to drop the second row.
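
To make the failure mode concrete, here is a minimal sketch with hypothetical data (not the actual file) showing how a mixed-type object column splits equality checks:

import pandas as pd

# Hypothetical reproduction: an object column holding the int 11815057 in
# one row and the string "11815057" in another, which is what per-chunk
# dtype inference under low_memory=True can produce.
s = pd.Series([11815057, "11815057"], dtype=object)

print(s == 11815057)    # True only for the int element
print(s == "11815057")  # True only for the str element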

Expected Output

This can be solved with low_memory=False, which consumes more memory but enforces a consistent datatype within each column.

In [85]: df = pd.read_csv("somefile.csv", sep=";", encoding="latin1", low_memory=False)

In [87]: df[df.col_12 == 11815057]
Out[87]: 
Empty DataFrame
Columns: [col_0, col_1, col_2, col_3, col_4, col_5, col_6, col_7, col_8, col_9, col_10, col_11, col_12, col_13, col_14, col_15, col_16, col_17, col_18, col_19, col_20, col_21, col_22, col_23]
Index: []


In [97]: df[df.col_12 == "11815057" ]
Out[97]: 
         col_0  col_1  col_2   col_3  col_4  col_5    col_6  col_7  col_8  col_9  col_10 col_11    col_12  col_13    col_14  col_15  col_16   col_17  col_18  col_19 col_20  col_21 col_22 col_23
3276678   9999  40043      2  DUMMY1      1   1101   DUMMY2      3      0    110       1      B  11815057       1  20051105       0    1101   DUMMY3       0       0    6,8      97      P      P
3276879   9999  40043      2  DUMMY1      1   1101   DUMMY2      3      0    110       1      F  11815057       1  20051105       0    1101   DUMMY3       0       0      0       0      Y      T

I believe this behavior should be properly documented, both in the docs and in the warning displayed when the method is called, considering that low_memory is described as "deprecated but working" in other issues. I don't know the parameter's current status, but the comments in this commit mention that the feature is meant to work on chunks of bytes, which makes the process a lot harder to debug.
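
As I understand it, low_memory=True has the C parser process the file in blocks and infer each block's dtype independently. A rough analogy, not the actual internal code path, can be reproduced with chunksize on hypothetical data:

import io
import pandas as pd

# Hypothetical data: a column that looks numeric in early rows and
# textual in later ones.
csv = "col_12\n" + "\n".join(["11815057"] * 3 + ["B11815057"] * 3)

# Each chunk infers its own dtype, roughly analogous to how
# low_memory=True processes the file in blocks of rows.
for i, chunk in enumerate(pd.read_csv(io.StringIO(csv), chunksize=3)):
    print(i, chunk["col_12"].dtype)  # chunk 0: int64, chunk 1: object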

Also, what is the behavior in the same case but for read_excel()?

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.0.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-29-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.3
pytest: None
pip: 18.0
setuptools: 39.0.1
Cython: None
numpy: 1.14.5
scipy: None
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: 2.5.4
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: 1.2.9
pymysql: None
psycopg2: 2.7.5 (dt dec pq3 ext lo64)
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

jbrockmendel added the Error Reporting and IO CSV labels on Aug 4, 2018
TomAugspurger (Contributor) commented Aug 4, 2018

In other words, low_memory=True breaks silently any kind of further operations that rely on comparison checks, like slicing a dataframe, for instance.

I think this is the wrong way of thinking about the problem. Pandas doesn't build any kind of intermediate expression tree before executing. I think the problem is entirely with the CSV reader.

What's df.dtypes with and without low_memory?

Does the same issue occur if you specify the dtype in read_csv?

diegoquintanav (Author) commented Aug 6, 2018

In [12]: df = pd.read_csv("somefile.csv", sep=";", encoding="latin1")
/home/diego/.pyenv/versions/3.7.0/lib/python3.7/site-packages/IPython/core/interactiveshell.py:2785: DtypeWarning: Columns (12) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

In [13]: df.columns[12]
Out[13]: 'col_12'

In [14]: df.col_12.dtypes
Out[14]: dtype('O')

In [15]: df = pd.read_csv("somefile.csv", sep=";", encoding="latin1", low_memory=False)

In [16]: df.col_12.dtypes
Out[16]: dtype('O')

Specifying dtype

In [19]: df = pd.read_csv("somefile.csv", sep=";", encoding="latin1", dtype={"col_12": np.str})

In [20]: df[df.col_12 == 11815057 ]
Out[20]: 
Empty DataFrame
Columns: [col_0, col_1, col_2, col_3, col_4, col_5, col_6, col_7, col_8, col_9, col_10, col_11, col_12, col_13, col_14, col_15, col_16, col_17, col_18, col_19, col_20, col_21, col_22, col_23]
Index: []

In [21]: df[df.col_12 == "11815057" ]
Out[21]: 
         col_0  col_1  col_2   col_3  col_4  col_5    col_6  col_7  col_8  col_9  col_10 col_11    col_12  col_13    col_14  col_15  col_16   col_17  col_18  col_19 col_20  col_21 col_22 col_23
3276678   9999  40043      2  DUMMY1      1   1101   DUMMY2      3      0    110       1      B  11815057       1  20051105       0    1101   DUMMY3       0       0    6,8      97      P      P
3276879   9999  40043      2  DUMMY1      1   1101   DUMMY2      3      0    110       1      F  11815057       1  20051105       0    1101   DUMMY3       0       0      0       0      Y      T
[2 rows x 24 columns]

chris-b1 (Contributor) commented Aug 7, 2018

With low_memory=True you're getting the right warning - I wouldn't call that silent, but we certainly would take suggestions for a better one if that's not clear.
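
In the meantime, a minimal sketch of one way to surface the problem loudly is to escalate DtypeWarning to an error (the read_csv arguments are the ones from this issue):

import warnings
import pandas as pd
from pandas.errors import DtypeWarning

# Escalate the mixed-dtype warning to an exception so the read fails
# loudly instead of silently producing a mixed-type column.
with warnings.catch_warnings():
    warnings.simplefilter("error", DtypeWarning)
    df = pd.read_csv("somefile.csv", sep=";", encoding="latin1")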

chris-b1 (Contributor) commented Aug 7, 2018

low_memory is not deprecated - it exists for exactly this case - we would also take a PR expanding the docs with a warning showing what can go wrong

jangorecki commented

@diegoquintanav As of now (0.25.1), the docs mention that low_memory is only valid for the C parser. In your code you did not specify engine="c", and from the docs it is not clear whether the "c" or "python" engine is the default, so you might not be using low_memory=True at all. On the other hand, low_memory is documented to default to True, so setting engine="c" should kick in low_memory=True automatically. I haven't looked into the source, so I cannot say whether that is how it actually works; my comment is based on the documentation only.
I tried to optimise read_csv of a 45 GB CSV on a 125 GB machine, but neither dtype, engine, nor low_memory helped.
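
For a file that large, one sketch worth trying, assuming the ambiguous column is known up front (chunksize here is arbitrary), is to pin the dtype and stream the file in chunks:

import pandas as pd

# Pin the ambiguous column's dtype up front and stream the file so that
# only one chunk is in memory at a time; col_12 is the column from this issue.
chunks = pd.read_csv(
    "somefile.csv", sep=";", encoding="latin1",
    dtype={"col_12": str}, chunksize=1_000_000,
)
matches = pd.concat(chunk[chunk["col_12"] == "11815057"] for chunk in chunks)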
