
low_memory=True in read_csv leads to non documented, silent errors #22194

diegoquintanav opened this issue Aug 3, 2018 · 5 comments
Labels: Docs · IO CSV (read_csv, to_csv)


diegoquintanav commented Aug 3, 2018

Problem description

Say I have a CSV file with a "considerable" number of rows, and a column in which some rows match a given value, e.g.

Note: I've marked the matched values with runs of asterisks (*)

$ grep -a --text -e "11815057" somefile.csv
9999;40043;2;DUMMY1;1;1101;DUMMY2;3;0;110;1;B;****11815057*****;1;20051105;0;1101;DUMMY3;0;0;6,8;97;P;P
9999;40043;2;DUMMY1;1;1101;DUMMY2;3;0;110;1;F;*****11815057******;1;20051105;0;1101;DUMMY3;0;0;0;0;Y;T

Number of rows (including header)

$ wc -l somefile.csv 
3308478 somefile.csv

Number of columns

$ head -1 somefile.csv | column -t -s ";" | wc -w
24

Using pandas, I can read this file with read_csv():

In [81]: df = pd.read_csv("somefile.csv", sep=";", encoding="latin1", low_memory=True)

In [82]: df[df.col_12 == "11815057"]
Out[82]: 
         col_0  col_1  col_2   col_3  col_4  col_5    col_6  col_7  col_8  col_9  col_10 col_11    col_12  col_13    col_14  col_15  col_16   col_17  col_18  col_19 col_20  col_21 col_22 col_23
3276678   9999  40043      2  DUMMY1      1   1101   DUMMY2      3      0    110       1      B  11815057       1  20051105       0    1101   DUMMY3       0       0    6,8      97      P      P


[1 rows x 24 columns]

In [83]: df[df.col_12 == 11815057]
Out[83]: 
         col_0  col_1  col_2   col_3  col_4  col_5    col_6  col_7  col_8  col_9  col_10 col_11    col_12  col_13    col_14  col_15  col_16   col_17  col_18  col_19 col_20  col_21 col_22 col_23
3276879   9999  40043      2  DUMMY1      1   1101   DUMMY2      3      0    110       1      F  11815057       1  20051105       0    1101   DUMMY3       0       0      0       0      Y      T

In [84]: df.shape 
Out[84]: (3308477, 24)   

Note that neither comparison returns both rows: the string check matches one row and the integer check matches the other. In other words, low_memory=True silently breaks any further operation that relies on comparison checks, such as slicing a DataFrame. In my case, drop_duplicates(subset="col_12") was silently failing to drop the second row.
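
To make the failure mode concrete, here is a minimal sketch with hypothetical data (not the actual file) showing how a mixed-type object column splits equality checks:

import pandas as pd

# Hypothetical reproduction: an object column holding the int 11815057 in
# one row and the string "11815057" in another, which is what per-chunk
# dtype inference under low_memory=True can produce.
s = pd.Series([11815057, "11815057"], dtype=object)

print(s == 11815057)    # True only for the int element
print(s == "11815057")  # True only for the str element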

Expected Output

This can be solved with low_memory=False, which consumes more memory but enforces a consistent datatype within each column.

In [85]: df = pd.read_csv("somefile.csv", sep=";", encoding="latin1", low_memory=False)

In [87]: df[df.col_12 == 11815057]
Out[87]: 
Empty DataFrame
Columns: [col_0, col_1, col_2, col_3, col_4, col_5, col_6, col_7, col_8, col_9, col_10, col_11, col_12, col_13, col_14, col_15, col_16, col_17, col_18, col_19, col_20, col_21, col_22, col_23]
Index: []


In [97]: df[df.col_12 == "11815057" ]
Out[97]: 
         col_0  col_1  col_2   col_3  col_4  col_5    col_6  col_7  col_8  col_9  col_10 col_11    col_12  col_13    col_14  col_15  col_16   col_17  col_18  col_19 col_20  col_21 col_22 col_23
3276678   9999  40043      2  DUMMY1      1   1101   DUMMY2      3      0    110       1      B  11815057       1  20051105       0    1101   DUMMY3       0       0    6,8      97      P      P
3276879   9999  40043      2  DUMMY1      1   1101   DUMMY2      3      0    110       1      F  11815057       1  20051105       0    1101   DUMMY3       0       0      0       0      Y      T

I believe this behavior should be properly documented, both in the docs and in the warning displayed when the method is called, considering that low_memory is described as "deprecated but working" in other issues. I don't know the parameter's current status, but the comments in this commit mention that the feature is meant to work on chunks of bytes, which makes the process a lot harder to debug.
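
As I understand it, low_memory=True has the C parser process the file in blocks and infer each block's dtype independently. A rough analogy, not the actual internal code path, can be reproduced with chunksize on hypothetical data:

import io
import pandas as pd

# Hypothetical data: a column that looks numeric in early rows and
# textual in later ones.
csv = "col_12\n" + "\n".join(["11815057"] * 3 + ["B11815057"] * 3)

# Each chunk infers its own dtype, roughly analogous to how
# low_memory=True processes the file in blocks of rows.
for i, chunk in enumerate(pd.read_csv(io.StringIO(csv), chunksize=3)):
    print(i, chunk["col_12"].dtype)  # chunk 0: int64, chunk 1: object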

Also, what is the behavior in the same case but for read_excel()?

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.0.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-29-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.3
pytest: None
pip: 18.0
setuptools: 39.0.1
Cython: None
numpy: 1.14.5
scipy: None
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: 2.5.4
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: 1.2.9
pymysql: None
psycopg2: 2.7.5 (dt dec pq3 ext lo64)
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

jbrockmendel added the Error Reporting and IO CSV labels on Aug 4, 2018
TomAugspurger (Contributor) commented Aug 4, 2018

In other words, low_memory=True breaks silently any kind of further operations that rely on comparison checks, like slicing a dataframe, for instance.

I think this is the wrong way of thinking about the problem. Pandas doesn't build any kind of intermediate expression tree before executing. I think the problem is entirely with the CSV reader.

What's df.dtypes with and without low_memory?

Does the same issue occur if you specify the dtype in read_csv?

diegoquintanav (Author) commented Aug 6, 2018

In [12]: df = pd.read_csv("somefile.csv", sep=";", encoding="latin1")
/home/diego/.pyenv/versions/3.7.0/lib/python3.7/site-packages/IPython/core/interactiveshell.py:2785: DtypeWarning: Columns (12) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

In [13]: df.columns[12]
Out[13]: 'col_12'

In [14]: df.col_12.dtypes
Out[14]: dtype('O')

In [15]: df = pd.read_csv("somefile.csv", sep=";", encoding="latin1", low_memory=False)

In [16]: df.col_12.dtypes
Out[16]: dtype('O')

Specifying dtype

In [19]: df = pd.read_csv("somefile.csv", sep=";", encoding="latin1", dtype={"col_12": np.str})

In [20]: df[df.col_12 == 11815057 ]
Out[20]: 
Empty DataFrame
Columns: [col_0, col_1, col_2, col_3, col_4, col_5, col_6, col_7, col_8, col_9, col_10, col_11, col_12, col_13, col_14, col_15, col_16, col_17, col_18, col_19, col_20, col_21, col_22, col_23]
Index: []

In [21]: df[df.col_12 == "11815057" ]
Out[21]: 
         col_0  col_1  col_2   col_3  col_4  col_5    col_6  col_7  col_8  col_9  col_10 col_11    col_12  col_13    col_14  col_15  col_16   col_17  col_18  col_19 col_20  col_21 col_22 col_23
3276678   9999  40043      2  DUMMY1      1   1101   DUMMY2      3      0    110       1      B  11815057       1  20051105       0    1101   DUMMY3       0       0    6,8      97      P      P
3276879   9999  40043      2  DUMMY1      1   1101   DUMMY2      3      0    110       1      F  11815057       1  20051105       0    1101   DUMMY3       0       0      0       0      Y      T
[2 rows x 24 columns]

chris-b1 (Contributor) commented Aug 7, 2018

With low_memory=True you're getting the right warning - I wouldn't call that silent, but we certainly would take suggestions for a better one if that's not clear.
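
In the meantime, a minimal sketch of one way to surface the problem loudly is to escalate DtypeWarning to an error (the read_csv arguments are the ones from this issue):

import warnings
import pandas as pd
from pandas.errors import DtypeWarning

# Escalate the mixed-dtype warning to an exception so the read fails
# loudly instead of silently producing a mixed-type column.
with warnings.catch_warnings():
    warnings.simplefilter("error", DtypeWarning)
    df = pd.read_csv("somefile.csv", sep=";", encoding="latin1")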

chris-b1 (Contributor) commented Aug 7, 2018

low_memory is not deprecated - it exists for exactly this case - we would also take a PR expanding the docs with a warning showing what can go wrong

jangorecki commented

@diegoquintanav As of now (0.25.1), the docs mention that low_memory is only valid for the C parser. In your code you did not specify engine="c", and from the docs it is not clear whether the "c" or "python" engine is the default, so you might not be using low_memory=True at all. On the other hand, low_memory is documented to default to True, so setting engine="c" should kick in low_memory=True automatically. I haven't looked into the source, so I cannot say whether that is how it actually works; my comment is based on the documentation only.
I tried to optimise read_csv of a 45 GB CSV on a 125 GB machine, but neither dtype, engine, nor low_memory helped.
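
For a file that large, one sketch worth trying, assuming the ambiguous column is known up front (chunksize here is arbitrary), is to pin the dtype and stream the file in chunks:

import pandas as pd

# Pin the ambiguous column's dtype up front and stream the file so that
# only one chunk is in memory at a time; col_12 is the column from this issue.
chunks = pd.read_csv(
    "somefile.csv", sep=";", encoding="latin1",
    dtype={"col_12": str}, chunksize=1_000_000,
)
matches = pd.concat(chunk[chunk["col_12"] == "11815057"] for chunk in chunks)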
