low_memory=True in read_csv leads to non documented, silent errors #22194
Comments
I think this is the wrong way of thinking about the problem. Pandas doesn't build any kind of intermediate expression tree before executing. I think the problem is entirely with the CSV reader. What's Does the same issue occur if you specify the |
Specifying dtype
|
With |
|
@diegoquintanav As of now, 0.25.1, docs mention that |
Problem description
Say I have a csv file with a "considerable" number of rows, and a column that matches some value, i.e.
Number of rows (including header)
Number of columns
Using pandas, I could read this file using read_csv().
Note that the comparison check is not returning both rows. In other words, low_memory=True silently breaks any further operation that relies on comparison checks, such as slicing a dataframe. In my case, it silently failed to drop the second row when using drop_duplicates(subset="col_12").
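The effect described above can be sketched without a large file. This is only an illustration, assuming the root cause is that chunked parsing left the same value stored as an integer in one chunk and as a string in another; the two-row frame below is a hypothetical stand-in for such a column, not the data from the report.

```python
import pandas as pd

# Hypothetical stand-in for a column parsed chunk by chunk:
# one chunk inferred the value as an integer, another as a string.
df = pd.DataFrame({"col_12": pd.Series([1234, "1234"], dtype=object)})

# Each comparison only matches the row whose Python type agrees with it.
matches_int = df[df["col_12"] == 1234]
matches_str = df[df["col_12"] == "1234"]

# drop_duplicates treats 1234 and "1234" as different values, so nothing is dropped.
deduplicated = df.drop_duplicates(subset="col_12")

# Normalizing the column to a single type makes the duplicate visible.
normalized = df.assign(col_12=df["col_12"].astype(str))
deduplicated_normalized = normalized.drop_duplicates(subset="col_12")

print(len(matches_int), len(matches_str))
print(len(deduplicated), len(deduplicated_normalized))
```

Casting an already loaded column with astype(str) is one possible after-the-fact repair; it does not change how read_csv() infers types.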
Expected Output
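A minimal sketch of the behaviour I would expect, assuming a small hypothetical in-memory CSV rather than the original file: when the column has a single, consistent dtype, the comparison returns both rows and the duplicate is dropped. Both low_memory and dtype are existing read_csv() parameters; the column names and values here are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical two-row CSV; the file in the report was much larger.
csv_text = "col_1,col_12\na,1234\nb,1234\n"

# Option 1: let pandas infer the type from the whole column at once.
whole = pd.read_csv(io.StringIO(csv_text), low_memory=False)
both_rows_inferred = whole[whole["col_12"] == 1234]

# Option 2: state the type explicitly so no inference is needed.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"col_12": str})
both_rows_text = as_text[as_text["col_12"] == "1234"]

# With a consistent type, the duplicate value is detected.
deduplicated = as_text.drop_duplicates(subset="col_12")

print(len(both_rows_inferred), len(both_rows_text), len(deduplicated))
```

Passing dtype keeps the lower memory use of chunked parsing while still avoiding mixed types, at the cost of having to know the column types up front.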
This can be solved using low_memory=False, which consumes more memory but enforces consistent datatypes along the column.
I believe this should be properly documented both in the docs and in the warning currently displayed after the method is called, considering that low_memory is a parameter that is mentioned as "deprecated but working" in other issues. I don't know the current status of the parameter, but the comments in this commit mention that this feature is meant to work on chunks of bytes, which makes the process a lot harder to debug.
Also, what is the behavior in the same case for read_excel()?
Output of pd.show_versions()
pandas: 0.23.3
pytest: None
pip: 18.0
setuptools: 39.0.1
Cython: None
numpy: 1.14.5
scipy: None
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: 2.5.4
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: 1.2.9
pymysql: None
psycopg2: 2.7.5 (dt dec pq3 ext lo64)
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None