DOC: low_memory in read_csv #13293


Closed · wants to merge 1 commit
7 changes: 7 additions & 0 deletions doc/source/io.rst
@@ -169,6 +169,13 @@
skipfooter : int, default ``0``
Number of lines at bottom of file to skip (unsupported with engine='c').
nrows : int, default ``None``
Number of rows of file to read. Useful for reading pieces of large files.
low_memory : boolean, default ``True``
Internally process the file in chunks, resulting in lower memory use
while parsing, but possibly mixed type inference. To ensure no mixed

What does 'resulting in [...] mixed type inference' mean? Is it implementation-specific which type is chosen? Is it deterministic?

Contributor Author

It is deterministic - types are consistently inferred based on what's in the data. That said, the internal chunksize is not a fixed number of rows, but rather a number of bytes, so whether you get a mixed dtype warning or not can feel a bit random.

import pandas as pd
from io import StringIO

# just ints - no warning
pd.read_csv(StringIO(
    '\n'.join(str(x) for x in range(10000)))
).dtypes
Out[24]: 
0    int64
dtype: object

# mixed dtype - but fits into a single chunk, so no warning and all parsed as strings
pd.read_csv(StringIO(
    '\n'.join([str(x) for x in range(10000)] + ['a string']))
).dtypes
Out[26]: 
0    object
dtype: object

# mixed dtype - doesn't fit into a single chunk
pd.read_csv(StringIO(
    '\n'.join([str(x) for x in range(1000000)] + ['a string']))
).dtypes
DtypeWarning: Columns (0) have mixed types. 
Specify dtype option on import or set low_memory=False.

Out[27]: 
0    object
dtype: object


I see. Thank you for the explanation and the example!


To complete the example, I see:

df = pd.read_csv(StringIO('\n'.join([str(x) for x in range(1000000)] + ['a string'])))
DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.

type(df.loc[524287,'0'])
Out[50]: int

type(df.loc[524288,'0'])
Out[51]: str

The first part of the csv data was seen as only ints, so it was converted
to int; the second part also had a string, so all of its entries were kept
as strings.
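For comparison, a minimal sketch of the two fixes the docstring suggests, reusing the same data (the column is named '0' because the first row is taken as the header; the results follow from the single-chunk behaviour shown above):

import pandas as pd
from io import StringIO

data = '\n'.join([str(x) for x in range(1000000)] + ['a string'])

# low_memory=False parses the whole file as one chunk, so the column is
# inferred once: per the single-chunk example above, every value should be
# kept as a string and no DtypeWarning raised
df = pd.read_csv(StringIO(data), low_memory=False)
type(df.loc[524287, '0'])  # str
type(df.loc[524288, '0'])  # str

# an explicit dtype skips inference entirely
df = pd.read_csv(StringIO(data), dtype={'0': str})
df.dtypes  # 0    object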

types, either set ``False``, or specify the type with the ``dtype`` parameter.
Note that the entire file is read into a single DataFrame regardless;
use the ``chunksize`` or ``iterator`` parameter to return the data in chunks.
(Only valid with C parser)

NA and Missing Data Handling
++++++++++++++++++++++++++++
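The ``chunksize`` / ``iterator`` behaviour mentioned in the entry above, as a minimal sketch (standard ``read_csv`` usage, same data as in the discussion):

import pandas as pd
from io import StringIO

data = StringIO('\n'.join([str(x) for x in range(1000000)] + ['a string']))

# chunksize makes read_csv return an iterator of DataFrames instead of a
# single DataFrame; each chunk's dtypes are inferred independently
for chunk in pd.read_csv(data, chunksize=250000):
    print(len(chunk), chunk.dtypes.tolist())
# the earlier chunks should come back as int64; the final chunk, which
# contains 'a string', as object
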
7 changes: 7 additions & 0 deletions pandas/io/parsers.py
@@ -220,6 +220,13 @@
warn_bad_lines : boolean, default True
If error_bad_lines is False, and warn_bad_lines is True, a warning for each
"bad line" will be output. (Only valid with C parser).
low_memory : boolean, default True
Internally process the file in chunks, resulting in lower memory use
while parsing, but possibly mixed type inference. To ensure no mixed
types, either set False, or specify the type with the `dtype` parameter.
Note that the entire file is read into a single DataFrame regardless;
use the `chunksize` or `iterator` parameter to return the data in chunks.
(Only valid with C parser)

Returns
-------