DOC: low_memory in read_csv #13293
low_memory : boolean, default ``True``
    Internally process the file in chunks, resulting in lower memory use
    while parsing, but possibly mixed type inference. To ensure no mixed
    types either set False, or specify the type with the `dtype` parameter.
double back-ticks on False
dtype as well
Current coverage is 84.20%
@@            master   #13293   diff @@
==========================================
  Files        138      138
  Lines      50587    50587
  Methods        0        0
  Messages       0        0
  Branches       0        0
==========================================
  Hits       42593    42593
  Misses      7994     7994
  Partials       0        0
@jreback - updated
thanks!
@@ -169,6 +169,13 @@ skipfooter : int, default ``0``
    Number of lines at bottom of file to skip (unsupported with engine='c').
nrows : int, default ``None``
    Number of rows of file to read. Useful for reading pieces of large files.
low_memory : boolean, default ``True``
    Internally process the file in chunks, resulting in lower memory use
    while parsing, but possibly mixed type inference. To ensure no mixed
What does 'resulting in [...] mixed type inference' mean? Is it implementation-specific which type is chosen? Is it deterministic?
It is deterministic - types are consistently inferred based on what's in the data. That said, the internal chunksize is not a fixed number of rows, but a number of bytes, so whether you get a mixed dtype warning or not can feel a bit random.
import pandas as pd
from io import StringIO
# just ints - no warning
pd.read_csv(StringIO(
'\n'.join(str(x) for x in range(10000)))
).dtypes
Out[24]:
0 int64
dtype: object
# mixed dtype - but fits into a single chunk, so no warning and all parsed as strings
pd.read_csv(StringIO(
'\n'.join([str(x) for x in range(10000)] + ['a string']))
).dtypes
Out[26]:
0 object
dtype: object
# mixed dtype - doesn't fit into a single chunk
pd.read_csv(StringIO(
'\n'.join([str(x) for x in range(1000000)] + ['a string']))
).dtypes
DtypeWarning: Columns (0) have mixed types.
Specify dtype option on import or set low_memory=False.
Out[27]:
0 object
dtype: object
I see. Thank you for the explanation and the example!
To complete the example, I see:
df=pd.read_csv(StringIO('\n'.join([str(x) for x in range(1000000)] + ['a string'])))
DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
type(df.loc[524287,'0'])
Out[50]: int
type(df.loc[524288,'0'])
Out[51]: str
The first part of the CSV data contained only integers, so it was converted to int;
the second part also contained a string, so every entry in that part was kept as a string.
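For completeness, here is a minimal sketch of the two remedies the docstring names, applied to the same data as above. `value_types` is a hypothetical helper introduced only for this illustration; it is not part of pandas.

```python
from io import StringIO

import pandas as pd

# Same data as in the thread: a header line "0", the integers 1..999999,
# and one trailing non-numeric value.
csv_text = '\n'.join([str(x) for x in range(1000000)] + ['a string'])


def value_types(df):
    """Hypothetical helper: set of Python types found in the first column."""
    return set(map(type, df.iloc[:, 0]))


# Option 1: parse the file in one pass instead of in internal chunks.
df_one_pass = pd.read_csv(StringIO(csv_text), low_memory=False)

# Option 2: state the column type up front so no inference is needed.
df_explicit = pd.read_csv(StringIO(csv_text), dtype=str)

print(value_types(df_one_pass))
print(value_types(df_explicit))
```

With either option the column should hold values of a single Python type, so the int/str split seen at rows 524287/524288 should not appear. The trade-off is that `low_memory=False` gives up the lower peak memory use during parsing, while `dtype` requires knowing the intended type in advance.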
git diff upstream/master | flake8 --diff