DOC: low_memory in read_csv #13293

Closed
wants to merge 1 commit

Conversation

chris-b1
Contributor

low_memory : boolean, default ``True``
Internally process the file in chunks, resulting in lower memory use
while parsing, but possibly mixed type inference. To ensure no mixed
types either set False, or specify the type with the `dtype` parameter.
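For intuition only (a rough analogy, not the actual implementation - internally the chunking is by bytes, not rows, and is not exposed through chunksize), the low-memory path behaves somewhat like reading the file chunk-wise and concatenating, with each chunk getting its own type inference:

import pandas as pd
from io import StringIO

# made-up data: a column that starts numeric and ends with a string
data = 'x\n' + '\n'.join([str(i) for i in range(10)] + ['a string'])

# each chunk is parsed (and type-inferred) on its own, then stitched together
chunks = pd.read_csv(StringIO(data), chunksize=5)
df = pd.concat(chunks)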
Contributor

double back-ticks on False

Contributor

dtype as well

@jreback added Docs IO CSV read_csv, to_csv labels May 26, 2016
@jreback added this to the 0.18.2 milestone May 26, 2016
@codecov-io

Current coverage is 84.20%

Merging #13293 into master will not change coverage

@@             master     #13293   diff @@
==========================================
  Files           138        138          
  Lines         50587      50587          
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
  Hits          42593      42593          
  Misses         7994       7994          
  Partials          0          0          

Powered by Codecov. Last updated by b4e2d34...282f1f3

@chris-b1
Contributor Author

@jreback - updated

@jreback closed this in 4b05055 May 26, 2016
@jreback
Contributor

jreback commented May 26, 2016

thanks!

@@ -169,6 +169,13 @@ skipfooter : int, default ``0``
Number of lines at bottom of file to skip (unsupported with engine='c').
nrows : int, default ``None``
Number of rows of file to read. Useful for reading pieces of large files.
low_memory : boolean, default ``True``
Internally process the file in chunks, resulting in lower memory use
while parsing, but possibly mixed type inference. To ensure no mixed

What does 'resulting in [...] mixed type inference' mean? Is it implementation-specific which type is chosen? Is it deterministic?

Contributor Author

It is deterministic - types are consistently inferred based on what's in the data. That said, the internal chunksize is a fixed number of bytes rather than a fixed number of rows, so whether or not you get a mixed dtype warning can feel a bit random.

import pandas as pd
from io import StringIO

# just ints - no warning
pd.read_csv(StringIO(
    '\n'.join(str(x) for x in range(10000)))
).dtypes
Out[24]: 
0    int64
dtype: object

# mixed dtype - but fits into a single chunk, so no warning and all parsed as strings
pd.read_csv(StringIO(
    '\n'.join([str(x) for x in range(10000)] + ['a string']))
).dtypes
Out[26]: 
0    object
dtype: object

# mixed dtype - doesn't fit into a single chunk
pd.read_csv(StringIO(
    '\n'.join([str(x) for x in range(1000000)] + ['a string']))
).dtypes
DtypeWarning: Columns (0) have mixed types. 
Specify dtype option on import or set low_memory=False.

Out[27]: 
0    object
dtype: object
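
For completeness, a quick sketch of the two fixes the warning suggests, applied to that last example - both should come back as a single, consistently-inferred object column with no warning:

# read the whole file in one pass instead of in chunks
pd.read_csv(
    StringIO('\n'.join([str(x) for x in range(1000000)] + ['a string'])),
    low_memory=False
).dtypes

# or skip inference entirely by pinning the dtype
pd.read_csv(
    StringIO('\n'.join([str(x) for x in range(1000000)] + ['a string'])),
    dtype=str
).dtypes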

I see. Thank you for the explanation and the example!

To complete the example, I see:

df = pd.read_csv(StringIO('\n'.join([str(x) for x in range(1000000)] + ['a string'])))
DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.

type(df.loc[524287,'0'])
Out[50]: int

type(df.loc[524288,'0'])
Out[51]: str

The first part of the csv data was seen as containing only ints, so it was converted to int;
the second part also contained a string, so all of its entries were kept as strings.
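
(For anyone reproducing this, one quick way to see that split over the whole column - the exact counts depend on where the byte-based chunk boundary happens to fall:)

# count how many entries came back as Python ints vs. strings
df['0'].map(type).value_counts()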

Successfully merging this pull request may close these issues.

API/DOC: status of low_memory kwarg of read_csv/table