
Commit a424bb2

frankcleary authored and jreback committed

DOC: pd.read_csv doc-string clarification #11555

closes #11555

Updated IO Tools documentation for read_csv() and read_table() to be
consistent with the doc-string, and reordered keywords to group them more
logically. Also updated merging.rst docs for concat.

Author: Frank Cleary <[email protected]>

Closes #12256 from frankcleary/gh11555-read_csv-docs and squashes the
following commits:

20161d9 [Frank Cleary] DOC: pd.read_csv doc-string clarification #11555
1 parent dcc7cca commit a424bb2

File tree

4 files changed, +384 -290 lines changed


doc/source/io.rst (+201, -125)
@@ -72,123 +72,201 @@ CSV & Text files
 ----------------
 
 The two workhorse functions for reading text files (a.k.a. flat files) are
-:func:`~pandas.io.parsers.read_csv` and :func:`~pandas.io.parsers.read_table`.
-They both use the same parsing code to intelligently convert tabular
-data into a DataFrame object. See the :ref:`cookbook<cookbook.csv>`
-for some advanced strategies
+:func:`read_csv` and :func:`read_table`. They both use the same parsing code to
+intelligently convert tabular data into a DataFrame object. See the
+:ref:`cookbook<cookbook.csv>` for some advanced strategies.
+
+Parsing options
+'''''''''''''''
+
+:func:`read_csv` and :func:`read_table` accept the following arguments:
+
+Basic
++++++
+
+filepath_or_buffer : various
+  Either a path to a file (a :class:`python:str`, :class:`python:pathlib.Path`,
+  or :class:`py:py._path.local.LocalPath`), URL (including http, ftp, and S3
+  locations), or any object with a ``read()`` method (such as an open file or
+  :class:`~python:io.StringIO`).
+sep : str, defaults to ``','`` for :func:`read_csv`, ``\t`` for :func:`read_table`
+  Delimiter to use. If sep is ``None``,
+  will try to automatically determine this. Regular expressions are accepted;
+  use of a regular expression will force use of the python parsing engine and
+  will ignore quotes in the data.
+delimiter : str, default ``None``
+  Alternative argument name for sep.
+
+Column and Index Locations and Names
+++++++++++++++++++++++++++++++++++++
+
+header : int or list of ints, default ``'infer'``
+  Row number(s) to use as the column names, and the start of the data. Default
+  behavior is as if ``header=0`` if no ``names`` passed, otherwise as if
+  ``header=None``. Explicitly pass ``header=0`` to be able to replace existing
+  names. The header can be a list of ints that specify row locations for a
+  multi-index on the columns e.g. ``[0,1,3]``. Intervening rows that are not
+  specified will be skipped (e.g. 2 in this example is skipped). Note that
+  this parameter ignores commented lines and empty lines if
+  ``skip_blank_lines=True``, so header=0 denotes the first line of data
+  rather than the first line of the file.
+names : array-like, default ``None``
+  List of column names to use. If file contains no header row, then you should
+  explicitly pass ``header=None``.
+index_col : int or sequence or ``False``, default ``None``
+  Column to use as the row labels of the DataFrame. If a sequence is given, a
+  MultiIndex is used. If you have a malformed file with delimiters at the end of
+  each line, you might consider ``index_col=False`` to force pandas to *not* use
+  the first column as the index (row names).
+usecols : array-like, default ``None``
+  Return a subset of the columns. Results in much faster parsing time and lower
+  memory usage.
+squeeze : boolean, default ``False``
+  If the parsed data only contains one column then return a Series.
+prefix : str, default ``None``
+  Prefix to add to column numbers when no header, e.g. 'X' for X0, X1, ...
+mangle_dupe_cols : boolean, default ``True``
+  Duplicate columns will be specified as 'X.0'...'X.N', rather than 'X'...'X'.
+
+General Parsing Configuration
++++++++++++++++++++++++++++++
+
+dtype : Type name or dict of column -> type, default ``None``
+  Data type for data or columns. E.g. ``{'a': np.float64, 'b': np.int32}``
+  (unsupported with ``engine='python'``). Use `str` or `object` to preserve and
+  not interpret dtype.
+engine : {``'c'``, ``'python'``}
+  Parser engine to use. The C engine is faster while the python engine is
+  currently more feature-complete.
+converters : dict, default ``None``
+  Dict of functions for converting values in certain columns. Keys can either be
+  integers or column labels.
+true_values : list, default ``None``
+  Values to consider as ``True``.
+false_values : list, default ``None``
+  Values to consider as ``False``.
+skipinitialspace : boolean, default ``False``
+  Skip spaces after delimiter.
+skiprows : list-like or integer, default ``None``
+  Line numbers to skip (0-indexed) or number of lines to skip (int) at the start
+  of the file.
+skipfooter : int, default ``0``
+  Number of lines at bottom of file to skip (unsupported with engine='c').
+nrows : int, default ``None``
+  Number of rows of file to read. Useful for reading pieces of large files.
+
+NA and Missing Data Handling
+++++++++++++++++++++++++++++
+
+na_values : str, list-like or dict, default ``None``
+  Additional strings to recognize as NA/NaN. If dict passed, specific per-column
+  NA values. By default the following values are interpreted as NaN:
+  ``'-1.#IND', '1.#QNAN', '1.#IND', '-1.#QNAN', '#N/A N/A', '#N/A', 'N/A', 'NA',
+  '#NA', 'NULL', 'NaN', '-NaN', 'nan', '-nan', ''``.
+keep_default_na : boolean, default ``True``
+  If na_values are specified and keep_default_na is ``False`` the default NaN
+  values are overridden, otherwise they're appended to.
+na_filter : boolean, default ``True``
+  Detect missing value markers (empty strings and the value of na_values). In
+  data without any NAs, passing ``na_filter=False`` can improve the performance
+  of reading a large file.
+verbose : boolean, default ``False``
+  Indicate number of NA values placed in non-numeric columns.
+skip_blank_lines : boolean, default ``True``
+  If ``True``, skip over blank lines rather than interpreting as NaN values.
+
+Datetime Handling
++++++++++++++++++
+
+parse_dates : boolean or list of ints or names or list of lists or dict, default ``False``.
+  - If ``True`` -> try parsing the index.
+  - If ``[1, 2, 3]`` -> try parsing columns 1, 2, 3 each as a separate date
+    column.
+  - If ``[[1, 3]]`` -> combine columns 1 and 3 and parse as a single date
+    column.
+  - If ``{'foo' : [1, 3]}`` -> parse columns 1, 3 as date and call result 'foo'.
+  A fast-path exists for iso8601-formatted dates.
+infer_datetime_format : boolean, default ``False``
+  If ``True`` and parse_dates is enabled for a column, attempt to infer the
+  datetime format to speed up the processing.
+keep_date_col : boolean, default ``False``
+  If ``True`` and parse_dates specifies combining multiple columns then keep the
+  original columns.
+date_parser : function, default ``None``
+  Function to use for converting a sequence of string columns to an array of
+  datetime instances. The default uses ``dateutil.parser.parser`` to do the
+  conversion. Pandas will try to call date_parser in three different ways,
+  advancing to the next if an exception occurs: 1) Pass one or more arrays (as
+  defined by parse_dates) as arguments; 2) concatenate (row-wise) the string
+  values from the columns defined by parse_dates into a single array and pass
+  that; and 3) call date_parser once for each row using one or more strings
+  (corresponding to the columns defined by parse_dates) as arguments.
+dayfirst : boolean, default ``False``
+  DD/MM format dates, international and European format.
+
+Iteration
++++++++++
+
+iterator : boolean, default ``False``
+  Return `TextFileReader` object for iteration or getting chunks with
+  ``get_chunk()``.
+chunksize : int, default ``None``
+  Return `TextFileReader` object for iteration. See :ref:`iterating and chunking
+  <io.chunking>` below.
+
+Quoting, Compression, and File Format
++++++++++++++++++++++++++++++++++++++
+
+compression : {``'infer'``, ``'gzip'``, ``'bz2'``, ``None``}, default ``'infer'``
+  For on-the-fly decompression of on-disk data. If 'infer', then use gzip or bz2
+  if filepath_or_buffer is a string ending in '.gz' or '.bz2', respectively, and
+  no decompression otherwise. Set to ``None`` for no decompression.
+thousands : str, default ``None``
+  Thousands separator.
+decimal : str, default ``'.'``
+  Character to recognize as decimal point. E.g. use ``','`` for European data.
+lineterminator : str (length 1), default ``None``
+  Character to break file into lines. Only valid with C parser.
+quotechar : str (length 1)
+  The character used to denote the start and end of a quoted item. Quoted items
+  can include the delimiter and it will be ignored.
+quoting : int or ``csv.QUOTE_*`` instance, default ``None``
+  Control field quoting behavior per ``csv.QUOTE_*`` constants. Use one of
+  ``QUOTE_MINIMAL`` (0), ``QUOTE_ALL`` (1), ``QUOTE_NONNUMERIC`` (2) or
+  ``QUOTE_NONE`` (3). Default (``None``) results in ``QUOTE_MINIMAL``
+  behavior.
+escapechar : str (length 1), default ``None``
+  One-character string used to escape delimiter when quoting is ``QUOTE_NONE``.
+comment : str, default ``None``
+  Indicates remainder of line should not be parsed. If found at the beginning of
+  a line, the line will be ignored altogether. This parameter must be a single
+  character. Like empty lines (as long as ``skip_blank_lines=True``), fully
+  commented lines are ignored by the parameter `header` but not by `skiprows`.
+  For example, if ``comment='#'``, parsing '#empty\\na,b,c\\n1,2,3' with
+  `header=0` will result in 'a,b,c' being treated as the header.
+encoding : str, default ``None``
+  Encoding to use for UTF when reading/writing (e.g. ``'utf-8'``). `List of
+  Python standard encodings
+  <https://docs.python.org/3/library/codecs.html#standard-encodings>`_.
+dialect : str or :class:`python:csv.Dialect` instance, default ``None``
+  If ``None`` defaults to Excel dialect. Ignored if sep longer than 1 char. See
+  :class:`python:csv.Dialect` documentation for more details.
+tupleize_cols : boolean, default ``False``
+  Leave a list of tuples on columns as is (default is to convert to a MultiIndex
+  on the columns).
+
+Error Handling
+++++++++++++++
 
-They can take a number of arguments:
-
-- ``filepath_or_buffer``: Either a path to a file (a :class:`python:str`,
-  :class:`python:pathlib.Path`, or :class:`py:py._path.local.LocalPath`), URL
-  (including http, ftp, and S3 locations), or any object with a ``read``
-  method (such as an open file or :class:`~python:io.StringIO`).
-- ``sep`` or ``delimiter``: A delimiter / separator to split fields
-  on. With ``sep=None``, ``read_csv`` will try to infer the delimiter
-  automatically in some cases by "sniffing".
-  The separator may be specified as a regular expression; for instance
-  you may use '\|\\s*' to indicate a pipe plus arbitrary whitespace, but ignores quotes in the data when a regex is used in separator.
-- ``delim_whitespace``: Parse whitespace-delimited (spaces or tabs) file
-  (much faster than using a regular expression)
-- ``compression``: decompress ``'gzip'`` and ``'bz2'`` formats on the fly.
-  Set to ``'infer'`` (the default) to guess a format based on the file
-  extension.
-- ``dialect``: string or :class:`python:csv.Dialect` instance to expose more
-  ways to specify the file format
-- ``dtype``: A data type name or a dict of column name to data type. If not
-  specified, data types will be inferred. (Unsupported with
-  ``engine='python'``)
-- ``header``: row number(s) to use as the column names, and the start of the
-  data. Defaults to 0 if no ``names`` passed, otherwise ``None``. Explicitly
-  pass ``header=0`` to be able to replace existing names. The header can be
-  a list of integers that specify row locations for a multi-index on the columns
-  E.g. [0,1,3]. Intervening rows that are not specified will be
-  skipped (e.g. 2 in this example are skipped). Note that this parameter
-  ignores commented lines and empty lines if ``skip_blank_lines=True`` (the default),
-  so header=0 denotes the first line of data rather than the first line of the file.
-- ``skip_blank_lines``: whether to skip over blank lines rather than interpreting
-  them as NaN values
-- ``skiprows``: A collection of numbers for rows in the file to skip. Can
-  also be an integer to skip the first ``n`` rows
-- ``index_col``: column number, column name, or list of column numbers/names,
-  to use as the ``index`` (row labels) of the resulting DataFrame. By default,
-  it will number the rows without using any column, unless there is one more
-  data column than there are headers, in which case the first column is taken
-  as the index.
-- ``names``: List of column names to use as column names. To replace header
-  existing in file, explicitly pass ``header=0``.
-- ``na_values``: optional string or list of strings to recognize as NaN (missing
-  values), either in addition to or in lieu of the default set.
-- ``true_values``: list of strings to recognize as ``True``
-- ``false_values``: list of strings to recognize as ``False``
-- ``keep_default_na``: whether to include the default set of missing values
-  in addition to the ones specified in ``na_values``
-- ``parse_dates``: if True then index will be parsed as dates
-  (False by default). You can specify more complicated options to parse
-  a subset of columns or a combination of columns into a single date column
-  (list of ints or names, list of lists, or dict)
-  [1, 2, 3] -> try parsing columns 1, 2, 3 each as a separate date column
-  [[1, 3]] -> combine columns 1 and 3 and parse as a single date column
-  {'foo' : [1, 3]} -> parse columns 1, 3 as date and call result 'foo'
-- ``keep_date_col``: if True, then date component columns passed into
-  ``parse_dates`` will be retained in the output (False by default).
-- ``date_parser``: function to use to parse strings into datetime
-  objects. If ``parse_dates`` is True, it defaults to the very robust
-  ``dateutil.parser``. Specifying this implicitly sets ``parse_dates`` as True.
-  You can also use functions from community supported date converters from
-  date_converters.py
-- ``dayfirst``: if True then uses the DD/MM international/European date format
-  (This is False by default)
-- ``thousands``: specifies the thousands separator. If not None, this character will
-  be stripped from numeric dtypes. However, if it is the first character in a field,
-  that column will be imported as a string. In the PythonParser, if not None,
-  then parser will try to look for it in the output and parse relevant data to numeric
-  dtypes. Because it has to essentially scan through the data again, this causes a
-  significant performance hit so only use if necessary.
-- ``lineterminator`` : string (length 1), default ``None``, Character to break file into lines. Only valid with C parser
-- ``quotechar`` : string, The character to used to denote the start and end of a quoted item.
-  Quoted items can include the delimiter and it will be ignored.
-- ``quoting`` : int,
-  Controls whether quotes should be recognized. Values are taken from `csv.QUOTE_*` values.
-  Acceptable values are 0, 1, 2, and 3 for QUOTE_MINIMAL, QUOTE_ALL,
-  QUOTE_NONNUMERIC and QUOTE_NONE, respectively.
-- ``skipinitialspace`` : boolean, default ``False``, Skip spaces after delimiter
-- ``escapechar`` : string, to specify how to escape quoted data
-- ``comment``: Indicates remainder of line should not be parsed. If found at the
-  beginning of a line, the line will be ignored altogether. This parameter
-  must be a single character. Like empty lines, fully commented lines
-  are ignored by the parameter `header` but not by `skiprows`. For example,
-  if comment='#', parsing '#empty\n1,2,3\na,b,c' with `header=0` will
-  result in '1,2,3' being treated as the header.
-- ``nrows``: Number of rows to read out of the file. Useful to only read a
-  small portion of a large file
-- ``iterator``: If True, return a ``TextFileReader`` to enable reading a file
-  into memory piece by piece
-- ``chunksize``: An number of rows to be used to "chunk" a file into
-  pieces. Will cause an ``TextFileReader`` object to be returned. More on this
-  below in the section on :ref:`iterating and chunking <io.chunking>`
-- ``skip_footer``: number of lines to skip at bottom of file (default 0)
-  (Unsupported with ``engine='c'``)
-- ``converters``: a dictionary of functions for converting values in certain
-  columns, where keys are either integers or column labels
-- ``encoding``: a string representing the encoding to use for decoding
-  unicode data, e.g. ``'utf-8``` or ``'latin-1'``. `Full list of Python
-  standard encodings
-  <https://docs.python.org/3/library/codecs.html#standard-encodings>`_
-- ``verbose``: show number of NA values inserted in non-numeric columns
-- ``squeeze``: if True then output with only one column is turned into Series
-- ``error_bad_lines``: if False then any lines causing an error will be skipped :ref:`bad lines <io.bad_lines>`
-- ``usecols``: a subset of columns to return, results in much faster parsing
-  time and lower memory usage.
-- ``mangle_dupe_cols``: boolean, default True, then duplicate columns will be specified
-  as 'X.0'...'X.N', rather than 'X'...'X'
-- ``tupleize_cols``: boolean, default False, if False, convert a list of tuples
-  to a multi-index of columns, otherwise, leave the column index as a list of
-  tuples
-- ``float_precision`` : string, default None. Specifies which converter the C
-  engine should use for floating-point values. The options are None for the
-  ordinary converter, 'high' for the high-precision converter, and
-  'round_trip' for the round-trip converter.
+error_bad_lines : boolean, default ``True``
+  Lines with too many fields (e.g. a csv line with too many commas) will by
+  default cause an exception to be raised, and no DataFrame will be returned. If
+  ``False``, then these "bad lines" will be dropped from the DataFrame that is
+  returned (only valid with C parser). See :ref:`bad lines <io.bad_lines>`
+  below.
+warn_bad_lines : boolean, default ``True``
+  If error_bad_lines is ``False``, and warn_bad_lines is ``True``, a warning for
+  each "bad line" will be output (only valid with C parser).
 
 .. ipython:: python
    :suppress:
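As a quick illustration of several options documented in the hunk above, here is a minimal sketch (not part of the commit); it reads from an in-memory ``StringIO`` instead of a file on disk, and all names and data are invented:

```python
from io import StringIO

import pandas as pd

# Hypothetical in-memory "file" standing in for a path on disk.
data = StringIO(
    "# a commented line, ignored because comment='#'\n"
    "id;name;score\n"
    "1;alice;90\n"
    "2;bob;missing\n"
    "3;carol;85\n"
)

df = pd.read_csv(
    data,
    sep=";",                # non-default delimiter
    comment="#",            # skip the fully commented first line
    index_col="id",         # use the 'id' column as the row labels
    na_values=["missing"],  # extra NaN marker, appended to the defaults
)

print(df)
print(df["score"].isnull().sum())  # 1
```

Note that the commented line does not shift ``header=0``: per the ``comment`` entry above, fully commented lines are ignored when locating the header.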
@@ -500,11 +578,10 @@ Date Handling
 Specifying Date Columns
 +++++++++++++++++++++++
 
-To better facilitate working with datetime data,
-:func:`~pandas.io.parsers.read_csv` and :func:`~pandas.io.parsers.read_table`
-uses the keyword arguments ``parse_dates`` and ``date_parser`` to allow users
-to specify a variety of columns and date/time formats to turn the input text
-data into ``datetime`` objects.
+To better facilitate working with datetime data, :func:`read_csv` and
+:func:`read_table` use the keyword arguments ``parse_dates`` and ``date_parser``
+to allow users to specify a variety of columns and date/time formats to turn the
+input text data into ``datetime`` objects.
 
 The simplest case is to just pass in ``parse_dates=True``:
 
@@ -929,10 +1006,9 @@ should pass the ``escapechar`` option:
 Files with Fixed Width Columns
 ''''''''''''''''''''''''''''''
 
-While ``read_csv`` reads delimited data, the :func:`~pandas.io.parsers.read_fwf`
-function works with data files that have known and fixed column widths.
-The function parameters to ``read_fwf`` are largely the same as `read_csv` with
-two extra parameters:
+While ``read_csv`` reads delimited data, the :func:`read_fwf` function works
+with data files that have known and fixed column widths. The function parameters
+to ``read_fwf`` are largely the same as `read_csv` with two extra parameters:
 
 - ``colspecs``: A list of pairs (tuples) giving the extents of the
   fixed-width fields of each line as half-open intervals (i.e., [from, to[ ).
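A minimal sketch of ``read_fwf`` with ``colspecs`` as described in the hunk above (illustrative only; the data is invented, and each pair is a half-open [from, to) character interval):

```python
from io import StringIO

import pandas as pd

# Invented fixed-width data; the header row is parsed with the same colspecs.
text = (
    "id  name   val\n"
    "1   alpha  1.5\n"
    "2   beta   2.0\n"
)

df = pd.read_fwf(
    StringIO(text),
    colspecs=[(0, 2), (4, 9), (11, 14)],  # extents of the three fields
)
print(df)
```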
