@@ -72,123 +72,201 @@ CSV & Text files
----------------

The two workhorse functions for reading text files (a.k.a. flat files) are
- :func:`~pandas.io.parsers.read_csv` and :func:`~pandas.io.parsers.read_table`.
- They both use the same parsing code to intelligently convert tabular
- data into a DataFrame object. See the :ref:`cookbook<cookbook.csv>`
- for some advanced strategies
+ :func:`read_csv` and :func:`read_table`. They both use the same parsing code to
+ intelligently convert tabular data into a DataFrame object. See the
+ :ref:`cookbook<cookbook.csv>` for some advanced strategies.
+
+ Parsing options
+ '''''''''''''''
+
+ :func:`read_csv` and :func:`read_table` accept the following arguments:
+
+ Basic
+ +++++
+
+ filepath_or_buffer : various
+   Either a path to a file (a :class:`python:str`, :class:`python:pathlib.Path`,
+   or :class:`py:py._path.local.LocalPath`), URL (including http, ftp, and S3
+   locations), or any object with a ``read()`` method (such as an open file or
+   :class:`~python:io.StringIO`).
+ sep : str, defaults to ``','`` for :func:`read_csv`, ``\t`` for :func:`read_table`
+   Delimiter to use. If sep is ``None``, will try to automatically determine
+   this. Regular expressions are accepted; using a regular expression will force
+   use of the Python parsing engine and will ignore quotes in the data.
+ delimiter : str, default ``None``
+   Alternative argument name for sep.
+
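As a minimal sketch of the basic options above (the inline data is invented for
illustration; ``StringIO`` stands in for a real file):

.. code-block:: python

   from io import StringIO
   import pandas as pd

   data = 'a|b|c\n1|2|3\n4|5|6'
   # any object with a read() method works as filepath_or_buffer
   df = pd.read_csv(StringIO(data), sep='|')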
+ Column and Index Locations and Names
+ ++++++++++++++++++++++++++++++++++++
+
+ header : int or list of ints, default ``'infer'``
+   Row number(s) to use as the column names, and the start of the data. Default
+   behavior is as if ``header=0`` if no ``names`` passed, otherwise as if
+   ``header=None``. Explicitly pass ``header=0`` to be able to replace existing
+   names. The header can be a list of ints that specify row locations for a
+   multi-index on the columns e.g. ``[0,1,3]``. Intervening rows that are not
+   specified will be skipped (e.g. 2 in this example is skipped). Note that
+   this parameter ignores commented lines and empty lines if
+   ``skip_blank_lines=True``, so header=0 denotes the first line of data
+   rather than the first line of the file.
+ names : array-like, default ``None``
+   List of column names to use. If file contains no header row, then you should
+   explicitly pass ``header=None``.
+ index_col : int or sequence or ``False``, default ``None``
+   Column to use as the row labels of the DataFrame. If a sequence is given, a
+   MultiIndex is used. If you have a malformed file with delimiters at the end of
+   each line, you might consider ``index_col=False`` to force pandas to *not* use
+   the first column as the index (row names).
+ usecols : array-like, default ``None``
+   Return a subset of the columns. Results in much faster parsing time and lower
+   memory usage.
+ squeeze : boolean, default ``False``
+   If the parsed data only contains one column then return a Series.
+ prefix : str, default ``None``
+   Prefix to add to column numbers when no header, e.g. 'X' for X0, X1, ...
+ mangle_dupe_cols : boolean, default ``True``
+   Duplicate columns will be specified as 'X.0'...'X.N', rather than 'X'...'X'.
+
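A short sketch of how these location options combine (data invented for
illustration): ``usecols`` restricts the parsed columns and ``index_col`` picks
the row labels from that subset.

.. code-block:: python

   from io import StringIO
   import pandas as pd

   data = 'a,b,c\n1,2,3\n4,5,6'
   # keep only columns 'a' and 'b', using 'a' as the index
   df = pd.read_csv(StringIO(data), usecols=['a', 'b'], index_col=0)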
+ General Parsing Configuration
+ +++++++++++++++++++++++++++++
+
+ dtype : Type name or dict of column -> type, default ``None``
+   Data type for data or columns. E.g. ``{'a': np.float64, 'b': np.int32}``
+   (unsupported with ``engine='python'``). Use `str` or `object` to preserve and
+   not interpret dtype.
+ engine : {``'c'``, ``'python'``}
+   Parser engine to use. The C engine is faster while the Python engine is
+   currently more feature-complete.
+ converters : dict, default ``None``
+   Dict of functions for converting values in certain columns. Keys can either be
+   integers or column labels.
+ true_values : list, default ``None``
+   Values to consider as ``True``.
+ false_values : list, default ``None``
+   Values to consider as ``False``.
+ skipinitialspace : boolean, default ``False``
+   Skip spaces after delimiter.
+ skiprows : list-like or integer, default ``None``
+   Line numbers to skip (0-indexed) or number of lines to skip (int) at the start
+   of the file.
+ skipfooter : int, default ``0``
+   Number of lines at bottom of file to skip (unsupported with engine='c').
+ nrows : int, default ``None``
+   Number of rows of file to read. Useful for reading pieces of large files.
+
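For instance, a minimal sketch combining ``dtype`` and ``converters`` (the
column names and data are invented):

.. code-block:: python

   from io import StringIO
   import pandas as pd

   data = 'a,b\n1,x\n2,y'
   # force 'a' to float64 and post-process 'b' with a converter
   df = pd.read_csv(StringIO(data), dtype={'a': 'float64'},
                    converters={'b': str.upper})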
+ NA and Missing Data Handling
+ ++++++++++++++++++++++++++++
+
+ na_values : str, list-like or dict, default ``None``
+   Additional strings to recognize as NA/NaN. If dict passed, specific per-column
+   NA values. By default the following values are interpreted as NaN:
+   ``'-1.#IND', '1.#QNAN', '1.#IND', '-1.#QNAN', '#N/A N/A', '#N/A', 'N/A', 'NA',
+   '#NA', 'NULL', 'NaN', '-NaN', 'nan', '-nan', ''``.
+ keep_default_na : boolean, default ``True``
+   If na_values are specified and keep_default_na is ``False`` the default NaN
+   values are overridden, otherwise they're appended to.
+ na_filter : boolean, default ``True``
+   Detect missing value markers (empty strings and the value of na_values). In
+   data without any NAs, passing ``na_filter=False`` can improve the performance
+   of reading a large file.
+ verbose : boolean, default ``False``
+   Indicate number of NA values placed in non-numeric columns.
+ skip_blank_lines : boolean, default ``True``
+   If ``True``, skip over blank lines rather than interpreting as NaN values.
+
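For example, a minimal sketch of per-column NA handling (the sentinel string
here is invented):

.. code-block:: python

   from io import StringIO
   import pandas as pd

   data = 'a,b\n1,MISSING\n2,3'
   # recognize 'MISSING' as NaN, but only in column 'b'
   df = pd.read_csv(StringIO(data), na_values={'b': ['MISSING']})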
+ Datetime Handling
+ +++++++++++++++++
+
+ parse_dates : boolean or list of ints or names or list of lists or dict, default ``False``.
+   - If ``True`` -> try parsing the index.
+   - If ``[1, 2, 3]`` -> try parsing columns 1, 2, 3 each as a separate date
+     column.
+   - If ``[[1, 3]]`` -> combine columns 1 and 3 and parse as a single date
+     column.
+   - If ``{'foo' : [1, 3]}`` -> parse columns 1, 3 as date and call result 'foo'.
+   A fast-path exists for iso8601-formatted dates.
+ infer_datetime_format : boolean, default ``False``
+   If ``True`` and parse_dates is enabled for a column, attempt to infer the
+   datetime format to speed up the processing.
+ keep_date_col : boolean, default ``False``
+   If ``True`` and parse_dates specifies combining multiple columns then keep the
+   original columns.
+ date_parser : function, default ``None``
+   Function to use for converting a sequence of string columns to an array of
+   datetime instances. The default uses ``dateutil.parser.parser`` to do the
+   conversion. Pandas will try to call date_parser in three different ways,
+   advancing to the next if an exception occurs: 1) Pass one or more arrays (as
+   defined by parse_dates) as arguments; 2) concatenate (row-wise) the string
+   values from the columns defined by parse_dates into a single array and pass
+   that; and 3) call date_parser once for each row using one or more strings
+   (corresponding to the columns defined by parse_dates) as arguments.
+ dayfirst : boolean, default ``False``
+   DD/MM format dates, international and European format.
+
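A minimal sketch of the column-combining form of ``parse_dates`` (the data is
invented):

.. code-block:: python

   from io import StringIO
   import pandas as pd

   data = 'date,time,value\n2009-01-01,10:00,1\n2009-01-02,11:00,2'
   # combine 'date' and 'time' into a single column named 'datetime',
   # keeping the source columns as well
   df = pd.read_csv(StringIO(data),
                    parse_dates={'datetime': ['date', 'time']},
                    keep_date_col=True)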
+ Iteration
+ +++++++++
+
+ iterator : boolean, default ``False``
+   Return `TextFileReader` object for iteration or getting chunks with
+   ``get_chunk()``.
+ chunksize : int, default ``None``
+   Return `TextFileReader` object for iteration. See :ref:`iterating and chunking
+   <io.chunking>` below.
+
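A minimal chunking sketch (data invented); each chunk is itself a DataFrame:

.. code-block:: python

   from io import StringIO
   import pandas as pd

   data = 'a,b\n' + '\n'.join('%d,%d' % (i, 2 * i) for i in range(10))
   # stream the input two rows at a time instead of loading it whole
   for chunk in pd.read_csv(StringIO(data), chunksize=2):
       print(chunk.shape)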
+ Quoting, Compression, and File Format
+ +++++++++++++++++++++++++++++++++++++
+
+ compression : {``'infer'``, ``'gzip'``, ``'bz2'``, ``None``}, default ``'infer'``
+   For on-the-fly decompression of on-disk data. If 'infer', then use gzip or bz2
+   if filepath_or_buffer is a string ending in '.gz' or '.bz2', respectively, and
+   no decompression otherwise. Set to ``None`` for no decompression.
+ thousands : str, default ``None``
+   Thousands separator.
+ decimal : str, default ``'.'``
+   Character to recognize as decimal point. E.g. use ``','`` for European data.
+ lineterminator : str (length 1), default ``None``
+   Character to break file into lines. Only valid with C parser.
+ quotechar : str (length 1)
+   The character used to denote the start and end of a quoted item. Quoted items
+   can include the delimiter and it will be ignored.
+ quoting : int or ``csv.QUOTE_*`` instance, default ``None``
+   Control field quoting behavior per ``csv.QUOTE_*`` constants. Use one of
+   ``QUOTE_MINIMAL`` (0), ``QUOTE_ALL`` (1), ``QUOTE_NONNUMERIC`` (2) or
+   ``QUOTE_NONE`` (3). Default (``None``) results in ``QUOTE_MINIMAL``
+   behavior.
+ escapechar : str (length 1), default ``None``
+   One-character string used to escape delimiter when quoting is ``QUOTE_NONE``.
+ comment : str, default ``None``
+   Indicates remainder of line should not be parsed. If found at the beginning of
+   a line, the line will be ignored altogether. This parameter must be a single
+   character. Like empty lines (as long as ``skip_blank_lines=True``), fully
+   commented lines are ignored by the parameter `header` but not by `skiprows`.
+   For example, if ``comment='#'``, parsing '#empty\\na,b,c\\n1,2,3' with
+   `header=0` will result in 'a,b,c' being treated as the header.
+ encoding : str, default ``None``
+   Encoding to use for UTF when reading/writing (e.g. ``'utf-8'``). `List of
+   Python standard encodings
+   <https://docs.python.org/3/library/codecs.html#standard-encodings>`_.
+ dialect : str or :class:`python:csv.Dialect` instance, default ``None``
+   If ``None`` defaults to Excel dialect. Ignored if sep longer than 1 char. See
+   :class:`python:csv.Dialect` documentation for more details.
+ tupleize_cols : boolean, default ``False``
+   Leave a list of tuples on columns as is (default is to convert to a MultiIndex
+   on the columns).
+
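A minimal sketch of ``comment`` and ``thousands`` working together (data
invented):

.. code-block:: python

   from io import StringIO
   import pandas as pd

   data = '# generated file\nname,amount\nfoo,"1,234"\nbar,5'
   # '#' lines are dropped before the header is located; the thousands
   # separator lets the quoted "1,234" parse as the integer 1234
   df = pd.read_csv(StringIO(data), comment='#', thousands=',')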
+ Error Handling
+ ++++++++++++++

- They can take a number of arguments:
-
- - ``filepath_or_buffer``: Either a path to a file (a :class:`python:str`,
-   :class:`python:pathlib.Path`, or :class:`py:py._path.local.LocalPath`), URL
-   (including http, ftp, and S3 locations), or any object with a ``read``
-   method (such as an open file or :class:`~python:io.StringIO`).
- - ``sep`` or ``delimiter``: A delimiter / separator to split fields
-   on. With ``sep=None``, ``read_csv`` will try to infer the delimiter
-   automatically in some cases by "sniffing".
-   The separator may be specified as a regular expression; for instance
-   you may use '\|\\s*' to indicate a pipe plus arbitrary whitespace, but ignores quotes in the data when a regex is used in separator.
- - ``delim_whitespace``: Parse whitespace-delimited (spaces or tabs) file
-   (much faster than using a regular expression)
- - ``compression``: decompress ``'gzip'`` and ``'bz2'`` formats on the fly.
-   Set to ``'infer'`` (the default) to guess a format based on the file
-   extension.
- - ``dialect``: string or :class:`python:csv.Dialect` instance to expose more
-   ways to specify the file format
- - ``dtype``: A data type name or a dict of column name to data type. If not
-   specified, data types will be inferred. (Unsupported with
-   ``engine='python'``)
- - ``header``: row number(s) to use as the column names, and the start of the
-   data. Defaults to 0 if no ``names`` passed, otherwise ``None``. Explicitly
-   pass ``header=0`` to be able to replace existing names. The header can be
-   a list of integers that specify row locations for a multi-index on the columns
-   E.g. [0,1,3]. Intervening rows that are not specified will be
-   skipped (e.g. 2 in this example are skipped). Note that this parameter
-   ignores commented lines and empty lines if ``skip_blank_lines=True`` (the default),
-   so header=0 denotes the first line of data rather than the first line of the file.
- - ``skip_blank_lines``: whether to skip over blank lines rather than interpreting
-   them as NaN values
- - ``skiprows``: A collection of numbers for rows in the file to skip. Can
-   also be an integer to skip the first ``n`` rows
- - ``index_col``: column number, column name, or list of column numbers/names,
-   to use as the ``index`` (row labels) of the resulting DataFrame. By default,
-   it will number the rows without using any column, unless there is one more
-   data column than there are headers, in which case the first column is taken
-   as the index.
- - ``names``: List of column names to use as column names. To replace header
-   existing in file, explicitly pass ``header=0``.
- - ``na_values``: optional string or list of strings to recognize as NaN (missing
-   values), either in addition to or in lieu of the default set.
- - ``true_values``: list of strings to recognize as ``True``
- - ``false_values``: list of strings to recognize as ``False``
- - ``keep_default_na``: whether to include the default set of missing values
-   in addition to the ones specified in ``na_values``
- - ``parse_dates``: if True then index will be parsed as dates
-   (False by default). You can specify more complicated options to parse
-   a subset of columns or a combination of columns into a single date column
-   (list of ints or names, list of lists, or dict)
-   [1, 2, 3] -> try parsing columns 1, 2, 3 each as a separate date column
-   [[1, 3]] -> combine columns 1 and 3 and parse as a single date column
-   {'foo' : [1, 3]} -> parse columns 1, 3 as date and call result 'foo'
- - ``keep_date_col``: if True, then date component columns passed into
-   ``parse_dates`` will be retained in the output (False by default).
- - ``date_parser``: function to use to parse strings into datetime
-   objects. If ``parse_dates`` is True, it defaults to the very robust
-   ``dateutil.parser``. Specifying this implicitly sets ``parse_dates`` as True.
-   You can also use functions from community supported date converters from
-   date_converters.py
- - ``dayfirst``: if True then uses the DD/MM international/European date format
-   (This is False by default)
- - ``thousands``: specifies the thousands separator. If not None, this character will
-   be stripped from numeric dtypes. However, if it is the first character in a field,
-   that column will be imported as a string. In the PythonParser, if not None,
-   then parser will try to look for it in the output and parse relevant data to numeric
-   dtypes. Because it has to essentially scan through the data again, this causes a
-   significant performance hit so only use if necessary.
- - ``lineterminator`` : string (length 1), default ``None``, Character to break file into lines. Only valid with C parser
- - ``quotechar`` : string, The character to used to denote the start and end of a quoted item.
-   Quoted items can include the delimiter and it will be ignored.
- - ``quoting`` : int,
-   Controls whether quotes should be recognized. Values are taken from `csv.QUOTE_*` values.
-   Acceptable values are 0, 1, 2, and 3 for QUOTE_MINIMAL, QUOTE_ALL,
-   QUOTE_NONNUMERIC and QUOTE_NONE, respectively.
- - ``skipinitialspace`` : boolean, default ``False``, Skip spaces after delimiter
- - ``escapechar`` : string, to specify how to escape quoted data
- - ``comment``: Indicates remainder of line should not be parsed. If found at the
-   beginning of a line, the line will be ignored altogether. This parameter
-   must be a single character. Like empty lines, fully commented lines
-   are ignored by the parameter `header` but not by `skiprows`. For example,
-   if comment='#', parsing '#empty\n1,2,3\na,b,c' with `header=0` will
-   result in '1,2,3' being treated as the header.
- - ``nrows``: Number of rows to read out of the file. Useful to only read a
-   small portion of a large file
- - ``iterator``: If True, return a ``TextFileReader`` to enable reading a file
-   into memory piece by piece
- - ``chunksize``: An number of rows to be used to "chunk" a file into
-   pieces. Will cause an ``TextFileReader`` object to be returned. More on this
-   below in the section on :ref:`iterating and chunking <io.chunking>`
- - ``skip_footer``: number of lines to skip at bottom of file (default 0)
-   (Unsupported with ``engine='c'``)
- - ``converters``: a dictionary of functions for converting values in certain
-   columns, where keys are either integers or column labels
- - ``encoding``: a string representing the encoding to use for decoding
-   unicode data, e.g. ``'utf-8'`` or ``'latin-1'``. `Full list of Python
-   standard encodings
-   <https://docs.python.org/3/library/codecs.html#standard-encodings>`_
- - ``verbose``: show number of NA values inserted in non-numeric columns
- - ``squeeze``: if True then output with only one column is turned into Series
- - ``error_bad_lines``: if False then any lines causing an error will be skipped :ref:`bad lines <io.bad_lines>`
- - ``usecols``: a subset of columns to return, results in much faster parsing
-   time and lower memory usage.
- - ``mangle_dupe_cols``: boolean, default True, then duplicate columns will be specified
-   as 'X.0'...'X.N', rather than 'X'...'X'
- - ``tupleize_cols``: boolean, default False, if False, convert a list of tuples
-   to a multi-index of columns, otherwise, leave the column index as a list of
-   tuples
- - ``float_precision`` : string, default None. Specifies which converter the C
-   engine should use for floating-point values. The options are None for the
-   ordinary converter, 'high' for the high-precision converter, and
-   'round_trip' for the round-trip converter.
+ error_bad_lines : boolean, default ``True``
+   Lines with too many fields (e.g. a csv line with too many commas) will by
+   default cause an exception to be raised, and no DataFrame will be returned. If
+   ``False``, then these "bad lines" will be dropped from the DataFrame that is
+   returned (only valid with C parser). See :ref:`bad lines <io.bad_lines>`
+   below.
+ warn_bad_lines : boolean, default ``True``
+   If error_bad_lines is ``False``, and warn_bad_lines is ``True``, a warning for
+   each "bad line" will be output (only valid with C parser).
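A minimal sketch (data invented): the second data row below has one field too
many, and is dropped with a warning rather than raising.

.. code-block:: python

   from io import StringIO
   import pandas as pd

   data = 'a,b,c\n1,2,3\n4,5,6,7\n8,9,10'
   # bad line '4,5,6,7' is skipped; a warning is emitted instead
   df = pd.read_csv(StringIO(data), error_bad_lines=False,
                    warn_bad_lines=True)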

.. ipython:: python
   :suppress:
@@ -500,11 +578,10 @@ Date Handling
Specifying Date Columns
+++++++++++++++++++++++

- To better facilitate working with datetime data,
- :func:`~pandas.io.parsers.read_csv` and :func:`~pandas.io.parsers.read_table`
- uses the keyword arguments ``parse_dates`` and ``date_parser`` to allow users
- to specify a variety of columns and date/time formats to turn the input text
- data into ``datetime`` objects.
+ To better facilitate working with datetime data, :func:`read_csv` and
+ :func:`read_table` use the keyword arguments ``parse_dates`` and ``date_parser``
+ to allow users to specify a variety of columns and date/time formats to turn the
+ input text data into ``datetime`` objects.

The simplest case is to just pass in ``parse_dates=True``:
@@ -929,10 +1006,9 @@ should pass the ``escapechar`` option:
Files with Fixed Width Columns
''''''''''''''''''''''''''''''

- While ``read_csv`` reads delimited data, the :func:`~pandas.io.parsers.read_fwf`
- function works with data files that have known and fixed column widths.
- The function parameters to ``read_fwf`` are largely the same as `read_csv` with
- two extra parameters:
+ While ``read_csv`` reads delimited data, the :func:`read_fwf` function works
+ with data files that have known and fixed column widths. The function parameters
+ to ``read_fwf`` are largely the same as `read_csv` with two extra parameters:

- ``colspecs``: A list of pairs (tuples) giving the extents of the
  fixed-width fields of each line as half-open intervals (i.e., [from, to[ ).
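A minimal ``read_fwf`` sketch (the data and column names are invented):

.. code-block:: python

   from io import StringIO
   import pandas as pd

   data = 'id8141  360.242940\nid1594  444.953632'
   # each colspec is a half-open [from, to) slice of character positions
   df = pd.read_fwf(StringIO(data), colspecs=[(0, 6), (8, 20)],
                    names=['id', 'value'])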