@@ -1,6 +1,7 @@
 """
 Module contains tools for processing files into DataFrames or other objects
 """
+
 from __future__ import print_function
 
 from collections import defaultdict
@@ -71,7 +72,7 @@
 By file-like object, we refer to objects with a ``read()`` method, such as
 a file handler (e.g. via builtin ``open`` function) or ``StringIO``.
 %s
-delim_whitespace : boolean, default False
+delim_whitespace : bool, default False
     Specifies whether or not whitespace (e.g. ``' '`` or ``'\t'``) will be
     used as the sep. Equivalent to setting ``sep='\s+'``. If this option
     is set to True, nothing should be passed in for the ``delimiter``
@@ -101,7 +102,7 @@
     Column to use as the row labels of the DataFrame. If a sequence is given, a
     MultiIndex is used. If you have a malformed file with delimiters at the end
    of each line, you might consider index_col=False to force pandas to _not_
-    use the first column as the index (row names)
+    use the first column as the index (row names).
 usecols : list-like or callable, default None
     Return a subset of the columns. If list-like, all elements must either
     be positional (i.e. integer indices into the document columns) or strings
@@ -120,11 +121,11 @@
     example of a valid callable argument would be ``lambda x: x.upper() in
     ['AAA', 'BBB', 'DDD']``. Using this parameter results in much faster
     parsing time and lower memory usage.
-squeeze : boolean, default False
-    If the parsed data only contains one column then return a Series
+squeeze : bool, default False
+    If the parsed data only contains one column then return a Series.
 prefix : str, default None
     Prefix to add to column numbers when no header, e.g. 'X' for X0, X1, ...
-mangle_dupe_cols : boolean, default True
+mangle_dupe_cols : bool, default True
     Duplicate columns will be specified as 'X', 'X.1', ...'X.N', rather than
     'X'...'X'. Passing in False will cause data to be overwritten if there
     are duplicate names in the columns.
@@ -137,24 +138,24 @@
 %s
 converters : dict, default None
     Dict of functions for converting values in certain columns. Keys can either
-    be integers or column labels
+    be integers or column labels.
 true_values : list, default None
-    Values to consider as True
+    Values to consider as True.
 false_values : list, default None
-    Values to consider as False
-skipinitialspace : boolean, default False
+    Values to consider as False.
+skipinitialspace : bool, default False
     Skip spaces after delimiter.
-skiprows : list-like or integer or callable, default None
+skiprows : list-like or int or callable, default None
     Line numbers to skip (0-indexed) or number of lines to skip (int)
     at the start of the file.
 
     If callable, the callable function will be evaluated against the row
     indices, returning True if the row should be skipped and False otherwise.
     An example of a valid callable argument would be ``lambda x: x in [0, 2]``.
 skipfooter : int, default 0
-    Number of lines at bottom of file to skip (Unsupported with engine='c')
+    Number of lines at bottom of file to skip (Unsupported with engine='c').
 nrows : int, default None
-    Number of rows of file to read. Useful for reading pieces of large files
+    Number of rows of file to read. Useful for reading pieces of large files.
 na_values : scalar, str, list-like, or dict, default None
     Additional strings to recognize as NA/NaN. If dict passed, specific
     per-column NA values. By default the following values are interpreted as
@@ -175,16 +176,17 @@
 
     Note that if `na_filter` is passed in as False, the `keep_default_na` and
     `na_values` parameters will be ignored.
-na_filter : boolean, default True
+na_filter : bool, default True
     Detect missing value markers (empty strings and the value of na_values). In
     data without any NAs, passing na_filter=False can improve the performance
-    of reading a large file
-verbose : boolean, default False
-    Indicate number of NA values placed in non-numeric columns
-skip_blank_lines : boolean, default True
-    If True, skip over blank lines rather than interpreting as NaN values
-parse_dates : boolean or list of ints or names or list of lists or dict, \
+    of reading a large file.
+verbose : bool, default False
+    Indicate number of NA values placed in non-numeric columns.
+skip_blank_lines : bool, default True
+    If True, skip over blank lines rather than interpreting as NaN values.
+parse_dates : bool or list of ints or names or list of lists or dict, \
 default False
+    The behavior is as follows:
 
     * boolean. If True -> try parsing the index.
     * list of ints or names. e.g. If [1, 2, 3] -> try parsing columns 1, 2, 3
@@ -199,12 +201,12 @@
     datetime parsing, use ``pd.to_datetime`` after ``pd.read_csv``
 
     Note: A fast-path exists for iso8601-formatted dates.
-infer_datetime_format : boolean, default False
+infer_datetime_format : bool, default False
     If True and `parse_dates` is enabled, pandas will attempt to infer the
     format of the datetime strings in the columns, and if it can be inferred,
     switch to a faster method of parsing them. In some cases this can increase
     the parsing speed by 5-10x.
-keep_date_col : boolean, default False
+keep_date_col : bool, default False
     If True and `parse_dates` specifies combining multiple columns then
     keep the original columns.
 date_parser : function, default None
@@ -217,9 +219,9 @@
     and pass that; and 3) call `date_parser` once for each row using one or
     more strings (corresponding to the columns defined by `parse_dates`) as
     arguments.
-dayfirst : boolean, default False
-    DD/MM format dates, international and European format
-iterator : boolean, default False
+dayfirst : bool, default False
+    DD/MM format dates, international and European format.
+iterator : bool, default False
     Return TextFileReader object for iteration or getting chunks with
     ``get_chunk()``.
 chunksize : int, default None
@@ -237,10 +239,10 @@
     .. versionadded:: 0.18.1 support for 'zip' and 'xz' compression.
 
 thousands : str, default None
-    Thousands separator
+    Thousands separator.
 decimal : str, default '.'
     Character to recognize as decimal point (e.g. use ',' for European data).
-float_precision : string, default None
+float_precision : str, default None
     Specifies which converter the C engine should use for floating-point
     values. The options are `None` for the ordinary converter,
     `high` for the high-precision converter, and `round_trip` for the
@@ -253,7 +255,7 @@
 quoting : int or csv.QUOTE_* instance, default 0
     Control field quoting behavior per ``csv.QUOTE_*`` constants. Use one of
     QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3).
-doublequote : boolean, default ``True``
+doublequote : bool, default ``True``
     When quotechar is specified and quoting is not ``QUOTE_NONE``, indicate
     whether or not to interpret two consecutive quotechar elements INSIDE a
     field as a single ``quotechar`` element.
@@ -270,35 +272,35 @@
 encoding : str, default None
     Encoding to use for UTF when reading/writing (ex. 'utf-8'). `List of Python
     standard encodings
-    <https://docs.python.org/3/library/codecs.html#standard-encodings>`_
+    <https://docs.python.org/3/library/codecs.html#standard-encodings>`_ .
 dialect : str or csv.Dialect instance, default None
     If provided, this parameter will override values (default or not) for the
     following parameters: `delimiter`, `doublequote`, `escapechar`,
     `skipinitialspace`, `quotechar`, and `quoting`. If it is necessary to
     override values, a ParserWarning will be issued. See csv.Dialect
     documentation for more details.
-tupleize_cols : boolean, default False
+tupleize_cols : bool, default False
     .. deprecated:: 0.21.0
        This argument will be removed and will always convert to MultiIndex
 
     Leave a list of tuples on columns as is (default is to convert to
-    a MultiIndex on the columns)
-error_bad_lines : boolean, default True
+    a MultiIndex on the columns).
+error_bad_lines : bool, default True
     Lines with too many fields (e.g. a csv line with too many commas) will by
     default cause an exception to be raised, and no DataFrame will be returned.
     If False, then these "bad lines" will dropped from the DataFrame that is
     returned.
-warn_bad_lines : boolean, default True
+warn_bad_lines : bool, default True
     If error_bad_lines is False, and warn_bad_lines is True, a warning for each
     "bad line" will be output.
-low_memory : boolean, default True
+low_memory : bool, default True
     Internally process the file in chunks, resulting in lower memory use
     while parsing, but possibly mixed type inference. To ensure no mixed
     types either set False, or specify the type with the `dtype` parameter.
     Note that the entire file is read into a single DataFrame regardless,
     use the `chunksize` or `iterator` parameter to return the data in chunks.
-    (Only valid with C parser)
-memory_map : boolean, default False
+    (Only valid with C parser).
+memory_map : bool, default False
     If a filepath is provided for `filepath_or_buffer`, map the file object
     directly onto memory and access the data directly from there. Using this
     option can improve performance because there is no longer any I/O overhead.
@@ -320,12 +322,13 @@
     tool, ``csv.Sniffer``. In addition, separators longer than 1 character and
     different from ``'\s+'`` will be interpreted as regular expressions and
     will also force the use of the Python parsing engine. Note that regex
-    delimiters are prone to ignoring quoted data. Regex example: ``'\r\t'``
+    delimiters are prone to ignoring quoted data. Regex example: ``'\r\t'``.
 delimiter : str, default ``None``
-    Alternative argument name for sep."""
+    Alternative argument name for sep.
+"""
 
 _read_csv_doc = """
-Read CSV (comma-separated) file into DataFrame
+Read CSV (comma-separated) file into DataFrame.
 
 %s
 """ % (_parser_params % (_sep_doc.format(default="','"), _engine_doc))
@@ -1994,18 +1997,18 @@ def TextParser(*args, **kwds):
         rows will be discarded
     index_col : int or list, default None
         Column or columns to use as the (possibly hierarchical) index
-    has_index_names: boolean, default False
+    has_index_names: bool, default False
         True if the cols defined in index_col have an index name and are
-        not in the header
+        not in the header.
     na_values : scalar, str, list-like, or dict, default None
         Additional strings to recognize as NA/NaN.
     keep_default_na : bool, default True
     thousands : str, default None
         Thousands separator
     comment : str, default None
         Comment out remainder of line
-    parse_dates : boolean, default False
-    keep_date_col : boolean, default False
+    parse_dates : bool, default False
+    keep_date_col : bool, default False
     date_parser : function, default None
     skiprows : list of integers
         Row numbers to skip
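Several parameters in this ``TextParser`` docstring (``thousands``, ``na_values``) are shared with the public ``read_csv``, which is the easier way to try them out. A hedged sketch with invented data:

```python
# Sketch of the `thousands` and `na_values` behavior listed above,
# exercised through the public read_csv API (TextParser is internal).
# The city data and the "missing" marker are invented for illustration.
import io

import pandas as pd

raw = io.StringIO(
    'city,population\n'
    'Springfield,"1,234,567"\n'
    'Shelbyville,missing\n'
)

# thousands="," strips grouping separators before numeric conversion;
# na_values adds "missing" to the default set of NA markers.
df = pd.read_csv(raw, thousands=",", na_values=["missing"])
print(df["population"].tolist())  # [1234567.0, nan]
```

Note the column comes back as float64: introducing a NaN forces the integer column to a floating-point dtype.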
@@ -2016,15 +2019,15 @@ def TextParser(*args, **kwds):
         either be integers or column labels, values are functions that take one
         input argument, the cell (not column) content, and return the
         transformed content.
-    encoding : string, default None
+    encoding : str, default None
         Encoding to use for UTF when reading/writing (ex. 'utf-8')
-    squeeze : boolean, default False
-        returns Series if only one column
-    infer_datetime_format: boolean, default False
+    squeeze : bool, default False
+        returns Series if only one column.
+    infer_datetime_format: bool, default False
         If True and `parse_dates` is True for a column, try to infer the
         datetime format based on the first datetime string. If the format
         can be inferred, there often will be a large parsing speed-up.
-    float_precision : string, default None
+    float_precision : str, default None
         Specifies which converter the C engine should use for floating-point
         values. The options are None for the ordinary converter,
         'high' for the high-precision converter, and 'round_trip' for the
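The ``float_precision`` option mentioned here can be seen in action with the C engine's ``'round_trip'`` converter, which is intended to parse a decimal string back to the identical Python float. The value below is an arbitrary example:

```python
# Sketch of the `float_precision` option documented above: the
# "round_trip" converter should reproduce Python's own float parsing
# exactly, so the value survives a text round trip unchanged.
import io

import pandas as pd

text = "x\n0.12345678901234567\n"

exact = pd.read_csv(io.StringIO(text), float_precision="round_trip")
print(exact["x"].iloc[0] == float("0.12345678901234567"))  # True
```

The default converter trades a little precision for speed, so the last digit or two of a 17-significant-digit value may differ under it; ``'round_trip'`` removes that discrepancy.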