Commit 7168d98

Merge pull request #6889 from mcwitt/fix-gh6607
BUG/ENH: Add fallback warnings and correctly handle leading whitespace in C parser
2 parents 759a907 + f45b714 commit 7168d98

8 files changed, +677 -137 lines changed

doc/source/io.rst

+24 -1
@@ -92,7 +92,8 @@ They can take a number of arguments:
   - ``dialect``: string or :class:`python:csv.Dialect` instance to expose more
     ways to specify the file format
   - ``dtype``: A data type name or a dict of column name to data type. If not
-    specified, data types will be inferred.
+    specified, data types will be inferred. (Unsupported with
+    ``engine='python'``)
   - ``header``: row number(s) to use as the column names, and the start of the
     data. Defaults to 0 if no ``names`` passed, otherwise ``None``. Explicitly
     pass ``header=0`` to be able to replace existing names. The header can be
@@ -154,6 +155,7 @@ They can take a number of arguments:
     pieces. Will cause an ``TextFileReader`` object to be returned. More on this
     below in the section on :ref:`iterating and chunking <io.chunking>`
   - ``skip_footer``: number of lines to skip at bottom of file (default 0)
+    (Unsupported with ``engine='c'``)
   - ``converters``: a dictionary of functions for converting values in certain
     columns, where keys are either integers or column labels
   - ``encoding``: a string representing the encoding to use for decoding
@@ -275,6 +277,11 @@ individual columns:
    df = pd.read_csv(StringIO(data), dtype={'b': object, 'c': np.float64})
    df.dtypes

+.. note::
+   The ``dtype`` option is currently only supported by the C engine.
+   Specifying ``dtype`` with ``engine`` other than 'c' raises a
+   ``ValueError``.
+
 .. _io.headers:

 Handling column names
@@ -1029,6 +1036,22 @@ Specifying ``iterator=True`` will also return the ``TextFileReader`` object:
    os.remove('tmp.sv')
    os.remove('tmp2.sv')

+Specifying the parser engine
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Under the hood pandas uses a fast and efficient parser implemented in C as well
+as a python implementation which is currently more feature-complete. Where
+possible pandas uses the C parser (specified as ``engine='c'``), but may fall
+back to python if C-unsupported options are specified. Currently, C-unsupported
+options include:
+
+- ``sep`` other than a single character (e.g. regex separators)
+- ``skip_footer``
+- ``sep=None`` with ``delim_whitespace=False``
+
+Specifying any of the above options will produce a ``ParserWarning`` unless the
+python engine is selected explicitly using ``engine='python'``.
+
 .. _io.store_in_csv:

 Writing to CSV format
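
The fallback rules listed in the new "Specifying the parser engine" section can be sketched as a small standalone function. This is only an illustration and not the pandas implementation: `c_fallback_reason` is a hypothetical helper, its defaults are assumptions chosen for the example, and the special case for `sep='\s+'` follows the release note in this commit.

```python
def c_fallback_reason(sep=',', delim_whitespace=False, skip_footer=0):
    """Return why the C engine cannot be used, or None if it can.

    Hypothetical helper for illustration; it is not part of pandas.
    """
    if skip_footer > 0:
        return "the 'c' engine does not support skip_footer"
    if sep is None and not delim_whitespace:
        return ("the 'c' engine does not support"
                " sep=None with delim_whitespace=False")
    # Per the release notes, sep='\s+' is translated to delim_whitespace=True
    # instead of being treated as a regex separator.
    if sep is not None and len(sep) > 1 and sep != r'\s+':
        return "the 'c' engine does not support regex separators"
    return None


# Print the outcome for a few option combinations.
for kwargs in ({}, {'skip_footer': 1}, {'sep': None},
               {'sep': '::'}, {'sep': r'\s+'}):
    print(kwargs, '->', c_fallback_reason(**kwargs))
```

The first matching reason is returned, so `skip_footer` takes precedence when several unsupported options are combined.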

doc/source/release.rst

+16
@@ -176,6 +176,8 @@ API Changes
 - ``.quantile`` on a ``datetime[ns]`` series now returns ``Timestamp`` instead
   of ``np.datetime64`` objects (:issue:`6810`)
 - change ``AssertionError`` to ``TypeError`` for invalid types passed to ``concat`` (:issue:`6583`)
+- Add :class:`~pandas.io.parsers.ParserWarning` class for fallback and option
+  validation warnings in :func:`read_csv`/:func:`read_table` (:issue:`6607`)

 Deprecations
 ~~~~~~~~~~~~
@@ -280,6 +282,9 @@ Improvements to existing features
 - Added ``how`` option to rolling-moment functions to dictate how to handle resampling; :func:``rolling_max`` defaults to max,
   :func:``rolling_min`` defaults to min, and all others default to mean (:issue:`6297`)
 - ``pd.stats.moments.rolling_var`` now uses Welford's method for increased numerical stability (:issue:`6817`)
+- Translate ``sep='\s+'`` to ``delim_whitespace=True`` in
+  :func:`read_csv`/:func:`read_table` if no other C-unsupported options
+  specified (:issue:`6607`)

 .. _release.bug_fixes-0.14.0:

@@ -402,6 +407,17 @@ Bug Fixes
 - Bug in `DataFrame.plot` and `Series.plot` legend behave inconsistently when plotting to the same axes repeatedly (:issue:`6678`)
 - Internal tests for patching ``__finalize__`` / bug in merge not finalizing (:issue:`6923`, :issue:`6927`)
 - accept ``TextFileReader`` in ``concat``, which was affecting a common user idiom (:issue:`6583`)
+- Raise :class:`ValueError` when ``sep`` specified with
+  ``delim_whitespace=True`` in :func:`read_csv`/:func:`read_table`
+  (:issue:`6607`)
+- Raise :class:`ValueError` when `engine='c'` specified with unsupported
+  options (:issue:`6607`)
+- Raise :class:`ValueError` when fallback to python parser causes options to be
+  ignored (:issue:`6607`)
+- Produce :class:`~pandas.io.parsers.ParserWarning` on fallback to python
+  parser when no options are ignored (:issue:`6607`)
+- Bug in C parser with leading whitespace (:issue:`3374`)
+- Bug in C parser with ``delim_whitespace=True`` and ``\r``-delimited lines

 pandas 0.13.1
 -------------

pandas/io/parsers.py

+81 -21
@@ -6,6 +6,7 @@
 from pandas import compat
 import re
 import csv
+import warnings

 import numpy as np

@@ -24,6 +25,8 @@
 import pandas.tslib as tslib
 import pandas.parser as _parser

+class ParserWarning(Warning):
+    pass

 _parser_params = """Also supports optionally iterating or breaking of the file
 into chunks.
@@ -50,6 +53,7 @@
     One-character string used to escape delimiter when quoting is QUOTE_NONE.
 dtype : Type name or dict of column -> type
     Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32}
+    (Unsupported with engine='python')
 compression : {'gzip', 'bz2', None}, default None
     For on-the-fly decompression of on-disk data
 dialect : string or csv.Dialect instance, default None
@@ -113,7 +117,7 @@
 chunksize : int, default None
     Return TextFileReader object for iteration
 skipfooter : int, default 0
-    Number of line at bottom of file to skip
+    Number of lines at bottom of file to skip (Unsupported with engine='c')
 converters : dict. optional
     Dict of functions for converting values in certain columns. Keys can either
     be integers or column labels
@@ -125,24 +129,24 @@
     Encoding to use for UTF when reading/writing (ex. 'utf-8')
 squeeze : boolean, default False
     If the parsed data only contains one column then return a Series
-na_filter: boolean, default True
+na_filter : boolean, default True
     Detect missing value markers (empty strings and the value of na_values). In
     data without any NAs, passing na_filter=False can improve the performance
     of reading a large file
 usecols : array-like
     Return a subset of the columns.
     Results in much faster parsing time and lower memory usage.
-mangle_dupe_cols: boolean, default True
+mangle_dupe_cols : boolean, default True
     Duplicate columns will be specified as 'X.0'...'X.N', rather than 'X'...'X'
-tupleize_cols: boolean, default False
+tupleize_cols : boolean, default False
     Leave a list of tuples on columns as is (default is to convert to
     a Multi Index on the columns)
-error_bad_lines: boolean, default True
+error_bad_lines : boolean, default True
     Lines with too many fields (e.g. a csv line with too many commas) will by
     default cause an exception to be raised, and no DataFrame will be returned.
     If False, then these "bad lines" will dropped from the DataFrame that is
-    returned. (Only valid with C parser).
-warn_bad_lines: boolean, default True
+    returned. (Only valid with C parser)
+warn_bad_lines : boolean, default True
     If error_bad_lines is False, and warn_bad_lines is True, a warning for each
     "bad line" will be output. (Only valid with C parser).
 infer_datetime_format : boolean, default False
@@ -154,25 +158,30 @@
 result : DataFrame or TextParser
 """

-_csv_sep = """sep : string, default ','
+_csv_params = """sep : string, default ','
     Delimiter to use. If sep is None, will try to automatically determine
     this. Regular expressions are accepted.
-"""
+engine : {'c', 'python'}
+    Parser engine to use. The C engine is faster while the python engine is
+    currently more feature-complete."""

-_table_sep = """sep : string, default \\t (tab-stop)
-    Delimiter to use. Regular expressions are accepted."""
+_table_params = """sep : string, default \\t (tab-stop)
+    Delimiter to use. Regular expressions are accepted.
+engine : {'c', 'python'}
+    Parser engine to use. The C engine is faster while the python engine is
+    currently more feature-complete."""

 _read_csv_doc = """
 Read CSV (comma-separated) file into DataFrame

 %s
-""" % (_parser_params % _csv_sep)
+""" % (_parser_params % _csv_params)

 _read_table_doc = """
 Read general delimited file into DataFrame

 %s
-""" % (_parser_params % _table_sep)
+""" % (_parser_params % _table_params)

 _fwf_widths = """\
 colspecs : list of pairs (int, int) or 'infer'. optional
@@ -297,6 +306,8 @@ def _read(filepath_or_buffer, kwds):

 def _make_parser_function(name, sep=','):

+    default_sep = sep
+
     def parser_f(filepath_or_buffer,
                  sep=sep,
                  dialect=None,
@@ -325,7 +336,7 @@ def parser_f(filepath_or_buffer,
                  dtype=None,
                  usecols=None,

-                 engine='c',
+                 engine=None,
                  delim_whitespace=False,
                  as_recarray=False,
                  na_filter=True,
@@ -362,10 +373,21 @@ def parser_f(filepath_or_buffer,
         if delimiter is None:
             delimiter = sep

+        if delim_whitespace and delimiter is not default_sep:
+            raise ValueError("Specified a delimiter with both sep and"\
+                    " delim_whitespace=True; you can only specify one.")
+
+        if engine is not None:
+            engine_specified = True
+        else:
+            engine = 'c'
+            engine_specified = False
+
         kwds = dict(delimiter=delimiter,
                     engine=engine,
                     dialect=dialect,
                     compression=compression,
+                    engine_specified=engine_specified,

                     doublequote=doublequote,
                     escapechar=escapechar,
@@ -468,10 +490,18 @@ class TextFileReader(object):

     """

-    def __init__(self, f, engine='python', **kwds):
+    def __init__(self, f, engine=None, **kwds):

         self.f = f

+        if engine is not None:
+            engine_specified = True
+        else:
+            engine = 'python'
+            engine_specified = False
+
+        self._engine_specified = kwds.get('engine_specified', engine_specified)
+
         if kwds.get('dialect') is not None:
             dialect = kwds['dialect']
             kwds['delimiter'] = dialect.delimiter
@@ -530,30 +560,60 @@ def _get_options_with_defaults(self, engine):
     def _clean_options(self, options, engine):
         result = options.copy()

+        engine_specified = self._engine_specified
+        fallback_reason = None
+
         sep = options['delimiter']
         delim_whitespace = options['delim_whitespace']

+        # C engine not supported yet
+        if engine == 'c':
+            if options['skip_footer'] > 0:
+                fallback_reason = "the 'c' engine does not support"\
+                                  " skip_footer"
+                engine = 'python'
+
         if sep is None and not delim_whitespace:
             if engine == 'c':
+                fallback_reason = "the 'c' engine does not support"\
+                                  " sep=None with delim_whitespace=False"
                 engine = 'python'
         elif sep is not None and len(sep) > 1:
-            # wait until regex engine integrated
-            if engine not in ('python', 'python-fwf'):
+            if engine == 'c' and sep == '\s+':
+                result['delim_whitespace'] = True
+                del result['delimiter']
+            elif engine not in ('python', 'python-fwf'):
+                # wait until regex engine integrated
+                fallback_reason = "the 'c' engine does not support"\
+                                  " regex separators"
                 engine = 'python'

-        # C engine not supported yet
-        if engine == 'c':
-            if options['skip_footer'] > 0:
-                engine = 'python'
+        if fallback_reason and engine_specified:
+            raise ValueError(fallback_reason)

         if engine == 'c':
             for arg in _c_unsupported:
                 del result[arg]

         if 'python' in engine:
             for arg in _python_unsupported:
+                if fallback_reason and result[arg] != _c_parser_defaults[arg]:
+                    msg = ("Falling back to the 'python' engine because"
+                           " {reason}, but this causes {option!r} to be"
+                           " ignored as it is not supported by the 'python'"
+                           " engine.").format(reason=fallback_reason, option=arg)
+                    if arg == 'dtype':
+                        msg += " (Note the 'converters' option provides"\
+                               " similar functionality.)"
+                    raise ValueError(msg)
                 del result[arg]

+        if fallback_reason:
+            warnings.warn(("Falling back to the 'python' engine because"
+                           " {0}; you can avoid this warning by specifying"
+                           " engine='python'.").format(fallback_reason),
+                          ParserWarning)
+
         index_col = options['index_col']
         names = options['names']
         converters = options['converters']
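
The warn-or-raise behaviour introduced in `_clean_options` can be sketched in isolation with the standard `warnings` module. This is a simplified illustration under stated assumptions: `resolve_engine` is a hypothetical function, the local `ParserWarning` class merely stands in for `pandas.io.parsers.ParserWarning`, and the extra `ValueError` for options that the python engine would ignore is left out.

```python
import warnings


class ParserWarning(Warning):
    """Local stand-in for pandas.io.parsers.ParserWarning (assumption)."""


def resolve_engine(fallback_reason, engine=None):
    """Pick an engine; warn on implicit fallback, raise on explicit 'c'.

    Hypothetical helper that mirrors only the engine_specified logic.
    """
    engine_specified = engine is not None
    if engine is None:
        engine = 'c'
    if engine != 'c' or fallback_reason is None:
        return engine
    if engine_specified:
        # The caller explicitly asked for engine='c' and cannot have it.
        raise ValueError(fallback_reason)
    warnings.warn(("Falling back to the 'python' engine because"
                   " {0}; you can avoid this warning by specifying"
                   " engine='python'.").format(fallback_reason),
                  ParserWarning)
    return 'python'


reason = "the 'c' engine does not support skip_footer"
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # make sure the warning is recorded
    chosen = resolve_engine(reason)

print(chosen, caught[0].category.__name__)
```

Passing `engine='python'` returns without a warning, while passing `engine='c'` together with a fallback reason raises `ValueError`.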

pandas/io/tests/test_cparser.py

+3
@@ -323,6 +323,9 @@ def _test(text, **kwargs):
         data = 'A B C\r 2 3\r4 5 6'
         _test(data, delim_whitespace=True)

+        data = 'A B C\r2 3\r4 5 6'
+        _test(data, delim_whitespace=True)
+
     def test_empty_field_eof(self):
         data = 'a,b,c\n1,2,3\n4,,'

