Commit 883df65

gfyoung authored and jreback committed
BUG: Fix inconsistent C engine quoting behaviour
1) Add significant testing to quoting in read_csv
2) Fix bug in C engine in which a NULL `quotechar` would raise even though `quoting=csv.QUOTE_NONE`.
3) Fix bug in C engine in which `quoting=csv.QUOTE_NONNUMERIC` wouldn't cause non-quoted fields to be cast to `float`.

Author: gfyoung <[email protected]>

Closes #13411 from gfyoung/quoting-read-csv-tests and squashes the following commits:

0e791a5 [gfyoung] BUG: Fix inconsistent C engine quoting behaviour
1 parent 013c2ce commit 883df65
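For context, the `csv.QUOTE_*` semantics the C engine is being aligned with here are those of the standard-library reader; a quick sketch of the two behaviors this commit fixes, shown with stdlib `csv` (which already behaves this way):

```python
import csv
import io

# With QUOTE_NONNUMERIC, the stdlib reader casts every *unquoted*
# field to float -- the behaviour this commit brings to the C engine.
rows_nonnum = list(csv.reader(io.StringIO('1,2,"foo"'),
                              quoting=csv.QUOTE_NONNUMERIC))
print(rows_nonnum)  # [[1.0, 2.0, 'foo']]

# With QUOTE_NONE, quote characters get no special handling at all,
# so a null quotechar is perfectly sensible.
rows_none = list(csv.reader(io.StringIO('1,2,"foo"'),
                            quoting=csv.QUOTE_NONE))
print(rows_none)  # [['1', '2', '"foo"']]
```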

File tree: 8 files changed, +196 −30 lines changed


doc/source/io.rst (+2 −3)

@@ -287,11 +287,10 @@ lineterminator : str (length 1), default ``None``
 quotechar : str (length 1)
     The character used to denote the start and end of a quoted item. Quoted items
     can include the delimiter and it will be ignored.
-quoting : int or ``csv.QUOTE_*`` instance, default ``None``
+quoting : int or ``csv.QUOTE_*`` instance, default ``0``
     Control field quoting behavior per ``csv.QUOTE_*`` constants. Use one of
     ``QUOTE_MINIMAL`` (0), ``QUOTE_ALL`` (1), ``QUOTE_NONNUMERIC`` (2) or
-    ``QUOTE_NONE`` (3). Default (``None``) results in ``QUOTE_MINIMAL``
-    behavior.
+    ``QUOTE_NONE`` (3).
 doublequote : boolean, default ``True``
     When ``quotechar`` is specified and ``quoting`` is not ``QUOTE_NONE``,
     indicate whether or not to interpret two consecutive ``quotechar`` elements

doc/source/merging.rst (+2 −2)

@@ -1113,7 +1113,7 @@ Timeseries friendly merging
 Merging Ordered Data
 ~~~~~~~~~~~~~~~~~~~~

-A :func:`pd.merge_ordered` function allows combining time series and other
+A :func:`merge_ordered` function allows combining time series and other
 ordered data. In particular it has an optional ``fill_method`` keyword to
 fill/interpolate missing data:

@@ -1135,7 +1135,7 @@ Merging AsOf

 .. versionadded:: 0.18.2

-A :func:`pd.merge_asof` is similar to an ordered left-join except that we match on nearest key rather than equal keys. For each row in the ``left`` DataFrame, we select the last row in the ``right`` DataFrame whose ``on`` key is less than the left's key. Both DataFrames must be sorted by the key.
+A :func:`merge_asof` is similar to an ordered left-join except that we match on nearest key rather than equal keys. For each row in the ``left`` DataFrame, we select the last row in the ``right`` DataFrame whose ``on`` key is less than the left's key. Both DataFrames must be sorted by the key.

 Optionally an asof merge can perform a group-wise merge. This matches the ``by`` key equally,
 in addition to the nearest match on the ``on`` key.
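The asof join described in the merging.rst hunk can be sketched with a minimal example (assumes pandas >= 0.18.2 is installed; the frame contents are illustrative, not from this commit):

```python
import pandas as pd

# Both frames must be sorted by the 'on' key.
left = pd.DataFrame({'a': [1, 5, 10], 'left_val': ['a', 'b', 'c']})
right = pd.DataFrame({'a': [1, 2, 3, 6, 7],
                      'right_val': [1, 2, 3, 6, 7]})

# For each left row, take the last right row whose 'a' does not
# exceed the left key (a nearest-key, not equal-key, match).
result = pd.merge_asof(left, right, on='a')
print(result['right_val'].tolist())  # [1, 3, 7]
```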

doc/source/whatsnew/v0.18.2.txt (+14 −10)

@@ -9,7 +9,7 @@ We recommend that all users upgrade to this version.

 Highlights include:

-- ``pd.merge_asof()`` for asof-style time-series joining, see :ref:`here <whatsnew_0182.enhancements.asof_merge>`
+- :func:`merge_asof` for asof-style time-series joining, see :ref:`here <whatsnew_0182.enhancements.asof_merge>`

 .. contents:: What's new in v0.18.2
     :local:

@@ -22,8 +22,8 @@ New features

 .. _whatsnew_0182.enhancements.asof_merge:

-``pd.merge_asof()`` for asof-style time-series joining
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+:func:`merge_asof` for asof-style time-series joining
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

 A long-time requested feature has been added through the :func:`merge_asof` function, to
 support asof style joining of time-series. (:issue:`1870`). Full documentation is

@@ -113,10 +113,10 @@ passed DataFrame (``trades`` in this case), with the fields of the ``quotes`` me

 .. _whatsnew_0182.enhancements.read_csv_dupe_col_names_support:

-``pd.read_csv`` has improved support for duplicate column names
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+:func:`read_csv` has improved support for duplicate column names
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-:ref:`Duplicate column names <io.dupe_names>` are now supported in ``pd.read_csv()`` whether
+:ref:`Duplicate column names <io.dupe_names>` are now supported in :func:`read_csv` whether
 they are in the file or passed in as the ``names`` parameter (:issue:`7160`, :issue:`9424`)

 .. ipython :: python

@@ -187,7 +187,7 @@ Other enhancements

 - The ``.tz_localize()`` method of ``DatetimeIndex`` and ``Timestamp`` has gained the ``errors`` keyword, so you can potentially coerce nonexistent timestamps to ``NaT``. The default behaviour remains to raising a ``NonExistentTimeError`` (:issue:`13057`)

-- ``Index`` now supports ``.str.extractall()`` which returns ``DataFrame``, see :ref:`Extract all matches in each subject (extractall) <text.extractall>` (:issue:`10008`, :issue:`13156`)
+- ``Index`` now supports ``.str.extractall()`` which returns a ``DataFrame``, see :ref:`documentation here <text.extractall>` (:issue:`10008`, :issue:`13156`)
 - ``.to_hdf/read_hdf()`` now accept path objects (e.g. ``pathlib.Path``, ``py.path.local``) for the file path (:issue:`11773`)

 .. ipython:: python

@@ -406,6 +406,8 @@ New Behavior:
 s.describe(percentiles=[0.0001, 0.0005, 0.001, 0.999, 0.9995, 0.9999])
 df.describe(percentiles=[0.0001, 0.0005, 0.001, 0.999, 0.9995, 0.9999])

+Furthermore:
+
 - Passing duplicated ``percentiles`` will now raise a ``ValueError``.
 - Bug in ``.describe()`` on a DataFrame with a mixed-dtype column index, which would previously raise a ``TypeError`` (:issue:`13288`)

@@ -462,7 +464,7 @@ Bug Fixes

 - Bug in calling ``.memory_usage()`` on object which doesn't implement (:issue:`12924`)

-- Regression in ``Series.quantile`` with nans (also shows up in ``.median()`` and ``.describe()``); furthermore now names the ``Series`` with the quantile (:issue:`13098`, :issue:`13146`)
+- Regression in ``Series.quantile`` with nans (also shows up in ``.median()`` and ``.describe()`` ); furthermore now names the ``Series`` with the quantile (:issue:`13098`, :issue:`13146`)

 - Bug in ``SeriesGroupBy.transform`` with datetime values and missing groups (:issue:`13191`)

@@ -473,9 +475,9 @@ Bug Fixes
 - Bug in ``PeriodIndex`` and ``Period`` subtraction raises ``AttributeError`` (:issue:`13071`)
 - Bug in ``PeriodIndex`` construction returning a ``float64`` index in some circumstances (:issue:`13067`)
 - Bug in ``.resample(..)`` with a ``PeriodIndex`` not changing its ``freq`` appropriately when empty (:issue:`13067`)
-- Bug in ``.resample(..)`` with a ``PeriodIndex`` not retaining its type or name with an empty ``DataFrame``appropriately when empty (:issue:`13212`)
+- Bug in ``.resample(..)`` with a ``PeriodIndex`` not retaining its type or name with an empty ``DataFrame`` appropriately when empty (:issue:`13212`)
 - Bug in ``groupby(..).resample(..)`` where passing some keywords would raise an exception (:issue:`13235`)
-- Bug in ``.tz_convert`` on a tz-aware ``DateTimeIndex`` that relied on index being sorted for correct results (:issue: `13306`)
+- Bug in ``.tz_convert`` on a tz-aware ``DateTimeIndex`` that relied on index being sorted for correct results (:issue:`13306`)
 - Bug in ``pd.read_hdf()`` where attempting to load an HDF file with a single dataset, that had one or more categorical columns, failed unless the key argument was set to the name of the dataset. (:issue:`13231`)

@@ -493,6 +495,8 @@ Bug Fixes
 - Bug in ``pd.read_csv()`` with ``engine='python'`` in which trailing ``NaN`` values were not being parsed (:issue:`13320`)
 - Bug in ``pd.read_csv()`` that prevents ``usecols`` kwarg from accepting single-byte unicode strings (:issue:`13219`)
 - Bug in ``pd.read_csv()`` that prevents ``usecols`` from being an empty set (:issue:`13402`)
+- Bug in ``pd.read_csv()`` with ``engine=='c'`` in which null ``quotechar`` was not accepted even though ``quoting`` was specified as ``None`` (:issue:`13411`)
+- Bug in ``pd.read_csv()`` with ``engine=='c'`` in which fields were not properly cast to float when quoting was specified as non-numeric (:issue:`13411`)
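The two new whatsnew bullets can be exercised directly; a sketch assuming a pandas version that includes this fix (0.18.2 or later) with the default C engine:

```python
import csv
import io
import pandas as pd

# A null quotechar is now accepted when quoting is QUOTE_NONE.
df1 = pd.read_csv(io.StringIO('a,b,c\n1,2,3'),
                  quotechar=None, quoting=csv.QUOTE_NONE)

# QUOTE_NONNUMERIC now casts unquoted fields to float in the
# C engine, matching the stdlib csv module.
df2 = pd.read_csv(io.StringIO('1,2,"foo"'), names=['a', 'b', 'c'],
                  quoting=csv.QUOTE_NONNUMERIC)
print(df2.dtypes['a'])  # float64
```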

pandas/io/parsers.py (+1 −2)

@@ -202,10 +202,9 @@
 quotechar : str (length 1), optional
     The character used to denote the start and end of a quoted item. Quoted
     items can include the delimiter and it will be ignored.
-quoting : int or csv.QUOTE_* instance, default None
+quoting : int or csv.QUOTE_* instance, default 0
     Control field quoting behavior per ``csv.QUOTE_*`` constants. Use one of
     QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3).
-    Default (None) results in QUOTE_MINIMAL behavior.
 doublequote : boolean, default ``True``
     When quotechar is specified and quoting is not ``QUOTE_NONE``, indicate
     whether or not to interpret two consecutive quotechar elements INSIDE a

pandas/io/tests/parser/quoting.py (new file, +140)

# -*- coding: utf-8 -*-

"""
Tests that quoting specifications are properly handled
during parsing for all of the parsers defined in parsers.py
"""

import csv
import pandas.util.testing as tm

from pandas import DataFrame
from pandas.compat import StringIO


class QuotingTests(object):

    def test_bad_quote_char(self):
        data = '1,2,3'

        # Python 2.x: "...must be an 1-character..."
        # Python 3.x: "...must be a 1-character..."
        msg = '"quotechar" must be a(n)? 1-character string'
        tm.assertRaisesRegexp(TypeError, msg, self.read_csv,
                              StringIO(data), quotechar='foo')

        msg = 'quotechar must be set if quoting enabled'
        tm.assertRaisesRegexp(TypeError, msg, self.read_csv,
                              StringIO(data), quotechar=None,
                              quoting=csv.QUOTE_MINIMAL)

        msg = '"quotechar" must be string, not int'
        tm.assertRaisesRegexp(TypeError, msg, self.read_csv,
                              StringIO(data), quotechar=2)

    def test_bad_quoting(self):
        data = '1,2,3'

        msg = '"quoting" must be an integer'
        tm.assertRaisesRegexp(TypeError, msg, self.read_csv,
                              StringIO(data), quoting='foo')

        # quoting must be in the range [0, 3]
        msg = 'bad "quoting" value'
        tm.assertRaisesRegexp(TypeError, msg, self.read_csv,
                              StringIO(data), quoting=5)

    def test_quote_char_basic(self):
        data = 'a,b,c\n1,2,"cat"'
        expected = DataFrame([[1, 2, 'cat']],
                             columns=['a', 'b', 'c'])
        result = self.read_csv(StringIO(data), quotechar='"')
        tm.assert_frame_equal(result, expected)

    def test_quote_char_various(self):
        data = 'a,b,c\n1,2,"cat"'
        expected = DataFrame([[1, 2, 'cat']],
                             columns=['a', 'b', 'c'])
        quote_chars = ['~', '*', '%', '$', '@', 'P']

        for quote_char in quote_chars:
            new_data = data.replace('"', quote_char)
            result = self.read_csv(StringIO(new_data), quotechar=quote_char)
            tm.assert_frame_equal(result, expected)

    def test_null_quote_char(self):
        data = 'a,b,c\n1,2,3'

        # sanity checks
        msg = 'quotechar must be set if quoting enabled'

        tm.assertRaisesRegexp(TypeError, msg, self.read_csv,
                              StringIO(data), quotechar=None,
                              quoting=csv.QUOTE_MINIMAL)

        tm.assertRaisesRegexp(TypeError, msg, self.read_csv,
                              StringIO(data), quotechar='',
                              quoting=csv.QUOTE_MINIMAL)

        # no errors should be raised if quoting is None
        expected = DataFrame([[1, 2, 3]],
                             columns=['a', 'b', 'c'])

        result = self.read_csv(StringIO(data), quotechar=None,
                               quoting=csv.QUOTE_NONE)
        tm.assert_frame_equal(result, expected)

        result = self.read_csv(StringIO(data), quotechar='',
                               quoting=csv.QUOTE_NONE)
        tm.assert_frame_equal(result, expected)

    def test_quoting_various(self):
        data = '1,2,"foo"'
        cols = ['a', 'b', 'c']

        # QUOTE_MINIMAL and QUOTE_ALL apply only to
        # the CSV writer, so they should have no
        # special effect for the CSV reader
        expected = DataFrame([[1, 2, 'foo']], columns=cols)

        # test default (afterwards, arguments are all explicit)
        result = self.read_csv(StringIO(data), names=cols)
        tm.assert_frame_equal(result, expected)

        result = self.read_csv(StringIO(data), quotechar='"',
                               quoting=csv.QUOTE_MINIMAL, names=cols)
        tm.assert_frame_equal(result, expected)

        result = self.read_csv(StringIO(data), quotechar='"',
                               quoting=csv.QUOTE_ALL, names=cols)
        tm.assert_frame_equal(result, expected)

        # QUOTE_NONE tells the reader to do no special handling
        # of quote characters and leave them alone
        expected = DataFrame([[1, 2, '"foo"']], columns=cols)
        result = self.read_csv(StringIO(data), quotechar='"',
                               quoting=csv.QUOTE_NONE, names=cols)
        tm.assert_frame_equal(result, expected)

        # QUOTE_NONNUMERIC tells the reader to cast
        # all non-quoted fields to float
        expected = DataFrame([[1.0, 2.0, 'foo']], columns=cols)
        result = self.read_csv(StringIO(data), quotechar='"',
                               quoting=csv.QUOTE_NONNUMERIC,
                               names=cols)
        tm.assert_frame_equal(result, expected)

    def test_double_quote(self):
        data = 'a,b\n3,"4 "" 5"'

        expected = DataFrame([[3, '4 " 5']],
                             columns=['a', 'b'])
        result = self.read_csv(StringIO(data), quotechar='"',
                               doublequote=True)
        tm.assert_frame_equal(result, expected)

        expected = DataFrame([[3, '4 " 5"']],
                             columns=['a', 'b'])
        result = self.read_csv(StringIO(data), quotechar='"',
                               doublequote=False)
        tm.assert_frame_equal(result, expected)
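The expectation in ``test_double_quote`` above mirrors the stdlib reader, which can be checked directly without pandas:

```python
import csv
import io

# doublequote=True (the default): a doubled quotechar ("") inside a
# quoted field collapses to a single literal quote character.
rows = list(csv.reader(io.StringIO('a,b\n3,"4 "" 5"'),
                       quotechar='"', doublequote=True))
print(rows[1])  # ['3', '4 " 5']
```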

pandas/io/tests/parser/test_parsers.py (+2 −1)

@@ -11,6 +11,7 @@
 from .common import ParserTests
 from .header import HeaderTests
 from .comment import CommentTests
+from .quoting import QuotingTests
 from .usecols import UsecolsTests
 from .skiprows import SkipRowsTests
 from .index_col import IndexColTests

@@ -28,7 +29,7 @@ class BaseParser(CommentTests, CompressionTests,
                  IndexColTests, MultithreadTests,
                  NAvaluesTests, ParseDatesTests,
                  ParserTests, SkipRowsTests,
-                 UsecolsTests):
+                 UsecolsTests, QuotingTests):
     def read_csv(self, *args, **kwargs):
         raise NotImplementedError

pandas/parser.pyx (+33 −9)

@@ -7,6 +7,7 @@ from libc.string cimport strncpy, strlen, strcmp, strcasecmp
 cimport libc.stdio as stdio
 import warnings

+from csv import QUOTE_MINIMAL, QUOTE_NONNUMERIC, QUOTE_NONE
 from cpython cimport (PyObject, PyBytes_FromString,
                       PyBytes_AsString, PyBytes_Check,
                       PyUnicode_Check, PyUnicode_AsUTF8String)

@@ -283,6 +284,7 @@ cdef class TextReader:
     object compression
     object mangle_dupe_cols
     object tupleize_cols
+    list dtype_cast_order
     set noconvert, usecols

     def __cinit__(self, source,

@@ -393,8 +395,13 @@ cdef class TextReader:
             raise ValueError('Only length-1 escapes supported')
         self.parser.escapechar = ord(escapechar)

-        self.parser.quotechar = ord(quotechar)
-        self.parser.quoting = quoting
+        self._set_quoting(quotechar, quoting)
+
+        # TODO: endianness just a placeholder?
+        if quoting == QUOTE_NONNUMERIC:
+            self.dtype_cast_order = ['<f8', '<i8', '|b1', '|O8']
+        else:
+            self.dtype_cast_order = ['<i8', '<f8', '|b1', '|O8']

         if comment is not None:
             if len(comment) > 1:

@@ -548,6 +555,29 @@ cdef class TextReader:
     def set_error_bad_lines(self, int status):
         self.parser.error_bad_lines = status

+    def _set_quoting(self, quote_char, quoting):
+        if not isinstance(quoting, int):
+            raise TypeError('"quoting" must be an integer')
+
+        if not QUOTE_MINIMAL <= quoting <= QUOTE_NONE:
+            raise TypeError('bad "quoting" value')
+
+        if not isinstance(quote_char, (str, bytes)) and quote_char is not None:
+            dtype = type(quote_char).__name__
+            raise TypeError('"quotechar" must be string, '
+                            'not {dtype}'.format(dtype=dtype))
+
+        if quote_char is None or quote_char == '':
+            if quoting != QUOTE_NONE:
+                raise TypeError("quotechar must be set if quoting enabled")
+            self.parser.quoting = quoting
+            self.parser.quotechar = -1
+        elif len(quote_char) > 1:  # 0-len case handled earlier
+            raise TypeError('"quotechar" must be a 1-character string')
+        else:
+            self.parser.quoting = quoting
+            self.parser.quotechar = ord(quote_char)
+
     cdef _make_skiprow_set(self):
         if isinstance(self.skiprows, (int, np.integer)):
             parser_set_skipfirstnrows(self.parser, self.skiprows)

@@ -1066,7 +1096,7 @@ cdef class TextReader:
             return self._string_convert(i, start, end, na_filter, na_hashset)
         else:
             col_res = None
-            for dt in dtype_cast_order:
+            for dt in self.dtype_cast_order:
                 try:
                     col_res, na_count = self._convert_with_dtype(
                         dt, i, start, end, na_filter, 0, na_hashset, na_flist)

@@ -1847,12 +1877,6 @@ cdef kh_float64_t* kset_float64_from_list(values) except NULL:
     return table


-# if at first you don't succeed...
-
-# TODO: endianness just a placeholder?
-cdef list dtype_cast_order = ['<i8', '<f8', '|b1', '|O8']
-
-
 cdef raise_parser_error(object base, parser_t *parser):
     message = '%s. C error: ' % base
     if parser.error_msg != NULL:
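The validation order of the new ``_set_quoting`` method can be read as plain Python; a sketch in which ``validate_quoting`` is a hypothetical stand-in for the Cython method, returning the ``(quoting, quotechar)`` pair that would be stored on the C parser:

```python
import csv

def validate_quoting(quote_char, quoting):
    # Hypothetical plain-Python mirror of TextReader._set_quoting.
    if not isinstance(quoting, int):
        raise TypeError('"quoting" must be an integer')
    if not csv.QUOTE_MINIMAL <= quoting <= csv.QUOTE_NONE:
        raise TypeError('bad "quoting" value')
    if not isinstance(quote_char, (str, bytes)) and quote_char is not None:
        raise TypeError('"quotechar" must be string, not %s'
                        % type(quote_char).__name__)
    if quote_char is None or quote_char == '':
        if quoting != csv.QUOTE_NONE:
            raise TypeError('quotechar must be set if quoting enabled')
        return quoting, -1  # -1 disables quote handling in the tokenizer
    if len(quote_char) > 1:
        raise TypeError('"quotechar" must be a 1-character string')
    return quoting, ord(quote_char)

print(validate_quoting(None, csv.QUOTE_NONE))     # (3, -1)
print(validate_quoting('"', csv.QUOTE_MINIMAL))   # (0, 34)
```

The key fix is the first branch: a null or empty ``quotechar`` is accepted, but only when ``quoting`` is ``QUOTE_NONE``.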

pandas/tools/merge.py (+2 −3)

@@ -337,9 +337,8 @@ def merge_asof(left, right, on=None,
     1   5        b     3.0
     2  10        c     7.0

-    For this example, we can achieve a similar result thru ``pd.merge_ordered()``,
-    though its not nearly as performant.
-
+    For this example, we can achieve a similar result thru
+    ``pd.merge_ordered()``, though its not nearly as performant.

     >>> (pd.merge_ordered(left, right, on='a')
     ...  .ffill()
