Commit 5db52f0

gfyoung authored and jreback committed
API: Warn or raise for > 1 char encoded sep
The system file encoding can cause a separator to be encoded as more than one character even though it may be provided as one character. Multi-char separators are not supported by the C engine, so we need to catch this case.

Closes #14065.

Author: gfyoung <[email protected]>

Closes #14120 from gfyoung/multi-char-encoded and squashes the following commits:

152b685 [gfyoung] API: Warn or raise for > 1 char encoded sep
1 parent b2a73b8 commit 5db52f0
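
As an illustration of the condition the commit message describes (this sketch is not part of the commit), a one-character separator can occupy more than one byte once it is encoded. The helper name `encoded_sep_length` is hypothetical. The commit itself reads the encoding from `sys.getfilesystemencoding()`; here the encoding is passed explicitly so the result does not depend on the host system.

```python
def encoded_sep_length(sep, encoding="utf-8"):
    """Return how many bytes `sep` occupies once encoded (hypothetical helper)."""
    return len(sep.encode(encoding))


# A comma is a single byte in UTF-8.
comma_len = encoded_sep_length(",")

# "\u00a7" is the section sign used in the commit's test: one character,
# but more than one byte in UTF-8.
section_utf8_len = encoded_sep_length("\u00a7")

# The same character fits in a single byte under Latin-1.
section_latin1_len = encoded_sep_length("\u00a7", encoding="latin-1")

print(comma_len, section_utf8_len, section_latin1_len)
```

Whether a given separator trips the check therefore depends on the encoding in effect, which is why the same call may behave differently across systems.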

3 files changed, +21 -8 lines changed

doc/source/whatsnew/v0.19.0.txt

+9 -8
@@ -414,12 +414,12 @@ Other enhancements
 - The ``pd.read_csv()`` with ``engine='python'`` has gained support for the ``decimal`` option (:issue:`12933`)
 - The ``pd.read_csv()`` with ``engine='python'`` has gained support for the ``na_filter`` option (:issue:`13321`)
 - The ``pd.read_csv()`` with ``engine='python'`` has gained support for the ``memory_map`` option (:issue:`13381`)
+- Consistent with the Python API, ``pd.read_csv()`` will now interpret ``+inf`` as positive infinity (:issue:`13274`)
 
 - The ``pd.read_html()`` has gained support for the ``na_values``, ``converters``, ``keep_default_na`` options (:issue:`13461`)
 
 - ``Categorical.astype()`` now accepts an optional boolean argument ``copy``, effective when dtype is categorical (:issue:`13209`)
 - ``DataFrame`` has gained the ``.asof()`` method to return the last non-NaN values according to the selected subset (:issue:`13358`)
-- Consistent with the Python API, ``pd.read_csv()`` will now interpret ``+inf`` as positive infinity (:issue:`13274`)
 - The ``DataFrame`` constructor will now respect key ordering if a list of ``OrderedDict`` objects are passed in (:issue:`13304`)
 - ``pd.read_html()`` has gained support for the ``decimal`` option (:issue:`12907`)
 - A function :func:`union_categorical` has been added for combining categoricals, see :ref:`Unioning Categoricals<categorical.union>` (:issue:`13361`, :issue:`:13763`, issue:`13846`)
@@ -473,6 +473,7 @@ API changes
 - ``pd.Timedelta(None)`` is now accepted and will return ``NaT``, mirroring ``pd.Timestamp`` (:issue:`13687`)
 - ``Timestamp``, ``Period``, ``DatetimeIndex``, ``PeriodIndex`` and ``.dt`` accessor have gained a ``.is_leap_year`` property to check whether the date belongs to a leap year. (:issue:`13727`)
 - ``pd.read_hdf`` will now raise a ``ValueError`` instead of ``KeyError``, if a mode other than ``r``, ``r+`` and ``a`` is supplied. (:issue:`13623`)
+- ``pd.read_csv()`` in the C engine will now issue a ``ParserWarning`` or raise a ``ValueError`` when ``sep`` encoded is more than one character long (:issue:`14065`)
 - ``DataFrame.values`` will now return ``float64`` with a ``DataFrame`` of mixed ``int64`` and ``uint64`` dtypes, conforming to ``np.find_common_type`` (:issue:`10364`, :issue:`13917`)
 - ``Series.unique()`` with datetime and timezone now returns return array of ``Timestamp`` with timezone (:issue:`13565`)
 
@@ -1211,10 +1212,6 @@ Bug Fixes
 
 - Bug in ``groupby().shift()``, which could cause a segfault or corruption in rare circumstances when grouping by columns with missing values (:issue:`13813`)
 - Bug in ``groupby().cumsum()`` calculating ``cumprod`` when ``axis=1``. (:issue:`13994`)
-- Bug in ``pd.read_csv()``, which may cause a segfault or corruption when iterating in large chunks over a stream/file under rare circumstances (:issue:`13703`)
-- Bug in ``pd.read_csv()``, which caused errors to be raised when a dictionary containing scalars is passed in for ``na_values`` (:issue:`12224`)
-- Bug in ``pd.read_csv()``, which caused BOM files to be incorrectly parsed by not ignoring the BOM (:issue:`4793`)
-- Bug in ``pd.read_csv()`` with ``engine='python'`` which raised errors when a numpy array was passed in for ``usecols`` (:issue:`12546`)
 - Bug in ``pd.to_timedelta()`` in which the ``errors`` parameter was not being respected (:issue:`13613`)
 - Bug in ``io.json.json_normalize()``, where non-ascii keys raised an exception (:issue:`13213`)
 - Bug when passing a not-default-indexed ``Series`` as ``xerr`` or ``yerr`` in ``.plot()`` (:issue:`11858`)
@@ -1225,7 +1222,6 @@ Bug Fixes
 - Bug in ``Categorical.from_codes()`` where an unhelpful error was raised when an invalid ``ordered`` parameter was passed in (:issue:`14058`)
 - Bug in ``Series`` construction from a tuple of integers on windows not returning default dtype (int64) (:issue:`13646`)
 
-- Bug in ``pd.read_csv()`` where the index columns were being incorrectly parsed when parsed as dates with a ``thousands`` parameter (:issue:`14066`)
 - Bug in ``.groupby(..).resample(..)`` when the same object is called multiple times (:issue:`13174`)
 - Bug in ``.to_records()`` when index name is a unicode string (:issue:`13172`)
 
@@ -1267,6 +1263,11 @@ Bug Fixes
 - Bug in ``MultiIndex.from_arrays`` which didn't check for input array lengths matching (:issue:`13599`)
 
 
+- Bug in ``pd.read_csv()`` which may cause a segfault or corruption when iterating in large chunks over a stream/file under rare circumstances (:issue:`13703`)
+- Bug in ``pd.read_csv()`` which caused errors to be raised when a dictionary containing scalars is passed in for ``na_values`` (:issue:`12224`)
+- Bug in ``pd.read_csv()`` which caused BOM files to be incorrectly parsed by not ignoring the BOM (:issue:`4793`)
+- Bug in ``pd.read_csv()`` with ``engine='python'`` which raised errors when a numpy array was passed in for ``usecols`` (:issue:`12546`)
+- Bug in ``pd.read_csv()`` where the index columns were being incorrectly parsed when parsed as dates with a ``thousands`` parameter (:issue:`14066`)
 - Bug in ``pd.read_csv()`` with ``engine='python'`` in which ``NaN`` values weren't being detected after data was converted to numeric values (:issue:`13314`)
 - Bug in ``pd.read_csv()`` in which the ``nrows`` argument was not properly validated for both engines (:issue:`10476`)
 - Bug in ``pd.read_csv()`` with ``engine='python'`` in which infinities of mixed-case forms were not being interpreted properly (:issue:`13274`)
@@ -1277,6 +1278,8 @@ Bug Fixes
 - Bug in ``pd.read_csv()`` in the C engine where the NULL character was not being parsed as NULL (:issue:`14012`)
 - Bug in ``pd.read_csv()`` with ``engine='c'`` in which NULL ``quotechar`` was not accepted even though ``quoting`` was specified as ``None`` (:issue:`13411`)
 - Bug in ``pd.read_csv()`` with ``engine='c'`` in which fields were not properly cast to float when quoting was specified as non-numeric (:issue:`13411`)
+- Bug in ``pd.read_csv()`` in Python 2.x with non-UTF8 encoded, multi-character separated data (:issue:`3404`)
+- Bug in ``pd.read_csv()``, where aliases for utf-xx (e.g. UTF-xx, UTF_xx, utf_xx) raised UnicodeDecodeError (:issue:`13549`)
 - Bug in ``pd.read_csv``, ``pd.read_table``, ``pd.read_fwf``, ``pd.read_stata`` and ``pd.read_sas`` where files were opened by parsers but not closed if both ``chunksize`` and ``iterator`` were ``None``. (:issue:`13940`)
 - Bug in ``StataReader``, ``StataWriter``, ``XportReader`` and ``SAS7BDATReader`` where a file was not properly closed when an error was raised. (:issue:`13940`)
 
@@ -1351,11 +1354,9 @@ Bug Fixes
 - Bug in operations on ``NaT`` returning ``float`` instead of ``datetime64[ns]`` (:issue:`12941`)
 - Bug in ``Series`` flexible arithmetic methods (like ``.add()``) raises ``ValueError`` when ``axis=None`` (:issue:`13894`)
 
-- Bug in ``pd.read_csv`` in Python 2.x with non-UTF8 encoded, multi-character separated data (:issue:`3404`)
 
 - Bug in ``Index`` raises ``KeyError`` displaying incorrect column when column is not in the df and columns contains duplicate values (:issue:`13822`)
 - Bug in ``Period`` and ``PeriodIndex`` creating wrong dates when frequency has combined offset aliases (:issue:`13874`)
 - Bug in ``.to_string()`` when called with an integer ``line_width`` and ``index=False`` raises an UnboundLocalError exception because ``idx`` referenced before assignment.
 
-- Bug in ``read_csv()``, where aliases for utf-xx (e.g. UTF-xx, UTF_xx, utf_xx) raised UnicodeDecodeError (:issue:`13549`)
 - Bug in ``eval()`` where the ``resolvers`` argument would not accept a list (:issue`14095`)

pandas/io/parsers.py

+10
@@ -5,6 +5,7 @@
 from collections import defaultdict
 import re
 import csv
+import sys
 import warnings
 import datetime
 
@@ -782,6 +783,7 @@ def _clean_options(self, options, engine):
                                   " skipfooter"
                 engine = 'python'
 
+        encoding = sys.getfilesystemencoding() or 'utf-8'
         if sep is None and not delim_whitespace:
             if engine == 'c':
                 fallback_reason = "the 'c' engine does not support"\
@@ -798,6 +800,14 @@ def _clean_options(self, options, engine):
                                   " different from '\s+' are"\
                                   " interpreted as regex)"
                 engine = 'python'
+
+        elif len(sep.encode(encoding)) > 1:
+            if engine not in ('python', 'python-fwf'):
+                fallback_reason = "the separator encoded in {encoding}"\
+                                  " is > 1 char long, and the 'c' engine"\
+                                  " does not support such separators".format(
+                                      encoding=encoding)
+                engine = 'python'
         elif delim_whitespace:
             if 'python' in engine:
                 result['delimiter'] = '\s+'
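
The new branch above only records a `fallback_reason` and switches the engine. The "warn or raise" behaviour named in the commit title happens elsewhere in `_clean_options`, outside the lines shown. A minimal standalone sketch of how the two pieces could fit together follows. It assumes that an explicitly requested engine leads to a `ValueError` (as the added test with `engine='c'` suggests) and that an unspecified engine leads to a warning. The function `choose_engine` and its `engine_specified` flag are hypothetical names, and `UserWarning` stands in for pandas' `ParserWarning` so that the sketch needs only the standard library.

```python
import warnings


def choose_engine(sep, engine="c", engine_specified=False, encoding="utf-8"):
    """Hypothetical stand-in for the encoded-separator check in this diff."""
    fallback_reason = None

    # Only the new branch is modelled: a single-character separator that
    # becomes longer than one byte once encoded.
    if sep is not None and len(sep) == 1 and len(sep.encode(encoding)) > 1:
        if engine not in ("python", "python-fwf"):
            fallback_reason = (
                "the separator encoded in {encoding} is > 1 char long, "
                "and the 'c' engine does not support such separators"
            ).format(encoding=encoding)
            engine = "python"

    if fallback_reason and engine_specified:
        # Assumption: an explicitly requested engine is not silently replaced.
        raise ValueError(fallback_reason)
    if fallback_reason:
        # Assumption: pandas would emit ParserWarning here.
        warnings.warn(
            "Falling back to the 'python' engine because " + fallback_reason,
            UserWarning,
        )
    return engine


# An ASCII separator keeps the requested engine.
print(choose_engine(","))

# With an explicit engine, the multi-byte separator is rejected.
try:
    choose_engine("\u00a7", engine_specified=True)
except ValueError as err:
    print("raised:", err)
```

Passing the encoding as a parameter is a choice made for this sketch. The commit reads it from `sys.getfilesystemencoding()`, so the real outcome depends on the system running the code.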

pandas/io/tests/parser/test_unsupported.py

+2
@@ -60,6 +60,8 @@ def test_c_engine(self):
                        sep=None, delim_whitespace=False)
         with tm.assertRaisesRegexp(ValueError, msg):
             read_table(StringIO(data), engine='c', sep='\s')
+        with tm.assertRaisesRegexp(ValueError, msg):
+            read_table(StringIO(data), engine='c', sep='§')
         with tm.assertRaisesRegexp(ValueError, msg):
             read_table(StringIO(data), engine='c', skipfooter=1)
 