Commit 152b685

API: Warn or raise for > 1 char encoded sep
The system file encoding can cause a separator to be encoded as more than one character even though it may be provided as one character. Multi-char separators are not supported by the C engine, so we need to catch this case. Closes gh-14065.
1 parent 10bf721 commit 152b685
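
The situation described in the commit message can be reproduced outside pandas. The following is a minimal sketch, not part of the commit: `encoded_length` is a hypothetical helper, and the encodings are passed explicitly for illustration, whereas the commit itself relies on `sys.getfilesystemencoding()`.

```python
# Illustrative only: a separator that is a single character as text
# may occupy several bytes once it is encoded.
def encoded_length(sep, encoding):
    """Return the number of bytes `sep` occupies in `encoding`."""
    return len(sep.encode(encoding))


sep = u'\u00a7'  # the section sign, as used in the commit's test

print(encoded_length(sep, 'utf-8'))    # 2 bytes in UTF-8
print(encoded_length(sep, 'latin-1'))  # 1 byte in Latin-1
print(encoded_length(u',', 'utf-8'))   # 1 byte: plain ASCII is unaffected
```

Whether a given separator trips the new check therefore depends on the encoding in effect, not only on the separator itself.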

3 files changed, 13 additions and 0 deletions
doc/source/whatsnew/v0.19.0.txt

+1
@@ -457,6 +457,7 @@ API changes
 - ``pd.Timedelta(None)`` is now accepted and will return ``NaT``, mirroring ``pd.Timestamp`` (:issue:`13687`)
 - ``Timestamp``, ``Period``, ``DatetimeIndex``, ``PeriodIndex`` and ``.dt`` accessor have gained a ``.is_leap_year`` property to check whether the date belongs to a leap year. (:issue:`13727`)
 - ``pd.read_hdf`` will now raise a ``ValueError`` instead of ``KeyError``, if a mode other than ``r``, ``r+`` and ``a`` is supplied. (:issue:`13623`)
+- ``pd.read_csv()`` in the C engine will now issue a ``ParserWarning`` or raise a ``ValueError`` when ``sep`` encoded is more than one character long (:issue:`14065`)
 - ``DataFrame.values`` will now return ``float64`` with a ``DataFrame`` of mixed ``int64`` and ``uint64`` dtypes, conforming to ``np.find_common_type`` (:issue:`10364`, :issue:`13917`)
 - ``Series.unique()`` with datetime and timezone now returns return array of ``Timestamp`` with timezone (:issue:`13565`)

pandas/io/parsers.py

+10
@@ -5,6 +5,7 @@
 from collections import defaultdict
 import re
 import csv
+import sys
 import warnings
 import datetime

@@ -782,6 +783,7 @@ def _clean_options(self, options, engine):
                                   " skipfooter"
                 engine = 'python'

+        encoding = sys.getfilesystemencoding() or 'utf-8'
         if sep is None and not delim_whitespace:
             if engine == 'c':
                 fallback_reason = "the 'c' engine does not support"\
@@ -798,6 +800,14 @@ def _clean_options(self, options, engine):
                                   " different from '\s+' are"\
                                   " interpreted as regex)"
                 engine = 'python'
+
+        elif len(sep.encode(encoding)) > 1:
+            if engine not in ('python', 'python-fwf'):
+                fallback_reason = "the separator encoded in {encoding}"\
+                                  " is > 1 char long, and the 'c' engine"\
+                                  " does not support such separators".format(
+                                      encoding=encoding)
+                engine = 'python'
         elif delim_whitespace:
             if 'python' in engine:
                 result['delimiter'] = '\s+'
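
The check added to `_clean_options` can be read in isolation as the sketch below. It is a simplified illustration, not pandas code: `resolve_engine` and its `engine_specified` parameter are hypothetical names, the other branches of the real `if`/`elif` chain are omitted, and `UserWarning` stands in for pandas' `ParserWarning` to keep the example self-contained. The warn-versus-raise split is an assumption drawn from the whatsnew entry and from the test in this commit, which passes `engine='c'` explicitly and expects a `ValueError`.

```python
import sys
import warnings


def resolve_engine(sep, engine='c', engine_specified=False, encoding=None):
    """Hypothetical helper: choose a parser engine for a separator."""
    # Same default as the commit: filesystem encoding, else UTF-8.
    encoding = encoding or sys.getfilesystemencoding() or 'utf-8'
    fallback_reason = None

    # Mirrors the new branch: a separator longer than one unit once
    # encoded cannot be handled by the C engine.
    if (len(sep.encode(encoding)) > 1 and
            engine not in ('python', 'python-fwf')):
        fallback_reason = ("the separator encoded in {encoding}"
                           " is > 1 char long, and the 'c' engine"
                           " does not support such separators".format(
                               encoding=encoding))
        engine = 'python'

    if fallback_reason:
        if engine_specified:
            # Assumed: an explicit engine request that cannot be honoured
            # is an error.
            raise ValueError(fallback_reason)
        # Assumed: otherwise warn and fall back to the Python engine.
        warnings.warn("Falling back to the 'python' engine because " +
                      fallback_reason, UserWarning)
    return engine
```

Under these assumptions, `resolve_engine(u',', encoding='utf-8')` returns `'c'`, while `resolve_engine(u'\u00a7', encoding='utf-8')` warns and returns `'python'`, and the same call with `engine_specified=True` raises `ValueError`.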

pandas/io/tests/parser/test_unsupported.py

+2
@@ -60,6 +60,8 @@ def test_c_engine(self):
                        sep=None, delim_whitespace=False)
         with tm.assertRaisesRegexp(ValueError, msg):
             read_table(StringIO(data), engine='c', sep='\s')
+        with tm.assertRaisesRegexp(ValueError, msg):
+            read_table(StringIO(data), engine='c', sep='§')
         with tm.assertRaisesRegexp(ValueError, msg):
             read_table(StringIO(data), engine='c', skipfooter=1)

0 commit comments
