
Commit e5ee5d2

gfyoung authored and jreback committed
BUG: Ignore the BOM in BOM UTF-8 CSV files
closes pandas-dev#4793 closes pandas-dev#13855
1 parent c8e7863 commit e5ee5d2

File tree: 4 files changed, +186 -51 lines

doc/source/whatsnew/v0.19.0.txt (+51 -50)
@@ -43,8 +43,8 @@ The following are now part of this API:
 
 .. _whatsnew_0190.enhancements.asof_merge:
 
-:func:`merge_asof` for asof-style time-series joining
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+``merge_asof`` for asof-style time-series joining
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 A long-time requested feature has been added through the :func:`merge_asof` function, to
 support asof style joining of time-series. (:issue:`1870`, :issue:`13695`, :issue:`13709`). Full documentation is
@@ -192,8 +192,8 @@ default of the index) in a DataFrame.
 
 .. _whatsnew_0190.enhancements.read_csv_dupe_col_names_support:
 
-:func:`read_csv` has improved support for duplicate column names
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+``read_csv`` has improved support for duplicate column names
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 :ref:`Duplicate column names <io.dupe_names>` are now supported in :func:`read_csv` whether
 they are in the file or passed in as the ``names`` parameter (:issue:`7160`, :issue:`9424`)
@@ -307,48 +307,6 @@ Google BigQuery Enhancements
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 - The :func:`pandas.io.gbq.read_gbq` method has gained the ``dialect`` argument to allow users to specify whether to use BigQuery's legacy SQL or BigQuery's standard SQL. See the :ref:`docs <io.bigquery_reader>` for more details (:issue:`13615`).
 
-.. _whatsnew_0190.sparse:
-
-Sparse changes
-~~~~~~~~~~~~~~
-
-These changes allow pandas to handle sparse data with more dtypes, and for work to make a smoother experience with data handling.
-
-- Sparse data structure now can preserve ``dtype`` after arithmetic ops (:issue:`13848`)
-
-  .. ipython:: python
-
-     s = pd.SparseSeries([0, 2, 0, 1], fill_value=0, dtype=np.int64)
-     s.dtype
-
-     s + 1
-
-- Sparse data structure now support ``astype`` to convert internal ``dtype`` (:issue:`13900`)
-
-  .. ipython:: python
-
-     s = pd.SparseSeries([1., 0., 2., 0.], fill_value=0)
-     s
-     s.astype(np.int64)
-
-  ``astype`` fails if data contains values which cannot be converted to specified ``dtype``.
-  Note that the limitation is applied to ``fill_value`` which default is ``np.nan``.
-
-  .. code-block:: ipython
-
-     In [7]: pd.SparseSeries([1., np.nan, 2., np.nan], fill_value=np.nan).astype(np.int64)
-     Out[7]:
-     ValueError: unable to coerce current fill_value nan to int64 dtype
-
-- Subclassed ``SparseDataFrame`` and ``SparseSeries`` now preserve class types when slicing or transposing. (:issue:`13787`)
-- Bug in ``SparseSeries`` with ``MultiIndex`` ``[]`` indexing may raise ``IndexError`` (:issue:`13144`)
-- Bug in ``SparseSeries`` with ``MultiIndex`` ``[]`` indexing result may have normal ``Index`` (:issue:`13144`)
-- Bug in ``SparseDataFrame`` in which ``axis=None`` did not default to ``axis=0`` (:issue:`13048`)
-- Bug in ``SparseSeries`` and ``SparseDataFrame`` creation with ``object`` dtype may raise ``TypeError`` (:issue:`11633`)
-- Bug in ``SparseDataFrame`` doesn't respect passed ``SparseArray`` or ``SparseSeries`` 's dtype and ``fill_value`` (:issue:`13866`)
-- Bug in ``SparseArray`` and ``SparseSeries`` don't apply ufunc to ``fill_value`` (:issue:`13853`)
-- Bug in ``SparseSeries.abs`` incorrectly keeps negative ``fill_value`` (:issue:`13853`)
-
 .. _whatsnew_0190.enhancements.other:
 
 Other enhancements
@@ -684,8 +642,8 @@ New Behavior:
 
 .. _whatsnew_0190.api.autogenerated_chunksize_index:
 
-:func:`read_csv` called with ``chunksize`` will progressively enumerate chunks
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+``read_csv`` will progressively enumerate chunks
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 When :func:`read_csv` is called with ``chunksize='n'`` and without specifying an index,
 each chunk used to have an independently generated index from ``0`` to ``n-1``.
@@ -716,10 +674,52 @@ New behaviour:
 
     pd.concat(pd.read_csv(StringIO(data), chunksize=2))
 
+.. _whatsnew_0190.sparse:
+
+Sparse Changes
+^^^^^^^^^^^^^^
+
+These changes allow pandas to handle sparse data with more dtypes, and for work to make a smoother experience with data handling.
+
+- Sparse data structure now can preserve ``dtype`` after arithmetic ops (:issue:`13848`)
+
+  .. ipython:: python
+
+     s = pd.SparseSeries([0, 2, 0, 1], fill_value=0, dtype=np.int64)
+     s.dtype
+
+     s + 1
+
+- Sparse data structure now support ``astype`` to convert internal ``dtype`` (:issue:`13900`)
+
+  .. ipython:: python
+
+     s = pd.SparseSeries([1., 0., 2., 0.], fill_value=0)
+     s
+     s.astype(np.int64)
+
+  ``astype`` fails if data contains values which cannot be converted to specified ``dtype``.
+  Note that the limitation is applied to ``fill_value`` which default is ``np.nan``.
+
+  .. code-block:: ipython
+
+     In [7]: pd.SparseSeries([1., np.nan, 2., np.nan], fill_value=np.nan).astype(np.int64)
+     Out[7]:
+     ValueError: unable to coerce current fill_value nan to int64 dtype
+
+- Subclassed ``SparseDataFrame`` and ``SparseSeries`` now preserve class types when slicing or transposing. (:issue:`13787`)
+- Bug in ``SparseSeries`` with ``MultiIndex`` ``[]`` indexing may raise ``IndexError`` (:issue:`13144`)
+- Bug in ``SparseSeries`` with ``MultiIndex`` ``[]`` indexing result may have normal ``Index`` (:issue:`13144`)
+- Bug in ``SparseDataFrame`` in which ``axis=None`` did not default to ``axis=0`` (:issue:`13048`)
+- Bug in ``SparseSeries`` and ``SparseDataFrame`` creation with ``object`` dtype may raise ``TypeError`` (:issue:`11633`)
+- Bug in ``SparseDataFrame`` doesn't respect passed ``SparseArray`` or ``SparseSeries`` 's dtype and ``fill_value`` (:issue:`13866`)
+- Bug in ``SparseArray`` and ``SparseSeries`` don't apply ufunc to ``fill_value`` (:issue:`13853`)
+- Bug in ``SparseSeries.abs`` incorrectly keeps negative ``fill_value`` (:issue:`13853`)
+
 .. _whatsnew_0190.deprecations:
 
 Deprecations
-^^^^^^^^^^^^
+~~~~~~~~~~~~
 - ``Categorical.reshape`` has been deprecated and will be removed in a subsequent release (:issue:`12882`)
 - ``Series.reshape`` has been deprecated and will be removed in a subsequent release (:issue:`12882`)
 
@@ -738,7 +738,7 @@ Deprecations
 .. _whatsnew_0190.prior_deprecations:
 
 Removal of prior version deprecations/changes
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 - The ``SparsePanel`` class has been removed (:issue:`13778`)
 - The ``pd.sandbox`` module has been removed in favor of the external library ``pandas-qt`` (:issue:`13670`)
 - The ``pandas.io.data`` and ``pandas.io.wb`` modules are removed in favor of
@@ -797,6 +797,7 @@ Bug Fixes
 
 - Bug in ``groupby().shift()``, which could cause a segfault or corruption in rare circumstances when grouping by columns with missing values (:issue:`13813`)
 - Bug in ``pd.read_csv()``, which may cause a segfault or corruption when iterating in large chunks over a stream/file under rare circumstances (:issue:`13703`)
+- Bug in ``pd.read_csv()``, which caused BOM files to be incorrectly parsed by not ignoring the BOM (:issue:`4793`)
 - Bug in ``io.json.json_normalize()``, where non-ascii keys raised an exception (:issue:`13213`)
 - Bug when passing a not-default-indexed ``Series`` as ``xerr`` or ``yerr`` in ``.plot()`` (:issue:`11858`)
 - Bug in matplotlib ``AutoDataFormatter``; this restores the second scaled formatting and re-adds micro-second scaled formatting (:issue:`13131`)
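The bug-fix entry above is easy to reproduce without pandas at all: when a UTF-8 BOM survives decoding, it surfaces as U+FEFF fused onto the first header name, silently renaming the column. A minimal stdlib sketch of the failure mode this commit fixes (plain `csv`, not the pandas parser):

```python
import csv
import io

# A CSV file written with a UTF-8 BOM, as some Windows tools do.
raw = b'\xef\xbb\xbfa\n1'

# Naive utf-8 decoding leaves the BOM attached to the first header field.
header = next(csv.reader(io.StringIO(raw.decode('utf-8'))))
print(header)  # ['\ufeffa'] -- the column is no longer named 'a'

# Decoding with 'utf-8-sig' strips the BOM up front.
header = next(csv.reader(io.StringIO(raw.decode('utf-8-sig'))))
print(header)  # ['a']
```

Before this fix, users could work around the bug by passing `encoding='utf-8-sig'`; the commit makes plain `utf-8` handle BOM files correctly as well.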

pandas/io/parsers.py (+75 -1)
@@ -11,7 +11,8 @@
 import numpy as np
 
 from pandas import compat
-from pandas.compat import range, lrange, StringIO, lzip, zip, string_types, map
+from pandas.compat import (range, lrange, StringIO, lzip,
+                           zip, string_types, map, u)
 from pandas.types.common import (is_integer, _ensure_object,
                                  is_list_like, is_integer_dtype,
                                  is_float,
@@ -40,6 +41,12 @@
     'N/A', 'NA', '#NA', 'NULL', 'NaN', '-NaN', 'nan', '-nan', ''
 ])
 
+# BOM character (byte order mark)
+# This exists at the beginning of a file to indicate endianness
+# of a file (stream). Unfortunately, this marker screws up parsing,
+# so we need to remove it if we see it.
+_BOM = u('\ufeff')
+
 _parser_params = """Also supports optionally iterating or breaking of the file
 into chunks.
 
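As a sanity check on the ``_BOM`` constant introduced above (pure stdlib, independent of pandas): U+FEFF encodes to exactly the three bytes that the C tokenizer strips.

```python
# The BOM code point and its UTF-8 encoding.
bom = '\ufeff'
encoded = bom.encode('utf-8')
assert encoded == b'\xef\xbb\xbf'

# Round-tripping recovers the single code point, which is why the
# pure-Python parser can compare decoded text against u('\ufeff')
# while the C tokenizer compares raw bytes EF BB BF.
assert encoded.decode('utf-8') == bom
```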
@@ -2161,6 +2168,67 @@ def _buffered_line(self):
         else:
             return self._next_line()
 
+    def _check_for_bom(self, first_row):
+        """
+        Checks whether the file begins with the BOM character.
+        If it does, remove it. In addition, if there is quoting
+        in the field subsequent to the BOM, remove it as well
+        because it technically takes place at the beginning of
+        the name, not the middle of it.
+        """
+        # first_row will be a list, so we need to check
+        # that that list is not empty before proceeding.
+        if not first_row:
+            return first_row
+
+        # The first element of this row is the one that could have the
+        # BOM that we want to remove. Check that the first element is a
+        # string before proceeding.
+        if not isinstance(first_row[0], compat.string_types):
+            return first_row
+
+        # Check that the string is not empty, as that would
+        # obviously not have a BOM at the start of it.
+        if not first_row[0]:
+            return first_row
+
+        # Since the string is non-empty, check that it does
+        # in fact begin with a BOM.
+        first_elt = first_row[0][0]
+
+        # This is to avoid warnings we get in Python 2.x if
+        # we find ourselves comparing with non-Unicode
+        if compat.PY2 and not isinstance(first_elt, unicode):  # noqa
+            try:
+                first_elt = u(first_elt)
+            except UnicodeDecodeError:
+                return first_row
+
+        if first_elt != _BOM:
+            return first_row
+
+        first_row = first_row[0]
+
+        if len(first_row) > 1 and first_row[1] == self.quotechar:
+            start = 2
+            quote = first_row[1]
+            end = first_row[2:].index(quote) + 2
+
+            # Extract the data between the quotation marks
+            new_row = first_row[start:end]
+
+            # Extract any remaining data after the second
+            # quotation mark.
+            if len(first_row) > end + 1:
+                new_row += first_row[end + 1:]
+            return [new_row]
+        elif len(first_row) > 1:
+            return [first_row[1:]]
+        else:
+            # First row is just the BOM, so we
+            # return an empty string.
+            return [""]
+
     def _empty(self, line):
         return not line or all(not x for x in line)
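Stripped of the Python 2 compatibility shims, the heart of ``_check_for_bom`` can be sketched in modern Python. Note this is an illustrative standalone function, not the pandas API: the name ``strip_bom`` and the decision to preserve trailing fields are assumptions of this sketch, and it omits the ``compat.string_types`` handling above.

```python
BOM = '\ufeff'

def strip_bom(row, quotechar='"'):
    """Drop a leading BOM from the first field of a parsed row; if the
    BOM sits just before an opening quote, re-extract the quoted value."""
    if not row or not isinstance(row[0], str) or not row[0]:
        return row
    field = row[0]
    if field[0] != BOM:
        return row
    field = field[1:]  # strip the BOM itself
    if field.startswith(quotechar):
        close = field.find(quotechar, 1)
        if close != -1:
            # Keep the value between the quotes, plus anything
            # that follows the closing quote.
            return [field[1:close] + field[close + 1:]] + row[1:]
    return ([field] if field else ['']) + row[1:]
```

Exercising the same cases the docstring describes: `strip_bom(['\ufeffa'])` gives `['a']`, `strip_bom(['\ufeff"a"'])` also gives `['a']` (the quote is unwrapped because it starts the name), and a row containing only the BOM collapses to `['']`.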

@@ -2212,6 +2280,12 @@ def _next_line(self):
                     line = ret[0]
                     break
 
+        # This was the first line of the file,
+        # which could contain the BOM at the
+        # beginning of it.
+        if self.pos == 1:
+            line = self._check_for_bom(line)
+
         self.line_pos += 1
         self.buf.append(line)
         return line

pandas/io/tests/parser/common.py (+51)
@@ -1517,3 +1517,54 @@ def test_null_byte_char(self):
         msg = "NULL byte detected"
         with tm.assertRaisesRegexp(csv.Error, msg):
             self.read_csv(StringIO(data), names=cols)
+
+    def test_utf8_bom(self):
+        # see gh-4793
+        bom = u('\ufeff')
+        utf8 = 'utf-8'
+
+        def _encode_data_with_bom(_data):
+            bom_data = (bom + _data).encode(utf8)
+            return BytesIO(bom_data)
+
+        # basic test
+        data = 'a\n1'
+        expected = DataFrame({'a': [1]})
+
+        out = self.read_csv(_encode_data_with_bom(data),
+                            encoding=utf8)
+        tm.assert_frame_equal(out, expected)
+
+        # test with "regular" quoting
+        data = '"a"\n1'
+        expected = DataFrame({'a': [1]})
+
+        out = self.read_csv(_encode_data_with_bom(data),
+                            encoding=utf8, quotechar='"')
+        tm.assert_frame_equal(out, expected)
+
+        # test in a data row instead of header
+        data = 'b\n1'
+        expected = DataFrame({'a': ['b', '1']})
+
+        out = self.read_csv(_encode_data_with_bom(data),
+                            encoding=utf8, names=['a'])
+        tm.assert_frame_equal(out, expected)
+
+        # test in empty data row with skipping
+        data = '\n1'
+        expected = DataFrame({'a': [1]})
+
+        out = self.read_csv(_encode_data_with_bom(data),
+                            encoding=utf8, names=['a'],
+                            skip_blank_lines=True)
+        tm.assert_frame_equal(out, expected)
+
+        # test in empty data row without skipping
+        data = '\n1'
+        expected = DataFrame({'a': [np.nan, 1.0]})
+
+        out = self.read_csv(_encode_data_with_bom(data),
+                            encoding=utf8, names=['a'],
+                            skip_blank_lines=False)
+        tm.assert_frame_equal(out, expected)

pandas/src/parser/tokenizer.c (+9)
@@ -704,6 +704,11 @@ static int parser_buffer_bytes(parser_t *self, size_t nbytes) {
     self->datapos = i; \
     TRACE(("_TOKEN_CLEANUP: datapos: %d, datalen: %d\n", self->datapos, self->datalen));
 
+#define CHECK_FOR_BOM() \
+    if (*buf == '\xef' && *(buf + 1) == '\xbb' && *(buf + 2) == '\xbf') { \
+        buf += 3; \
+        self->datapos += 3; \
+    }
 
 int skip_this_line(parser_t *self, int64_t rownum) {
     if (self->skipset != NULL) {
@@ -736,6 +741,10 @@ int tokenize_bytes(parser_t *self, size_t line_limit)
 
     TRACE(("%s\n", buf));
 
+    if (self->file_lines == 0) {
+        CHECK_FOR_BOM();
+    }
+
     for (i = self->datapos; i < self->datalen; ++i)
     {
         // next character in file
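The C macro above performs the same check one layer down, on the raw bytes before any decoding, and only for the first line of the file (`self->file_lines == 0`). In Python terms, an equivalent sketch of the byte-level behavior (`skip_bom_bytes` is an illustrative name, not part of the tokenizer; note that unlike the macro's pointer dereferences, the slice comparison below is trivially safe on buffers shorter than three bytes):

```python
UTF8_BOM = b'\xef\xbb\xbf'

def skip_bom_bytes(buf):
    """Return buf with a leading UTF-8 BOM removed, mirroring what
    CHECK_FOR_BOM does to the tokenizer's buffer on the first read."""
    if buf[:3] == UTF8_BOM:
        return buf[3:]
    return buf
```

For example, `skip_bom_bytes(b'\xef\xbb\xbfa\n1')` yields `b'a\n1'`, while BOM-less input passes through unchanged.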
