Commit 2b5058e

bashtag authored and jreback committed
BUG: Fix GSO values when writing latin-1 strLs (#24337)
The size calculation of the string is incorrect when writing characters that are encoded differently in latin-1 and utf-8. The utf-8 size needs to be written instead of the latin-1 size.
1 parent 380eaf1 commit 2b5058e
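
To illustrate the mismatch described in the commit message, here is a minimal Python sketch comparing the two encoded lengths. The sample string is an arbitrary illustration and is not taken from the pandas source.

```python
# Characters that are non-ASCII but representable in latin-1 take one byte
# in latin-1 and two bytes in UTF-8, so the two encoded lengths disagree.
text = "þâÑÐ"  # illustrative sample string (assumption, not from pandas)

latin1_len = len(text.encode("latin-1"))
utf8_len = len(text.encode("utf-8"))

print(latin1_len, utf8_len)  # 4 8
```

If the payload is written as UTF-8 but the length field is computed from the latin-1 encoding, the declared size is too small for any such string.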

3 files changed, +25 -5 lines changed


doc/source/whatsnew/v0.24.0.rst

+1
@@ -1576,6 +1576,7 @@ Notice how we now instead output ``np.nan`` itself instead of a stringified form
 - :func:`DataFrame.to_string()`, :func:`DataFrame.to_html()`, :func:`DataFrame.to_latex()` will correctly format output when a string is passed as the ``float_format`` argument (:issue:`21625`, :issue:`22270`)
 - Bug in :func:`read_csv` that caused it to raise ``OverflowError`` when trying to use 'inf' as ``na_value`` with integer index column (:issue:`17128`)
 - Bug in :func:`pandas.io.json.json_normalize` that caused it to raise ``TypeError`` when two consecutive elements of ``record_path`` are dicts (:issue:`22706`)
+- Bug in :meth:`DataFrame.to_stata` and :class:`pandas.io.stata.StataWriter117` that produced invalid files when using strLs with non-ASCII characters (:issue:`23573`)
 
 Plotting
 ^^^^^^^^

pandas/io/stata.py

+3 -4
@@ -2643,12 +2643,11 @@ def generate_blob(self, gso_table):
             bio.write(gso_type)
 
             # llll
-            encoded = self._encode(strl)
-            bio.write(struct.pack(len_type, len(encoded) + 1))
+            utf8_string = _bytes(strl, 'utf-8')
+            bio.write(struct.pack(len_type, len(utf8_string) + 1))
 
             # xxx...xxx
-            s = _bytes(strl, 'utf-8')
-            bio.write(s)
+            bio.write(utf8_string)
             bio.write(null)
 
         bio.seek(0)
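
The corrected write order (length computed from the UTF-8 bytes, then the same bytes written, then a null terminator) can be sketched in isolation. This is a simplified, hypothetical record, not the full GSO layout: it keeps only a one-byte tag, a length field, the payload, and the terminator. The names `write_record` and `read_record`, the little-endian byte order, and the 4-byte length width are assumptions made for illustration.

```python
import struct
from io import BytesIO

TYPE_TAG = 0x82  # the byte the accompanying test searches for; treated as opaque here


def write_record(text):
    """Write tag, 4-byte length, UTF-8 payload, and a null terminator."""
    payload = text.encode("utf-8")
    bio = BytesIO()
    bio.write(struct.pack("<B", TYPE_TAG))
    # The declared length must match the bytes actually written (+1 for the null).
    bio.write(struct.pack("<I", len(payload) + 1))
    bio.write(payload)
    bio.write(b"\x00")
    return bio.getvalue()


def read_record(record):
    """Read the declared length, then slice exactly that many bytes after it."""
    (declared,) = struct.unpack_from("<I", record, 1)
    body = record[5:5 + declared]
    return declared, body


record = write_record("þâÑÐ")
declared, body = read_record(record)
print(declared, len(record))  # 9 14
```

Encoding once and reusing the result for both the length and the payload, as the patch does with `utf8_string`, removes the possibility of the two getting out of sync.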

pandas/tests/io/test_stata.py

+21 -1
@@ -16,7 +16,7 @@
 import pandas as pd
 import pandas.util.testing as tm
 import pandas.compat as compat
-from pandas.compat import iterkeys
+from pandas.compat import iterkeys, PY3
 from pandas.core.dtypes.common import is_categorical_dtype
 from pandas.core.frame import DataFrame, Series
 from pandas.io.parsers import read_csv
@@ -1546,3 +1546,23 @@ def test_all_none_exception(self, version):
                 output.to_stata(path, version=version)
             assert 'Only string-like' in excinfo.value.args[0]
             assert 'Column `none`' in excinfo.value.args[0]
+
+    def test_strl_latin1(self):
+        # GH 23573, correct GSO data to reflect correct size
+        output = DataFrame([[u'pandas'] * 2, [u'þâÑÐŧ'] * 2],
+                           columns=['var_str', 'var_strl'])
+
+        with tm.ensure_clean() as path:
+            output.to_stata(path, version=117, convert_strl=['var_strl'])
+            with open(path, 'rb') as reread:
+                content = reread.read()
+                expected = u'þâÑÐŧ'
+                assert expected.encode('latin-1') in content
+                assert expected.encode('utf-8') in content
+                gsos = content.split(b'strls')[1][1:-2]
+                for gso in gsos.split(b'GSO')[1:]:
+                    val = gso.split(b'\x00')[-2]
+                    size = gso[gso.find(b'\x82') + 1]
+                    if not PY3:
+                        size = ord(size)
+                    assert len(val) == size - 1
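
The size check at the end of the test can be tried on synthetic bytes without writing a Stata file. The sketch below is Python 3 only, where indexing `bytes` yields an `int`. The header bytes, the helper name `declared_and_actual`, and the little-endian 4-byte length are assumptions for illustration. Like the test, it reads only the first byte after the `0x82` tag, which is enough for short strings.

```python
import struct


def declared_and_actual(gso):
    """Return (declared size, payload length) using the same slicing as the test."""
    val = gso.split(b"\x00")[-2]       # payload sits just before the final null
    size = gso[gso.find(b"\x82") + 1]  # first length byte; an int on Python 3
    return size, len(val)


payload = "þâÑÐ".encode("utf-8")             # 8 bytes
header = b"\x01\x00\x00\x00\x01\x00\x00\x00"  # hypothetical leading fields

# Length derived from the UTF-8 payload: declared size equals payload + null.
good = header + b"\x82" + struct.pack("<I", len(payload) + 1) + payload + b"\x00"
# Length derived from the latin-1 encoding (4 bytes): declared size is too small.
bad = header + b"\x82" + struct.pack("<I", 4 + 1) + payload + b"\x00"

print(declared_and_actual(good))  # (9, 8)
print(declared_and_actual(bad))   # (5, 8)
```

Only the first record satisfies `len(val) == size - 1`, which is the condition the test asserts for every GSO entry.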
