BUG: Fix to_json lines with escaped characters #15117

Closed
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.20.0.txt
@@ -351,6 +351,7 @@ Bug Fixes
- Require at least 0.23 version of cython to avoid problems with character encodings (:issue:`14699`)
- Bug in converting object elements of array-like objects to unsigned 64-bit integers (:issue:`4471`, :issue:`14982`)
- Bug in ``pd.pivot_table()`` where no error was raised when values argument was not in the columns (:issue:`14938`)
+ - Bug in ``.to_json()`` where ``lines=True`` and contents (keys or values) contain escaped characters (:issue:`15096`)

- Bug in ``DataFrame.groupby().describe()`` when grouping on ``Index`` containing tuples (:issue:`14848`)
- Bug in creating a ``MultiIndex`` with tuples and not passing a list of names; this will now raise ``ValueError`` (:issue:`15110`)
9 changes: 9 additions & 0 deletions pandas/io/tests/json/test_pandas.py
@@ -972,6 +972,15 @@ def test_to_jsonl(self):
self.assertEqual(result, expected)
assert_frame_equal(pd.read_json(result, lines=True), df)

+ # GH15096: escaped characters in columns and data
+ df = DataFrame([["foo\\", "bar"], ['foo"', "bar"]],
+ columns=["a\\", 'b'])
+ result = df.to_json(orient="records", lines=True)
+ expected = ('{"a\\\\":"foo\\\\","b":"bar"}\n'
+ '{"a\\\\":"foo\\"","b":"bar"}')
+ self.assertEqual(result, expected)
+ assert_frame_equal(pd.read_json(result, lines=True), df)
+
def test_latin_encoding(self):
if compat.PY2:
self.assertRaisesRegexp(
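As a quick sanity check on the `expected` string in the GH15096 test, the sketch below parses each line with the standard-library `json` module instead of pandas. It only illustrates that every line is a standalone JSON object whose keys and values round-trip the backslash and quote characters. The variable name `records` is introduced here for illustration and is not part of the PR.

```python
import json

# Expected output copied from the GH15096 test case.
expected = ('{"a\\\\":"foo\\\\","b":"bar"}\n'
            '{"a\\\\":"foo\\"","b":"bar"}')

# With lines=True, each line should be one self-contained JSON object.
records = [json.loads(line) for line in expected.split("\n")]

for record in records:
    print(record)
```

The first record should carry a trailing backslash in both the key `a\` and its value, and the second a literal double quote in the value, matching the `DataFrame` built in the test.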
7 changes: 5 additions & 2 deletions pandas/lib.pyx
@@ -1098,7 +1098,8 @@ def convert_json_to_lines(object arr):
to quotes & brackets
"""
cdef:
- Py_ssize_t i = 0, num_open_brackets_seen = 0, in_quotes = 0, length
+ Py_ssize_t i = 0, num_open_brackets_seen = 0, length
+ bint in_quotes = 0, is_escaping = 0
ndarray[uint8_t] narr
unsigned char v, comma, left_bracket, right_brack, newline

@@ -1113,8 +1114,10 @@ def convert_json_to_lines(object arr):
length = narr.shape[0]
for i in range(length):
v = narr[i]
- if v == quote and i > 0 and narr[i - 1] != backslash:
+ if v == quote and i > 0 and not is_escaping:
in_quotes = ~in_quotes
+ if v == backslash or is_escaping:
+ is_escaping = ~is_escaping
if v == comma: # commas that should be \n
if num_open_brackets_seen == 0 and not in_quotes:
narr[i] = newline
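The escape tracking in this hunk can be illustrated with a simplified pure-Python sketch. It is not the Cython implementation. The function name `convert_json_to_lines_sketch` is hypothetical, the brace-depth handling is an assumption about the part of the function not shown in the diff, and the input is assumed to be a records-oriented JSON string with its outer `[` and `]` already stripped. The point it shows: checking only the previous byte misreads `\\"` (an escaped backslash followed by a closing quote), whereas an `is_escaping` flag that flips on every backslash and resets after the escaped character handles it.

```python
def convert_json_to_lines_sketch(text):
    """Replace top-level commas with newlines (illustrative sketch only)."""
    out = list(text)
    depth = 0            # assumed: number of unclosed '{' seen so far
    in_quotes = False
    is_escaping = False  # True when the current character is escaped
    for i, ch in enumerate(out):
        # An unescaped quote opens or closes a string.
        if ch == '"' and not is_escaping:
            in_quotes = not in_quotes
        # A backslash starts an escape; the character after it ends it.
        if ch == '\\' or is_escaping:
            is_escaping = not is_escaping
        if ch == ',':
            # Only commas between top-level objects become newlines.
            if depth == 0 and not in_quotes:
                out[i] = '\n'
        elif ch == '{' and not in_quotes:
            depth += 1
        elif ch == '}' and not in_quotes:
            depth -= 1
    return ''.join(out)


joined = '{"a\\\\":"foo\\\\","b":"bar"},{"a\\\\":"foo\\"","b":"bar"}'
print(convert_json_to_lines_sketch(joined))
```

On this input, only the comma between the two objects is replaced. The commas inside each object stay put because the depth is nonzero there.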