BUG: Fix to_json lines with escaped characters #15117

Closed

Conversation

@rouzazari (Contributor) commented Jan 12, 2017

Updates the existing to_json line-conversion logic by adding an is_escaping variable, which ensures escaped characters are handled correctly.

Bug description: a simple check of whether the preceding character is a backslash is insufficient, because that backslash may itself be escaped.
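
The fix itself lives in the Cython helper convert_json_to_lines. Purely as an illustration (this is not the code in the PR, and json_array_to_lines is an invented name), the escape-aware scan can be sketched in plain Python:

def json_array_to_lines(buf):
    # Pure-Python sketch of the escape-aware scan; the real implementation
    # is the Cython function convert_json_to_lines.
    out = []
    in_quotes = False
    is_escaping = False
    num_open_brackets_seen = 0
    for ch in buf[1:-1]:                  # drop the outer '[' and ']'
        if in_quotes:
            if is_escaping:
                is_escaping = False       # this char is escaped, whatever it is
            elif ch == '\\':
                is_escaping = True        # the next char is escaped
            elif ch == '"':
                in_quotes = False         # an unescaped quote closes the string
        elif ch == '"':
            in_quotes = True
        elif ch == '{':
            num_open_brackets_seen += 1
        elif ch == '}':
            num_open_brackets_seen -= 1
        elif ch == ',' and num_open_brackets_seen == 0:
            out.append('\n')              # top-level comma becomes a record boundary
            continue
        out.append(ch)
    return ''.join(out)

With this, json_array_to_lines('[{"a":"x\\\\"},{"a":"y"}]') produces two lines even though the first value ends in a backslash, which a prior-character check would have mis-read as escaping the closing quote.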

A test is also included (previously included in #14693).

xref #14693
xref #15096

@codecov-io commented Jan 12, 2017

Current coverage is 84.75% (diff: 100%)

No coverage report found for master at ab0d236.

Powered by Codecov. Last update ab0d236...92f4e01

@@ -1099,6 +1099,7 @@ def convert_json_to_lines(object arr):
     """
     cdef:
         Py_ssize_t i = 0, num_open_brackets_seen = 0, in_quotes = 0, length
+        Py_ssize_t is_escaping = 0

Review comment:

make in_quotes and is_escaping of type bint

@jreback (Contributor) commented Jan 12, 2017

pls add a whatsnew note (0.20.0 bug fix section)

@jreback added the Bug and IO JSON (read_json, to_json, json_normalize) labels on Jan 12, 2017
@rouzazari force-pushed the to_json_lines_with_escaping branch from 60bb9da to 92f4e01 on January 13, 2017 00:51
@rouzazari (Contributor, Author) commented:

Thanks, @jreback. I made the changes you requested, rebased to master, and squashed the commits.

Please note: I revised the test to include a test for escaped characters in keys and values (i.e. columns and data of dataframe). Previously I was only testing escaped characters in values/data.

@@ -351,6 +351,7 @@ Bug Fixes
 - Require at least 0.23 version of cython to avoid problems with character encodings (:issue:`14699`)
 - Bug in converting object elements of array-like objects to unsigned 64-bit integers (:issue:`4471`, :issue:`14982`)
 - Bug in ``pd.pivot_table()`` where no error was raised when values argument was not in the columns (:issue:`14938`)
+- Bug in ``.to_json()`` where lines=True and contents (keys or values) contain escaped characters (:issue:`15096`)

Review comment:

use double-backticks around lines=True

@@ -972,6 +972,13 @@ def test_to_jsonl(self):
         self.assertEqual(result, expected)
         assert_frame_equal(pd.read_json(result, lines=True), df)

+        # GH15096

Review comment:

add a 1-line comment on what you are testing here
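
For reference, a standalone sketch of such a regression test (illustrative only; the test name here is invented, and the exact frame and assertions merged into test_to_jsonl may differ) could look like:

import pandas as pd
from pandas import DataFrame
from pandas.util.testing import assert_frame_equal  # location as of pandas 0.x


def test_to_json_lines_escaped_characters():
    # Escaped characters in both the columns (keys) and the data (values), GH15096.
    df = DataFrame([["foo\\", "bar"], ['foo"', "bar"]], columns=["a\\", "b"])
    result = df.to_json(orient="records", lines=True)
    expected = ('{"a\\\\":"foo\\\\","b":"bar"}\n'
                '{"a\\\\":"foo\\"","b":"bar"}')
    assert result == expected
    # Round-trip through read_json to confirm the lines parse back cleanly.
    assert_frame_equal(pd.read_json(result, lines=True), df)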

@jreback added this to the 0.20.0 milestone on Jan 13, 2017
@jreback (Contributor) commented Jan 13, 2017

lgtm, minor comments and some linting issues to fix. ping on green.

@rouzazari force-pushed the to_json_lines_with_escaping branch from 92f4e01 to d114455 on January 13, 2017 18:20, with the commit message:

Updates existing to_json methodology by adding is_escaping variable, which ensures escaped chars are handled correctly.

- Includes test for escaped characters in keys and values (i.e. columns and data).
- Includes bug fix in whatsnew
- Revised type of in_quotes and is_escaping to bint

xref pandas-dev#14693
xref pandas-dev#15096
@jreback closed this in 1a18420 on Jan 13, 2017
@jreback (Contributor) commented Jan 13, 2017

thanks! nice concise PR! keep em coming

@rouzazari (Contributor, Author) commented:

Thanks, @jreback. I appreciate the reviews and feedback.

FYI: There is no change in runtime performance from the bug fix. This is somewhat obvious, but it was a good exercise for me, so I thought I'd share. I reran the following performance tests from #14408:

Master:

In [5]: df = DataFrame(dict([('float{0}'.format(i), np.random.randn(N)) for i in range(C)]))
In [6]: %timeit df.to_json('foo.json',orient='records',lines=True)
10 loops, best of 3: 135 ms per loop

This branch:

In [5]: df = DataFrame(dict([('float{0}'.format(i), np.random.randn(N)) for i in range(C)]))
In [6]: %timeit df.to_json('foo.json',orient='records',lines=True)
10 loops, best of 3: 135 ms per loop
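
(N and C are defined in #14408 and are not shown above; for completeness, a self-contained variant with placeholder sizes, purely illustrative, would be:)

import timeit

import numpy as np
from pandas import DataFrame

# N and C are placeholders here; the values actually used come from pandas-dev#14408.
N, C = 100000, 5
df = DataFrame(dict([('float{0}'.format(i), np.random.randn(N)) for i in range(C)]))

# Rough stand-in for the %timeit line when running outside IPython.
print(timeit.timeit(
    lambda: df.to_json('foo.json', orient='records', lines=True),
    number=10))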

@jreback (Contributor) commented Jan 13, 2017

@rouzazari great!

note that we already have some asv's for perf testing, I am not sure how much they cover for json though, you can have a look in pandas/asv/benchmarks/packers.py

if you think some cases need more, then please PR!
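
As a rough illustration of the asv style (a hypothetical sketch, not a benchmark that exists in the repo; the class and method names are invented):

import numpy as np
from pandas import DataFrame


class ToJSONLines(object):
    # Hypothetical asv-style benchmark: asv times methods named time_*,
    # calling setup() beforehand.
    def setup(self):
        self.df = DataFrame({'float{0}'.format(i): np.random.randn(100000)
                             for i in range(5)})

    def time_to_json_lines(self):
        self.df.to_json(orient='records', lines=True)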

@rouzazari (Contributor, Author) commented:

Got it. I'm not too familiar with this area, but I'll take a look.

AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this pull request Mar 21, 2017
Updates existing to_json methodology by adding is_escaping variable,
which ensures escaped chars are handled correctly.

xref pandas-dev#14693

closes pandas-dev#15096

Author: Rouz Azari <[email protected]>

Closes pandas-dev#15117 from rouzazari/to_json_lines_with_escaping and squashes the following commits:

d114455 [Rouz Azari] BUG: Fix to_json lines with escaped characters
Labels: Bug, IO JSON (read_json, to_json, json_normalize)

Linked issue: to_json() line separation broken by backslash in content