BUG: Fix to_json lines with escaped characters #15117

Closed

Conversation

@rouzazari (Contributor) commented Jan 12, 2017

Updates the existing to_json line-conversion logic by adding an is_escaping variable, which ensures escaped characters are handled correctly.

Bug description: a simple check of whether the preceding character is a backslash is insufficient, because that backslash may itself be escaped.
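
The fix itself lives in the Cython helper convert_json_to_lines. Purely as an illustration (this is not the code in the PR, and json_array_to_lines is an invented name), the escape-aware scan can be sketched in plain Python:

def json_array_to_lines(buf):
    # Pure-Python sketch of the escape-aware scan; the real implementation
    # is the Cython function convert_json_to_lines.
    out = []
    in_quotes = False
    is_escaping = False
    num_open_brackets_seen = 0
    for ch in buf[1:-1]:                  # drop the outer '[' and ']'
        if in_quotes:
            if is_escaping:
                is_escaping = False       # this char is escaped, whatever it is
            elif ch == '\\':
                is_escaping = True        # the next char is escaped
            elif ch == '"':
                in_quotes = False         # an unescaped quote closes the string
        elif ch == '"':
            in_quotes = True
        elif ch == '{':
            num_open_brackets_seen += 1
        elif ch == '}':
            num_open_brackets_seen -= 1
        elif ch == ',' and num_open_brackets_seen == 0:
            out.append('\n')              # top-level comma becomes a record boundary
            continue
        out.append(ch)
    return ''.join(out)

With this, json_array_to_lines('[{"a":"x\\\\"},{"a":"y"}]') produces two lines even though the first value ends in a backslash, which a prior-character check would have mis-read as escaping the closing quote.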

A test is also included (previously included in #14693).

xref #14693
xref #15096

@codecov-io commented Jan 12, 2017

Current coverage is 84.75% (diff: 100%)

No coverage report found for master at ab0d236.

Powered by Codecov. Last update ab0d236...92f4e01

@@ -1099,6 +1099,7 @@ def convert_json_to_lines(object arr):
     """
     cdef:
         Py_ssize_t i = 0, num_open_brackets_seen = 0, in_quotes = 0, length
+        Py_ssize_t is_escaping = 0

Review comment:

make in_quotes and is_escaping of type bint

@jreback (Contributor) commented Jan 12, 2017

pls add a whatsnew note (0.20.0 bug fix section)

@jreback added the Bug and IO JSON (read_json, to_json, json_normalize) labels on Jan 12, 2017
@rouzazari force-pushed the to_json_lines_with_escaping branch from 60bb9da to 92f4e01 on January 13, 2017 00:51
@rouzazari (Contributor, Author) commented:

Thanks, @jreback. I made the changes you requested, rebased to master, and squashed the commits.

Please note: I revised the test to include a test for escaped characters in keys and values (i.e. columns and data of dataframe). Previously I was only testing escaped characters in values/data.

@@ -351,6 +351,7 @@ Bug Fixes
 - Require at least 0.23 version of cython to avoid problems with character encodings (:issue:`14699`)
 - Bug in converting object elements of array-like objects to unsigned 64-bit integers (:issue:`4471`, :issue:`14982`)
 - Bug in ``pd.pivot_table()`` where no error was raised when values argument was not in the columns (:issue:`14938`)
+- Bug in ``.to_json()`` where lines=True and contents (keys or values) contain escaped characters (:issue:`15096`)

Review comment:

use double-backticks around lines=True

@@ -972,6 +972,13 @@ def test_to_jsonl(self):
         self.assertEqual(result, expected)
         assert_frame_equal(pd.read_json(result, lines=True), df)

+        # GH15096

Review comment:

add a 1-line comment on what you are testing here
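
For reference, a standalone sketch of such a regression test (illustrative only; the test name here is invented, and the exact frame and assertions merged into test_to_jsonl may differ) could look like:

import pandas as pd
from pandas import DataFrame
from pandas.util.testing import assert_frame_equal  # location as of pandas 0.x


def test_to_json_lines_escaped_characters():
    # Escaped characters in both the columns (keys) and the data (values), GH15096.
    df = DataFrame([["foo\\", "bar"], ['foo"', "bar"]], columns=["a\\", "b"])
    result = df.to_json(orient="records", lines=True)
    expected = ('{"a\\\\":"foo\\\\","b":"bar"}\n'
                '{"a\\\\":"foo\\"","b":"bar"}')
    assert result == expected
    # Round-trip through read_json to confirm the lines parse back cleanly.
    assert_frame_equal(pd.read_json(result, lines=True), df)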

@jreback added this to the 0.20.0 milestone on Jan 13, 2017
@jreback (Contributor) commented Jan 13, 2017

lgtm, minor comments and some linting issues to fix. ping on green.

@rouzazari force-pushed the to_json_lines_with_escaping branch from 92f4e01 to d114455 on January 13, 2017 18:20, with the commit message:

Updates existing to_json methodology by adding is_escaping variable, which ensures escaped chars are handled correctly.

- Includes test for escaped characters in keys and values (i.e. columns and data).
- Includes bug fix in whatsnew
- Revised type of in_quotes and is_escaping to bint

xref pandas-dev#14693
xref pandas-dev#15096
@jreback closed this in 1a18420 on Jan 13, 2017
@jreback (Contributor) commented Jan 13, 2017

thanks! nice concise PR! keep em coming

@rouzazari (Contributor, Author) commented:

Thanks, @jreback. I appreciate the reviews and feedback.

FYI: There is no change in runtime performance from the bug fix. This is somewhat obvious, but it was a good exercise for me, so I thought I'd share. I reran the following performance tests from #14408:

Master:

In [5]: df = DataFrame(dict([('float{0}'.format(i), np.random.randn(N)) for i in range(C)]))
In [6]: %timeit df.to_json('foo.json',orient='records',lines=True)
10 loops, best of 3: 135 ms per loop

This branch:

In [5]: df = DataFrame(dict([('float{0}'.format(i), np.random.randn(N)) for i in range(C)]))
In [6]: %timeit df.to_json('foo.json',orient='records',lines=True)
10 loops, best of 3: 135 ms per loop
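
(N and C are defined in #14408 and are not shown above; for completeness, a self-contained variant with placeholder sizes, purely illustrative, would be:)

import timeit

import numpy as np
from pandas import DataFrame

# N and C are placeholders here; the values actually used come from pandas-dev#14408.
N, C = 100000, 5
df = DataFrame(dict([('float{0}'.format(i), np.random.randn(N)) for i in range(C)]))

# Rough stand-in for the %timeit line when running outside IPython.
print(timeit.timeit(
    lambda: df.to_json('foo.json', orient='records', lines=True),
    number=10))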

@jreback (Contributor) commented Jan 13, 2017

@rouzazari great!

note that we already have some asv's for perf testing, I am not sure how much they cover for json though, you can have a look in pandas/asv/benchmarks/packers.py

if you think some cases need more, then please PR!
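
As a rough illustration of the asv style (a hypothetical sketch, not a benchmark that exists in the repo; the class and method names are invented):

import numpy as np
from pandas import DataFrame


class ToJSONLines(object):
    # Hypothetical asv-style benchmark: asv times methods named time_*,
    # calling setup() beforehand.
    def setup(self):
        self.df = DataFrame({'float{0}'.format(i): np.random.randn(100000)
                             for i in range(5)})

    def time_to_json_lines(self):
        self.df.to_json(orient='records', lines=True)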

@rouzazari (Contributor, Author) commented:

Got it. I'm not too familiar with this area, but I'll take a look.

AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this pull request Mar 21, 2017
Updates existing to_json methodology by adding is_escaping variable,
which ensures escaped chars are handled correctly.

xref pandas-dev#14693

closes pandas-dev#15096

Author: Rouz Azari <[email protected]>

Closes pandas-dev#15117 from rouzazari/to_json_lines_with_escaping and squashes the following commits:

d114455 [Rouz Azari] BUG: Fix to_json lines with escaped characters
Labels: Bug, IO JSON (read_json, to_json, json_normalize)

Linked issue: to_json() line separation broken by backslash in content