BUG: Fix for json lines issue with backslashed quotes #14693

joshowen · 2016-11-19T00:08:29Z

tests added / passed
passes git diff upstream/master | flake8 --diff
whatsnew entry

This is an additional fix to:
#14429
#14391

codecov-io · 2016-11-19T05:58:57Z

Current coverage is 85.20% (diff: 100%)

Merging #14693 into master will increase coverage by <.01%

@@             master     #14693   diff @@
==========================================
  Files           143        143          
  Lines         50787      50787          
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43273      43274     +1   
+ Misses         7514       7513     -1   
  Partials          0          0

Powered by Codecov. Last update f26b049...9908a7c

jreback · 2016-11-21T11:42:46Z

doc/source/whatsnew/v0.19.2.txt

@@ -59,7 +59,7 @@ Bug Fixes
 - Bug in clipboard functions on Windows 10 and python 3 (:issue:`14362`, :issue:`12807`)
 - Bug in ``.to_clipboard()`` and Excel compat (:issue:`12529`)

-
+- Bug in to_json with lines=true containing backslashed quotes (:issue:`14693`)


.to_json() with lines=True specified, containing .....

jreback · 2016-11-21T11:47:56Z

pandas/io/tests/json/test_pandas.py

        result = df.to_json(orient="records", lines=True)
-        expected = '{"a":"foo}","b":"bar"}\n{"a":"foo\\"","b":"bar"}'
+        expected = ('{"a":"foo}","b":"bar"}\n{"a":"foo\\"","b":"bar"}\n'
+                    '{"a":"foo\\\\","b":"bar"}')
        self.assertEqual(result, expected)


In [5]: df.to_json(lines=True,orient='records')
Out[5]: '{"a":"foo}","b":"bar"}\n{"a":"foo\\"","b":"bar"}\n{"a":"foo\\\\","b":"bar"}'

is on current master. what is different?
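For illustration only (not part of the PR): one way to sanity-check an expected string like the one in the test above is to parse every line on its own with the standard-library `json` module. The variable names here are assumptions made for the sketch; the string literal is copied from the test diff.

```python
import json

# Expected output copied from the test in this PR (a Python string literal,
# so "\\" below is a single backslash in the actual text).
expected = ('{"a":"foo}","b":"bar"}\n{"a":"foo\\"","b":"bar"}\n'
            '{"a":"foo\\\\","b":"bar"}')

# Each line of JSON Lines output must be a complete JSON document by itself.
rows = [json.loads(line) for line in expected.split("\n")]
values_a = [row["a"] for row in rows]
```

If any comma inside a string value had been turned into a line break, `json.loads` would raise on the truncated line.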

joshowen · 2016-11-21T16:56:27Z

This fixed an edge case, but ultimately there are a bunch more. I'm moving away from using this to the following code:

        from io import StringIO
        import jsonlines
        import ujson

        def to_json_lines(s):
            # s: JSON array string, e.g. df.to_json(orient="records")
            with StringIO() as buf:
                writer = jsonlines.Writer(buf)
                writer.write_all(ujson.loads(s))
                writer.close()
                buf.seek(0)
                return buf.read()

Their solution is fairly simple, but I'm not comfortable enough to update the vendored ujson package.
https://github.com/wbolster/jsonlines/blob/master/jsonlines/jsonlines.py

I've added the following PR to speed it up a bit:
wbolster/jsonlines#24

joshowen · 2016-11-21T17:09:06Z

@jreback I think the right way to do this is to use jsonlines or to build its functionality into ujson rather than trying to transform the json formatted output. What do you think?
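A minimal sketch of the idea in the comment above, assuming only the standard-library `json` module instead of `jsonlines` or `ujson`: serialize each record separately and join the results with newlines, rather than rewriting the commas of the array output. The function name `records_to_json_lines` is hypothetical.

```python
import json


def records_to_json_lines(array_json: str) -> str:
    """Re-serialize a JSON array of records as one JSON document per line."""
    records = json.loads(array_json)
    # json.dumps escapes quotes, backslashes and newlines inside strings,
    # so every serialized record occupies exactly one line.
    return "\n".join(json.dumps(rec, separators=(",", ":")) for rec in records)


# The same three records as in the test of this PR, as a JSON array.
array_json = ('[{"a":"foo}","b":"bar"},{"a":"foo\\"","b":"bar"},'
              '{"a":"foo\\\\","b":"bar"}]')
lines = records_to_json_lines(array_json).split("\n")
```

This trades speed for simplicity: the array is parsed and serialized again, which is the cost the later comments weigh against handling it in the ujson code.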

wbolster · 2016-11-21T19:04:21Z

jfyi, StringIO.getvalue() is typically used to obtain the value as a string, instead of .seek(0) and .read().
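A small standard-library sketch of the point above (variable names are made up for the example): both approaches return the same text, but `getvalue()` does not depend on the current stream position.

```python
import io

buf = io.StringIO()
buf.write('{"a":1}\n')
buf.write('{"a":2}\n')

# getvalue() returns the whole buffer regardless of the current position.
via_getvalue = buf.getvalue()

# seek(0) followed by read() yields the same text but moves the position.
buf.seek(0)
via_read = buf.read()
```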

wbolster · 2016-11-22T09:51:17Z

also, there is no need to use StringIO as a context manager, while it does make sense to use jsonlines.Writer as a context manager:

import io
import jsonlines
import ujson

def to_json_lines(s):
    # s: JSON array string, e.g. df.to_json(orient="records")
    buf = io.StringIO()
    with jsonlines.Writer(buf) as writer:
        writer.write_all(ujson.loads(s))
    return buf.getvalue()

jreback · 2016-11-22T11:43:16Z

@joshowen

> @jreback I think the right way to do this is to use jsonlines or to build its functionality into ujson rather than trying to transform the json formatted output. What do you think?

IIRC from the original issue, @aterrel and I had discussed this. Though it's pretty performant now, the correct approach is to put it in the custom ujson code that pandas uses. That is somewhat more involved (though probably pretty straightforward).

Updates existing to_json methodology by adding is_escaping variable, which ensures escaped chars are handled correctly.

Bug description: A simple check of whether the prior char is a backslash is insufficient because the backslash may itself be escaped.

A test is also included (previously included in pandas-dev#14693).

xref pandas-dev#14693
xref pandas-dev#15096
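The bug description above can be illustrated with a small state machine. This is a sketch in plain Python under stated assumptions, not the pandas implementation: the function name `commas_to_newlines` is hypothetical, and the input is assumed to be the comma-separated records of a JSON array with the outer brackets already stripped. Only the names `in_quotes` and `is_escaping` come from the commit messages.

```python
import json


def commas_to_newlines(body: str) -> str:
    """Replace the commas between top-level JSON objects with newlines."""
    out = []
    depth = 0            # unclosed '{' seen outside of string literals
    in_quotes = False    # currently inside a JSON string literal
    is_escaping = False  # previous character was an unescaped backslash
    for ch in body:
        if is_escaping:
            # This character is escaped, so it cannot open or close a string.
            is_escaping = False
        elif ch == "\\":
            is_escaping = True
        elif ch == '"':
            in_quotes = not in_quotes
        elif not in_quotes:
            if ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
            elif ch == "," and depth == 0:
                ch = "\n"  # separator between two records
        out.append(ch)
    return "".join(out)


records = [{"a": "foo}", "b": "bar"},
           {"a": 'foo"', "b": "bar"},
           {"a": "foo\\", "b": "bar"}]
# Compact array output with the surrounding "[" and "]" removed.
body = json.dumps(records, separators=(",", ":"))[1:-1]
lines = commas_to_newlines(body).split("\n")
```

The third record is the case a "previous character is a backslash" check gets wrong: the value ends in an escaped backslash, so the quote after it really does close the string.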

Updates existing to_json methodology by adding is_escaping variable, which ensures escaped chars are handled correctly.

- Includes test for escaped characters in keys and values (i.e. columns and data).
- Includes bug fix in whatsnew
- Revised type of in_quotes and is_escaping to bint

xref pandas-dev#14693
xref pandas-dev#15096

Updates existing to_json methodology by adding is_escaping variable, which ensures escaped chars are handled correctly.

xref #14693
closes #15096

Author: Rouz Azari

Closes #15117 from rouzazari/to_json_lines_with_escaping and squashes the following commits:

d114455 [Rouz Azari] BUG: Fix to_json lines with escaped characters

Josh Owen added 11 commits October 26, 2016 17:40

9092015 Merge remote-tracking branch 'pandas-dev/master'
a5ee0f2 handle edge case where prior character is an escaped backslash
2d06e25 avoid out of bounds
3a7bc17 just check that i > 1, add test
b198a78 correctly handle trailing backslash
933dd50 fixed typo
cc86b35 yet another logic change
f069e9f Merge remote-tracking branch 'pandas-dev/master' into fix-lines-2
87b2798 fixed expected data
8548720 Merge remote-tracking branch 'pandas-dev/master' into fix-lines-2
abd9e82 lint

joshowen changed the title ~~BUG: Fix for json lines issue with backslash quotes~~ BUG: Fix for json lines issue with backslashed quotes Nov 21, 2016

Josh Owen added 2 commits November 20, 2016 20:14

a91f1de Merge remote-tracking branch 'pandas-dev/master' into fix-lines-2
9908a7c added whatsnew entry

jreback added the IO JSON read_json, to_json, json_normalize label Nov 21, 2016

jreback reviewed Nov 21, 2016

jreback added this to the 0.19.2 milestone Nov 21, 2016

jreback reviewed Nov 21, 2016

joshowen closed this Nov 21, 2016

jreback mentioned this pull request Jan 10, 2017

to_json() line separation broken by backslash in content #15096

Closed

rouzazari mentioned this pull request Jan 12, 2017

BUG: Fix to_json lines with escaped characters #15117

Closed

4 tasks
