Skip to content

BUG: Fix for json lines issue with backslashed quotes #14693

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 13 commits into from

Conversation

joshowen
Copy link

@joshowen joshowen commented Nov 19, 2016

  • tests added / passed
  • passes git diff upstream/master | flake8 --diff
  • whatsnew entry

This is an additional fix to:
#14429
#14391

@codecov-io
Copy link

codecov-io commented Nov 19, 2016

Current coverage is 85.20% (diff: 100%)

Merging #14693 into master will increase coverage by <.01%

@@             master     #14693   diff @@
==========================================
  Files           143        143          
  Lines         50787      50787          
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43273      43274     +1   
+ Misses         7514       7513     -1   
  Partials          0          0          

Powered by Codecov. Last update f26b049...9908a7c

@joshowen joshowen changed the title BUG: Fix for json lines issue with backslash quotes BUG: Fix for json lines issue with backslashed quotes Nov 21, 2016
@jreback jreback added the IO JSON read_json, to_json, json_normalize label Nov 21, 2016
@@ -59,7 +59,7 @@ Bug Fixes
- Bug in clipboard functions on Windows 10 and python 3 (:issue:`14362`, :issue:`12807`)
- Bug in ``.to_clipboard()`` and Excel compat (:issue:`12529`)


- Bug in to_json with lines=true containing backslashed quotes (:issue:`14693`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.to_json() with lines=True specified, containing .....

@jreback jreback added this to the 0.19.2 milestone Nov 21, 2016
result = df.to_json(orient="records", lines=True)
expected = '{"a":"foo}","b":"bar"}\n{"a":"foo\\"","b":"bar"}'
expected = ('{"a":"foo}","b":"bar"}\n{"a":"foo\\"","b":"bar"}\n'
'{"a":"foo\\\\","b":"bar"}')
self.assertEqual(result, expected)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In [5]: df.to_json(lines=True,orient='records')
Out[5]: '{"a":"foo}","b":"bar"}\n{"a":"foo\\"","b":"bar"}\n{"a":"foo\\\\","b":"bar"}'

is on current master. what is different?

@joshowen
Copy link
Author

This fixed an edge case, but ultimately there are a bunch more. I'm moving away from using this to the following code:

        from io import StringIO
        import jsonlines
        with StringIO() as buf:
            writer = jsonlines.Writer(buf)
            writer.write_all(ujson.loads(s))
            writer.close()
            buf.seek(0)
            return buf.read()

Their solution is fairly simple, but I'm not comfortable enough to update the vendored ujson package.
https://github.com/wbolster/jsonlines/blob/master/jsonlines/jsonlines.py

I've added the following PR to speed it up a bit:
wbolster/jsonlines#24

@joshowen joshowen closed this Nov 21, 2016
@joshowen
Copy link
Author

@jreback I think the right way to do this is to use jsonlines or to build its functionality into ujson rather than trying to transform the json formatted output. What do you think?

@wbolster
Copy link

jfyi, StringIO.getvalue() is typically used to obtain the value as a a string, instead of .seek(0) and .read().

@wbolster
Copy link

also, there is no need to use StringIO as a context manager, while it does make sense to use jsonlines.Writer as a context manager:

import io
import jsonlines

buf = io.StringIO()
with jsonlines.Writer(buf) as writer:
    writer.write_all(ujson.loads(s))
return buf.getvalue()

@jreback
Copy link
Contributor

jreback commented Nov 22, 2016

@joshowen

@jreback I think the right way to do this is to use jsonlines or to build its functionality into ujson rather than trying to transform the json formatted output. What do you think?

IIRC from the original issue, @aterrel and I had discussed this. Though its pretty performant now, the correct approach is to put it in the custom ujson code that pandas uses. That is somewhat more involved (though probably pretty straightforward).

rouzazari added a commit to rouzazari/pandas that referenced this pull request Jan 12, 2017
Updates existing to_json methodology by adding is_escaping variable,
which ensures escaped chars are handled correctly.

Bug description: A simple check of whether the prior char is a backslash
is insufficient because the backslash may itself be escaped.

A test is also included (previously included in pandas-dev#14693).

xref pandas-dev#14693
xref pandas-dev#15096
rouzazari added a commit to rouzazari/pandas that referenced this pull request Jan 13, 2017
Updates existing to_json methodology by adding is_escaping variable, which ensures escaped chars are handled correctly.

- Includes test for escaped characters in keys and values (i.e. columns and data).
- Includes bug fix in whatsnew
- Revised type of in_quotes and is_escaping to bint

xref pandas-dev#14693
xref pandas-dev#15096
rouzazari added a commit to rouzazari/pandas that referenced this pull request Jan 13, 2017
Updates existing to_json methodology by adding is_escaping variable, which ensures escaped chars are handled correctly.

- Includes test for escaped characters in keys and values (i.e. columns and data).
- Includes bug fix in whatsnew
- Revised type of in_quotes and is_escaping to bint

xref pandas-dev#14693
xref pandas-dev#15096
jreback pushed a commit that referenced this pull request Jan 13, 2017
Updates existing to_json methodology by adding is_escaping variable,
which ensures escaped chars are handled correctly.

xref #14693

closes #15096

Author: Rouz Azari <[email protected]>

Closes #15117 from rouzazari/to_json_lines_with_escaping and squashes the following commits:

d114455 [Rouz Azari] BUG: Fix to_json lines with escaped characters
AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this pull request Mar 21, 2017
Updates existing to_json methodology by adding is_escaping variable,
which ensures escaped chars are handled correctly.

xref pandas-dev#14693

closes pandas-dev#15096

Author: Rouz Azari <[email protected]>

Closes pandas-dev#15117 from rouzazari/to_json_lines_with_escaping and squashes the following commits:

d114455 [Rouz Azari] BUG: Fix to_json lines with escaped characters
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO JSON read_json, to_json, json_normalize
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants