to_json() line separation broken by backslash in content #15096

Closed
nkorlin opened this issue Jan 10, 2017 · 1 comment
Labels
Bug, IO JSON (read_json, to_json, json_normalize), Output-Formatting (__repr__ of pandas objects, to_string)

nkorlin commented Jan 10, 2017

Code Sample, a copy-pastable example if possible

>>> import pandas as pd
>>> df = pd.read_json("file://localhost/home/ubuntu/test.json", lines=True)
>>> df.to_json("test-out.json", orient="records", lines=True)
>>> df = pd.read_json("file://localhost/home/ubuntu/test2.json", lines=True)
>>> df.to_json("test2-out.json", orient="records", lines=True)

Problem description

I noticed this issue when a ~8500-row DataFrame was written out as a ~3400-line file, even though lines=True should produce exactly one line per row. I dug into the code a bit, and it turns out that convert_json_to_lines() does not insert newlines correctly when the JSON contains a backslash immediately before a double quote, even if that backslash is itself escaped (in which case the quote really does end the string).
(https://github.com/pandas-dev/pandas/blob/v0.19.2/pandas/lib.pyx#L1114)
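
To make the failure concrete, here is a simplified pure-Python model of the check described above. It is only a sketch, not the Cython code in pandas/lib.pyx: the name naive_json_to_lines is hypothetical, and it assumes the function receives the records as one comma-joined string with the outer brackets already removed.

```python
import json

def naive_json_to_lines(records: str) -> str:
    """Replace top-level commas with newlines (simplified model of the buggy check)."""
    out = list(records)
    depth = 0          # number of unclosed '{' seen outside strings
    in_quotes = False  # True while the scanner believes it is inside a JSON string
    for i, ch in enumerate(records):
        # Buggy rule: a quote counts only if the previous char is not a backslash.
        # This wrongly skips the closing quote of a string ending in an escaped backslash.
        if ch == '"' and i > 0 and records[i - 1] != "\\":
            in_quotes = not in_quotes
        if ch == "," and depth == 0 and not in_quotes:
            out[i] = "\n"
        elif ch == "{" and not in_quotes:
            depth += 1
        elif ch == "}" and not in_quotes:
            depth -= 1
    return "".join(out)

# "asdf \\" is the Python literal for the text: asdf \
rows = [{"test": "qwer"}, {"test": "asdf \\"}, {"test": "zxcv"}]
# Serialize as a compact JSON array, then drop the surrounding [ ].
records = json.dumps(rows, separators=(",", ":"))[1:-1]
lines = naive_json_to_lines(records).split("\n")
print(len(lines))  # 2, although there are 3 records
```

Once the closing quote after the escaped backslash is missed, the scanner stays "inside a string", so the following brace and comma are ignored and the second and third records end up on one line.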

Here are the test files I used with the code sample above, and the outputs I got:

ubuntu@ip:~$ cat test.json 
{"test": "qwer"}
{"test": "asdf \\"}
{"test": "zxcv"}
ubuntu@ip:~$ cat test-out.json 
{"test":"qwer"}
{"test":"asdf \\"},{"test":"zxcv"}ubuntu@ip:~$ 
ubuntu@ip:~$ cat test2.json
{"test": "123"}
{"test": "234 \\"}
{"test": "345"}
{"test": "456 \\"}
{"test": "567"}
ubuntu@ip:~$ cat test2-out.json
{"test":"123"}
{"test":"234 \\"},{"test":"345"},{"test":"456 \\"}
{"test":"567"}ubuntu@ip:~$ 

Expected Output

"-out" files should be identical to originals.

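One way to state this expectation as a check, without relying on pandas, is to parse both files line by line and compare the records. This is a minimal sketch using only the standard library; parse_json_lines and same_records are hypothetical helper names, and the sample strings are copied from the files shown above.

```python
import json

def parse_json_lines(text: str) -> list:
    """Parse JSON Lines text; raises ValueError if a line is not exactly one JSON value."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

def same_records(original: str, output: str) -> bool:
    """True if both texts are valid JSON Lines holding the same records in the same order."""
    try:
        return parse_json_lines(original) == parse_json_lines(output)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False

# Raw strings keep the two backslashes exactly as they appear in the files.
original = "\n".join([
    r'{"test": "qwer"}',
    r'{"test": "asdf \\"}',
    r'{"test": "zxcv"}',
])
expected_output = "\n".join([
    r'{"test":"qwer"}',
    r'{"test":"asdf \\"}',
    r'{"test":"zxcv"}',
])
buggy_output = "\n".join([
    r'{"test":"qwer"}',
    r'{"test":"asdf \\"},{"test":"zxcv"}',
])

print(same_records(original, expected_output))  # True
print(same_records(original, buggy_output))     # False
```

Comparing parsed records rather than raw text ignores harmless differences such as the space after each colon, while still catching two records merged onto one line.
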
Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-68-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.1
nose: None
pip: 9.0.1
setuptools: 27.2.0
Cython: None
numpy: 1.11.2
scipy: 0.18.1
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.7
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: 2.43.0
pandas_datareader: None


jreback commented Jan 10, 2017

There is a PR, #14693, that fixes this, though it was closed by its author. We would welcome someone picking it up.

jreback added the Bug, Difficulty Novice, IO JSON (read_json, to_json, json_normalize), and Output-Formatting (__repr__ of pandas objects, to_string) labels Jan 10, 2017
jreback added this to the 0.20.0 milestone Jan 10, 2017
rouzazari added a commit to rouzazari/pandas that referenced this issue Jan 12, 2017
Updates existing to_json methodology by adding is_escaping variable,
which ensures escaped chars are handled correctly.

Bug description: A simple check of whether the prior char is a backslash
is insufficient because the backslash may itself be escaped.

A test is also included (previously included in pandas-dev#14693).

xref pandas-dev#14693
xref pandas-dev#15096
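
The approach described in this commit message can be sketched in pure Python as follows. This is an illustration under stated assumptions, not the committed Cython code: only the variable names in_quotes and is_escaping come from the commit message, while the function name json_to_lines and the input format (comma-joined records without the outer brackets) are assumptions.

```python
import json

def json_to_lines(records: str) -> str:
    """Replace top-level commas with newlines, tracking escape state explicitly."""
    out = list(records)
    depth = 0            # number of unclosed '{' seen outside strings
    in_quotes = False    # True while inside a JSON string
    is_escaping = False  # True if the previous char was an unescaped backslash
    for i, ch in enumerate(records):
        if ch == '"' and not is_escaping:
            in_quotes = not in_quotes
        elif ch == "," and depth == 0 and not in_quotes:
            out[i] = "\n"
        elif ch == "{" and not in_quotes:
            depth += 1
        elif ch == "}" and not in_quotes:
            depth -= 1
        # A backslash starts an escape only if it is not itself being escaped.
        is_escaping = ch == "\\" and not is_escaping
    return "".join(out)

rows = [
    {"test": "asdf \\"},        # value ends with a single backslash
    {"test": 'say "hi", ok'},   # value contains escaped quotes and a comma
    {"test": "zxcv"},
]
# Serialize as a compact JSON array, then drop the surrounding [ ].
records = json.dumps(rows, separators=(",", ":"))[1:-1]
lines = json_to_lines(records).split("\n")
print(len(lines))  # 3, one line per record
```

Tracking the escape state, instead of looking only at the previous character, distinguishes an escaped quote inside a string from a closing quote that follows an escaped backslash.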
rouzazari added a commit to rouzazari/pandas that referenced this issue Jan 13, 2017
Updates existing to_json methodology by adding is_escaping variable, which ensures escaped chars are handled correctly.

- Includes test for escaped characters in keys and values (i.e. columns and data).
- Includes bug fix in whatsnew
- Revised type of in_quotes and is_escaping to bint

xref pandas-dev#14693
xref pandas-dev#15096
AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this issue Mar 21, 2017
Updates existing to_json methodology by adding is_escaping variable,
which ensures escaped chars are handled correctly.

xref pandas-dev#14693

closes pandas-dev#15096

Author: Rouz Azari

Closes pandas-dev#15117 from rouzazari/to_json_lines_with_escaping and squashes the following commits:

d114455 [Rouz Azari] BUG: Fix to_json lines with escaped characters