[BUG] handle } in line delimited json #14391

Closed
joshowen wants to merge 11 commits into master

Conversation

@joshowen commented Oct 10, 2016

@joshowen changed the title from "[BUG] fix for quoted special characters" to "[BUG] fix for quoted special characters in line delimited json" Oct 10, 2016
@joshowen changed the title from "[BUG] fix for quoted special characters in line delimited json" to "[BUG] handle } in line delimited json" Oct 10, 2016
@codecov-io commented Oct 11, 2016

Current coverage is 85.26% (diff: 100%)

Merging #14391 into master will increase coverage by <.01%

@@             master     #14391   diff @@
==========================================
  Files           140        140          
  Lines         50634      50639     +5   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43173      43178     +5   
  Misses         7461       7461          
  Partials          0          0          

Powered by Codecov. Last update d98e982...edb1488

@jreback added the Bug and IO JSON (read_json, to_json, json_normalize) labels Oct 11, 2016
@jreback (Contributor) commented Oct 11, 2016

Can you add a benchmark for lines=True in asv_bench/benchmarks/packers.py (alongside the other JSON ones)?
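
For reference, a minimal sketch of what such an asv benchmark could look like (the class and method names here are illustrative, not necessarily what was added in this PR):

import numpy as np
from pandas import DataFrame, date_range

class packers_write_json_lines(object):
    # hypothetical asv benchmark: build a wide numeric frame once in setup,
    # then time writing it as line-delimited records
    def setup(self):
        self.N = 100000
        self.C = 5
        self.index = date_range('20000101', periods=self.N, freq='H')
        self.df = DataFrame(
            dict(('float{0}'.format(i), np.random.randn(self.N))
                 for i in range(self.C)),
            index=self.index).reset_index(drop=True)

    def time_write_json_lines(self):
        self.df.to_json('__test__.json', orient='records', lines=True)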

@joshowen (Author) commented

Not sure why that test failed; the only change was a line of whitespace.

self.f = '__test__.msg'
self.N = 100000
self.C = 5
self.index = date_range('20000101', periods=self.N, freq='H')
Contributor:

looks like some things got repeated here

Author:

Looks like self.N/self.C are repeated in most of these classes. Want me to clean them all up?

Contributor:

Sure, that would be great (you can also make a common base class or classes if that helps).

Member:

@joshowen you can leave it here as is. I cleaned this up in another PR (@jreback yes I know, I should merge that ...)

Contributor:

ok that's fine (though @joshowen make sure your example doesn't have dups as this is new code)

Member:

ah, yes, it is of course OK to remove the lines in this added code that you do not need for this benchmark

df = DataFrame([["foo}", "bar"], ['foo"', "bar"]], columns=['a', 'b'])
result = df.to_json(orient="records", lines=True)
expected = '{"a":"foo}","b":"bar"}\n{"a":"foo\\"","b":"bar"}'
self.assertEqual(result, expected)
Contributor:

Can you also round-trip it and use assert_frame_equal on the result (in addition to the above test)?
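
Something like the rough sketch below; the use of read_json with lines=True and pandas.util.testing is an assumption about how the round-trip could be written, not necessarily the test that was committed:

import pandas.util.testing as tm
from pandas import DataFrame, read_json

df = DataFrame([["foo}", "bar"], ['foo"', "bar"]], columns=['a', 'b'])

# write as line-delimited records, then parse it back and compare
result = read_json(df.to_json(orient='records', lines=True),
                   orient='records', lines=True)
tm.assert_frame_equal(result, df)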

@jorisvandenbossche added this to the 0.19.1 milestone Oct 12, 2016
@jreback (Contributor) commented Oct 12, 2016

@joshowen can you also post a run of this benchmark (versus previous)? You can also do it with %timeit. Just checking for any perf issues.

@joshowen (Author) commented

@jreback is there an easy way to do that? Or should I port the asv test to master and run/compare?

@jreback (Contributor) commented Oct 12, 2016

You can run asv if you want; otherwise just do it in IPython (before and after), e.g. something like:

In [6]: df = DataFrame(dict([('float{0}'.format(i), np.random.randn(N)) for i in range(C)]), index=index).reset_index(drop=True)

In [7]: df.to_json('foo.json',orient='records',lines=True)

In [8]: %timeit df.to_json('foo.json',orient='records',lines=True)
1 loop, best of 3: 3.66 s per loop

In [9]: %timeit df.to_json('foo.json',orient='records')
10 loops, best of 3: 98.8 ms per loop

and looking at this, we have a BIG perf hit when lines=True (this is a separate issue).

@jreback (Contributor) commented Oct 12, 2016

cc @aterrel

@jreback (Contributor) commented Oct 12, 2016

@joshowen can you open another issue about the perf?

In [15]: %prun df.to_json('foo.json',orient='records',lines=True)
         100058 function calls (100056 primitive calls) in 3.759 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    2.017    2.017    3.656    3.656 json.py:600(_convert_to_line_delimits)
        1    1.002    1.002    1.002    1.002 {method 'join' of 'str' objects}
        1    0.630    0.630    0.630    0.630 {numpy.core.multiarray.array}
        1    0.078    0.078    0.078    0.078 {pandas.json.dumps}
        1    0.012    0.012    0.012    0.012 {method 'write' of 'file' objects}

something odd going on here
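
For context, converting the records-orient output to line-delimited form means rescanning the string for record boundaries, and a } only terminates a record when it is outside a quoted string (which is exactly the case this PR fixes). The sketch below is a purely illustrative pure-Python version of that kind of scan, not the pandas implementation; a character-by-character Python loop like this over a large output string would be consistent with the timings above.

def to_line_delimited(records_json):
    # records_json looks like '[{"a":"foo}"},{"a":"bar"}]'
    s = records_json.strip()[1:-1]  # drop the surrounding [ ]
    out = []
    depth = 0           # { } nesting depth outside quoted strings
    in_quotes = False   # currently inside a quoted string?
    escaped = False     # previous character was a backslash?
    start = 0
    for i, ch in enumerate(s):
        if escaped:
            escaped = False
        elif ch == '\\':
            escaped = True
        elif ch == '"':
            in_quotes = not in_quotes
        elif not in_quotes:
            if ch == '{':
                depth += 1
            elif ch == '}':
                depth -= 1
                if depth == 0:
                    out.append(s[start:i + 1])
                    start = i + 2  # skip the ',' between records
    return '\n'.join(out)

print(to_line_delimited('[{"a":"foo}","b":"bar"},{"a":"foo\\"","b":"bar"}]'))
# {"a":"foo}","b":"bar"}
# {"a":"foo\"","b":"bar"}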

@jreback (Contributor) commented Oct 14, 2016

lgtm. @jorisvandenbossche

@jorisvandenbossche (Member) commented

Thanks @joshowen !

Labels: Bug, IO JSON (read_json, to_json, json_normalize)
Development

Successfully merging this pull request may close these issues.

BUG: Line delimited json breaks if string includes }
4 participants