Add nrows to read json. #33962

hasnain2808 · 2020-05-04T05:19:02Z

closes ENH: Add nrows parameter to pd.read_json #33916
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff

Add the nrows parameter to read_json, which returns only the requested number of rows from line-delimited JSON.
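
As a rough illustration of the behaviour described above (not the code in this PR), reading only the first nrows records of line-delimited JSON can be sketched with the standard library. The helper name `read_first_rows` and the sample data are made up for this example; with this change in place, the corresponding pandas call would be along the lines of `pd.read_json(path, lines=True, nrows=2)`.

```python
import io
import json
from itertools import islice


def read_first_rows(buffer, nrows=None):
    """Parse line-delimited JSON, stopping after `nrows` lines if given.

    Hypothetical helper for illustration only; it is not pandas' implementation.
    """
    # islice stops pulling lines from the buffer once nrows lines were read,
    # so the rest of the input is never parsed.
    lines = buffer if nrows is None else islice(buffer, nrows)
    return [json.loads(line) for line in lines]


data = io.StringIO('{"a": 1, "b": 2}\n{"a": 3, "b": 4}\n{"a": 5, "b": 6}\n')
rows = read_first_rows(data, nrows=2)
print(rows)
```

Only the first two records are parsed here; the third line is left unread in the buffer.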

hasnain2808 · 2020-05-07T13:09:19Z

Hey @jreback
I have made all the changes.
As for typing the args, I tried to annotate all of them, but I don't know enough about it yet, so I added as many type hints as I could, including nrows and the ones that became necessary. I will try raising a new PR later that adds type hints for all args.
I hope everything's in order.
It is ready to be reviewed.
Regards.

hasnain2808 · 2020-06-02T18:50:18Z

Hey @jreback

Hope you are having a great day

This PR adds the ability to select the number of rows returned when the lines parameter is set.

According to the benchmarks, I believe the feature is working fine.

So is it possible to merge this feature and then create a new issue for the concerns we have with the chunksize feature? I have a good grip on this section of the codebase, so I should be able to work on that too, but I believe we should have separate PRs for them.

Thanks and Regards

jreback · 2020-06-02T22:27:04Z

@hasnain2808 yep that sounds fine. please add a whatsnew (about the nrows feature) and create an issue about chunksize & perf which can certainly be a followup.

ping on green (for the added whatsnew), put in other enhancements in 1.1

hasnain2808 · 2020-06-03T14:09:19Z

Hey @jreback
Added the whatsnew and it's green
I hope everything's in order to merge.
Thanks.

WillAyd · 2020-06-03T23:31:44Z

@hasnain2808 looks like another merge conflict - can you fix up?

So the benchmarks did show an improvement, right? It's not entirely clear from the link that you sent; usually at the end of the run you should get something saying PERFORMANCE IMPROVED. I might be overlooking that in what you shared.

As long as that shows an improvement this lgtm

hasnain2808 · 2020-06-04T00:20:05Z

@WillAyd

Hope you're having a nice day

I will fix the merge conflicts

Yes, memory consumption and run time are both lower when we use the nrows parameter.

Because nrows is optional, we added a new benchmark for it. A newly added benchmark has no earlier result to compare against, so the benchmark results do not show the PERFORMANCE IMPROVED verdict.

Pasted the results into the spreadsheet to compare easily
Link

Raw dumps are here
https://pastebin.com/E22PAxKe

Please tell me if anything else needs to be done.
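
The kind of comparison described above can be sketched without pandas. This is only an illustration under stated assumptions, not the asv benchmark added in this PR: the helpers `parse_lines` and `peak_memory`, the synthetic data, and the row counts are all made up for the example, and `tracemalloc` stands in for a real memory benchmark.

```python
import io
import json
import tracemalloc
from itertools import islice


def parse_lines(buffer, nrows=None):
    """Parse line-delimited JSON, optionally stopping after nrows lines."""
    lines = buffer if nrows is None else islice(buffer, nrows)
    return [json.loads(line) for line in lines if line.strip()]


def peak_memory(nrows=None, total_rows=5000):
    """Return (rows parsed, peak traced bytes) for one parse of synthetic data."""
    data = "\n".join(
        json.dumps({"a": i, "b": str(i) * 5}) for i in range(total_rows)
    )
    buffer = io.StringIO(data)
    # Start tracing only after the input exists, so the peak reflects
    # the parsing step rather than building the test data.
    tracemalloc.start()
    rows = parse_lines(buffer, nrows)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return len(rows), peak


full_count, full_peak = peak_memory()
head_count, head_peak = peak_memory(nrows=10)
print(full_count, head_count, head_peak < full_peak)
```

Stopping after the first few lines means far fewer parsed objects are alive at once, which is why the peak for the limited read is expected to be lower than for the full read.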

hasnain2808 · 2020-06-04T01:01:53Z

Fixed the merge conflicts.
I hope we can merge this before someone else adds a whatsnew entry to master.

hasnain2808 · 2020-06-04T13:05:30Z

I do not think this failure is related.
It started appearing once I merged master into my branch to resolve the merge conflicts.

The benchmark run fails at around 0.65%:

##[error][  0.65%] ··· arithmetic.ApplyIndex.time_apply_index                 3/10 failed
[  0.65%] ··· =================================== ==========
                             offset                         
              ----------------------------------- ----------
                      <YearEnd: month=12>          1.55±0ms 
                      <YearBegin: month=1>         1.37±0ms 
                 <QuarterEnd: startingMonth=3>     1.71±0ms 
                <QuarterBegin: startingMonth=3>    1.59±0ms 
                           <MonthEnd>              2.25±0ms 
                          <MonthBegin>             1.34±0ms 
                 <DateOffset: days=2, months=2>    3.16±0ms 
                         <BusinessDay>              failed  
                <SemiMonthEnd: day_of_month=15>     failed  
               <SemiMonthBegin: day_of_month=15>    failed  
              =================================== ==========

[  0.65%] ···· For parameters: <BusinessDay>
               Traceback (most recent call last):
                 File "/home/runner/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/asv/benchmark.py", line 1184, in main_run_server
                   main_run(run_args)
                 File "/home/runner/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/asv/benchmark.py", line 1058, in main_run
                   result = benchmark.do_run()
                 File "/home/runner/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/asv/benchmark.py", line 537, in do_run
                   return self.run(*self._current_params)
                 File "/home/runner/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/asv/benchmark.py", line 627, in run
                   samples, number = self.benchmark_timing(timer, min_repeat, max_repeat,
                 File "/home/runner/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/asv/benchmark.py", line 694, in benchmark_timing
                   timing = timer.timeit(number)
                 File "/home/runner/miniconda3/envs/pandas-dev/lib/python3.8/timeit.py", line 177, in timeit
                   timing = self.inner(it, self.timer)
                 File "<timeit-src>", line 6, in inner
                 File "/home/runner/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/asv/benchmark.py", line 599, in <lambda>
                   func = lambda: self.func(*param)
                 File "/home/runner/work/pandas/pandas/asv_bench/benchmarks/arithmetic.py", line 469, in time_apply_index
                   offset.apply_index(self.rng)
                 File "pandas/_libs/tslibs/offsets.pyx", line 87, in pandas._libs.tslibs.offsets.apply_index_wraps.wrapper
                 File "pandas/_libs/tslibs/offsets.pyx", line 1397, in pandas._libs.tslibs.offsets.BusinessDay.apply_index
               AttributeError: 'PeriodIndex' object has no attribute '_addsub_int_array'
               
               For parameters: <SemiMonthEnd: day_of_month=15>
               Traceback (most recent call last):
                 File "/home/runner/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/asv/benchmark.py", line 1184, in main_run_server
                   main_run(run_args)
                 File "/home/runner/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/asv/benchmark.py", line 1058, in main_run
                   result = benchmark.do_run()
                 File "/home/runner/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/asv/benchmark.py", line 537, in do_run
                   return self.run(*self._current_params)
                 File "/home/runner/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/asv/benchmark.py", line 627, in run
                   samples, number = self.benchmark_timing(timer, min_repeat, max_repeat,
                 File "/home/runner/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/asv/benchmark.py", line 694, in benchmark_timing
                   timing = timer.timeit(number)
                 File "/home/runner/miniconda3/envs/pandas-dev/lib/python3.8/timeit.py", line 177, in timeit
                   timing = self.inner(it, self.timer)
                 File "<timeit-src>", line 6, in inner
                 File "/home/runner/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/asv/benchmark.py", line 599, in <lambda>
                   func = lambda: self.func(*param)
                 File "/home/runner/work/pandas/pandas/asv_bench/benchmarks/arithmetic.py", line 469, in time_apply_index
                   offset.apply_index(self.rng)
                 File "pandas/_libs/tslibs/offsets.pyx", line 87, in pandas._libs.tslibs.offsets.apply_index_wraps.wrapper
                 File "pandas/_libs/tslibs/offsets.pyx", line 2319, in pandas._libs.tslibs.offsets.SemiMonthOffset.apply_index
               AttributeError: 'PeriodIndex' object has no attribute '_addsub_int_array'
               
               For parameters: <SemiMonthBegin: day_of_month=15>
               Traceback (most recent call last):
                 File "/home/runner/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/asv/benchmark.py", line 1184, in main_run_server
                   main_run(run_args)
                 File "/home/runner/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/asv/benchmark.py", line 1058, in main_run
                   result = benchmark.do_run()
                 File "/home/runner/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/asv/benchmark.py", line 537, in do_run
                   return self.run(*self._current_params)
                 File "/home/runner/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/asv/benchmark.py", line 627, in run
                   samples, number = self.benchmark_timing(timer, min_repeat, max_repeat,
                 File "/home/runner/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/asv/benchmark.py", line 694, in benchmark_timing
                   timing = timer.timeit(number)
                 File "/home/runner/miniconda3/envs/pandas-dev/lib/python3.8/timeit.py", line 177, in timeit
                   timing = self.inner(it, self.timer)
                 File "<timeit-src>", line 6, in inner
                 File "/home/runner/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/asv/benchmark.py", line 599, in <lambda>
                   func = lambda: self.func(*param)
                 File "/home/runner/work/pandas/pandas/asv_bench/benchmarks/arithmetic.py", line 469, in time_apply_index
                   offset.apply_index(self.rng)
                 File "pandas/_libs/tslibs/offsets.pyx", line 87, in pandas._libs.tslibs.offsets.apply_index_wraps.wrapper
                 File "pandas/_libs/tslibs/offsets.pyx", line 2319, in pandas._libs.tslibs.offsets.SemiMonthOffset.apply_index
               AttributeError: 'PeriodIndex' object has no attribute '_addsub_int_array'

hasnain2808 · 2020-06-04T15:48:53Z

ping

hasnain2808 · 2020-06-04T20:22:35Z

Ping
@jreback it's all green.

jreback · 2020-06-04T20:44:25Z

thanks @hasnain2808

hasnain2808 · 2020-06-04T20:45:50Z

Thanks @jreback

ENH Add nrow parameter for line delimited json for read_json pandas-d…

15d1d1e

…ev#33916

hasnain2808 changed the title ~~Add nrows to read json~~ Add nrows to read json. May 4, 2020

ENH solve linting via black8 for Add nrow parameter for line delimite…

fc4993f

…d json for read_json pandas-dev#33916

hasnain2808 force-pushed the add-nrows-to-read-json branch from 569db1c to fc4993f Compare May 4, 2020 07:13

WillAyd reviewed May 5, 2020

jreback requested changes May 5, 2020

jreback added the IO JSON (read_json, to_json, json_normalize) and Performance (Memory or execution speed performance) labels May 5, 2020

hasnain2808 added 2 commits May 7, 2020 10:21

optimized list indexing and type hints added

028d398

solved errors related to typing of args and linting issues

8765192

hasnain2808 requested a review from jreback May 7, 2020 13:15

jreback requested changes May 7, 2020

hasnain2808 force-pushed the add-nrows-to-read-json branch 2 times, most recently from f72ea60 to 896de23 Compare May 9, 2020 21:49

use an iterator to slice strings

ca9c3e0

hasnain2808 force-pushed the add-nrows-to-read-json branch from 896de23 to ca9c3e0 Compare May 10, 2020 01:59

hasnain2808 requested review from jreback and WillAyd May 10, 2020 02:43

WillAyd requested changes May 19, 2020

Update pandas/io/json/_json.py fixed typo

b355f9c

Co-authored-by: William Ayd <[email protected]>

hasnain2808 force-pushed the add-nrows-to-read-json branch from 40a5c0a to a0a55c9 Compare May 19, 2020 17:28

fixed errors with nrows iterators

74e9c2b

hasnain2808 force-pushed the add-nrows-to-read-json branch from a0a55c9 to 74e9c2b Compare May 20, 2020 07:22

hasnain2808 requested a review from WillAyd May 20, 2020 08:29

WillAyd requested changes May 20, 2020

remove print statements

237010e

hasnain2808 requested a review from WillAyd May 20, 2020 16:14

WillAyd requested changes May 21, 2020

hasnain2808 force-pushed the add-nrows-to-read-json branch from d386e49 to ecdbc10 Compare May 22, 2020 19:07

hasnain2808 marked this pull request as ready for review June 2, 2020 18:35

jreback added this to the 1.1 milestone Jun 2, 2020

hasnain2808 added 4 commits June 3, 2020 16:59

add whatsnew and remove unwanted benchmarks

c010399

remove conflict

2355fc5

Merge branch 'master' of https://github.com/pandas-dev/pandas into ad…

2648c3d

…d-nrows-to-read-json issues with merging

add whatsnew for nrows

7fcf3db

hasnain2808 mentioned this pull request Jun 3, 2020

Chunksize from json memory consumption as high as without chunksize #34548

Closed

3 tasks

solve doc error

9e667a1

hasnain2808 added 2 commits June 4, 2020 06:27

remove merge conflict lines

cb3de4d

Merge branch 'master' of https://github.com/pandas-dev/pandas into ad…

d14ff45

…d-nrows-to-read-json "solve merge conflicts while merging to master"

added the conflicting line back

2ce74db

hasnain2808 force-pushed the add-nrows-to-read-json branch from e9c54fe to 2ce74db Compare June 4, 2020 05:27

Merge branch 'master' of https://github.com/pandas-dev/pandas into ad…

133aef9

…d-nrows-to-read-json merge conflicts

hasnain2808 force-pushed the add-nrows-to-read-json branch from 18bf6e7 to 133aef9 Compare June 4, 2020 14:34

hasnain2808 added 2 commits June 4, 2020 23:26

Merge remote-tracking branch 'upstream/master' into add-nrows-to-read…

b3ee647

…-json

Merge branch 'master' of https://github.com/pandas-dev/pandas into ad…

b9a3ebd

…d-nrows-to-read-json "get commits for solved benchmark issues"

jreback approved these changes Jun 4, 2020

jreback mentioned this pull request Jun 4, 2020

DEPR: DateOffset.apply and DateOffset.apply_index #34580

Closed

jreback merged commit 89c5a59 into pandas-dev:master Jun 4, 2020
