How to use nrows along with chunksize in read_json() #36791

madolmo · 2020-10-01T21:53:46Z

I have searched the [pandas] tag on StackOverflow for similar questions.
I have asked my usage related question on StackOverflow.

Question about pandas

Why do I need to use nrows when reading large JSON Lines files with the chunksize option?
Since version 1.1 I have been having trouble with read_json(): even if I specify chunksize with the correct value (the value that used to work with pandas 1.0.5), the file seems to be read all at once, which raises a MemoryError in my case. If I add the nrows option this doesn't happen, but why? And what value do I have to specify for nrows in order to load the entire file? Do I have to know the maximum number of rows in advance? Is there a special value for "all rows", like -1 or 0?

Thanks

# This raises a MemoryError (with a 4 GB file); it worked on version 1.0.5.
import pandas as pd

reader = pd.read_json(f"{path}map_records.json", orient='records', lines=True, chunksize=100000)
chunks = [chunk[(chunk.bidbasket == "BSKGEOALL00000000001") & (chunk.tipomappa == "AULTIPMPS_GIT")][['bidsubzona', 'idoriginale', 'bidciv', 'bidbasket', 'tipomappa']] for chunk in reader]

# This works, but it loads at most <nrows> rows, so I have to know the maximum number of rows in advance.
reader = pd.read_json(f"{path}map_records.json", orient='records', lines=True, chunksize=100000, nrows=20000000)
chunks = [chunk[(chunk.bidbasket == "BSKGEOALL00000000001") & (chunk.tipomappa == "AULTIPMPS_GIT")][['bidsubzona', 'idoriginale', 'bidciv', 'bidbasket', 'tipomappa']] for chunk in reader]

alex-warmath · 2020-11-29T22:30:13Z

I also ran into this problem, and it looks like a true bug. According to the documentation, the answer to your question should be that no nrows argument is necessary. Since version 1.1, however, you have to specify nrows or else you get a MemoryError. The current workaround seems to be passing a very large value for nrows, since anything higher than the true row count makes no difference. You don't have to know the number of rows in advance, just an upper bound.

I attached images of where I had the same problem. Note that the error disappears when you specify a very large value for nrows.
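
A minimal sketch of that workaround, assuming a small throwaway JSON Lines file in place of the real 4 GB one. The sample file, its values, and the `NROWS_UPPER_BOUND` constant are made up for illustration; only `orient`, `lines`, `chunksize`, and `nrows` come from this thread.

```python
import json
import os
import tempfile

import pandas as pd

# Hypothetical sample data standing in for map_records.json.
records = [
    {
        "bidbasket": "BSKGEOALL00000000001" if i % 2 == 0 else "OTHER",
        "tipomappa": "AULTIPMPS_GIT",
        "bidciv": i,
    }
    for i in range(10)
]

sample_path = os.path.join(tempfile.mkdtemp(), "sample_records.json")
with open(sample_path, "w") as fh:
    for rec in records:
        fh.write(json.dumps(rec) + "\n")

# Illustrative constant: any value at or above the true row count
# should behave the same, so a generous upper bound is enough.
NROWS_UPPER_BOUND = 10**9

reader = pd.read_json(
    sample_path,
    orient="records",
    lines=True,
    chunksize=4,
    nrows=NROWS_UPPER_BOUND,
)

# Filter each chunk as it arrives, then combine the small filtered pieces.
filtered_chunks = [
    chunk.loc[chunk["bidbasket"] == "BSKGEOALL00000000001", ["bidbasket", "bidciv"]]
    for chunk in reader
]
result = pd.concat(filtered_chunks, ignore_index=True)
print(result)
```

Filtering inside the loop keeps only the matching rows of each chunk in memory, which is the point of using chunksize on a file this large.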

robertwb · 2020-12-04T18:33:03Z

This is a duplicate of #34548, which was caused by a bug in the implementation of nrows. (The workaround is to set nrows to a very large number.)
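
For anyone pinned to an affected release, the same chunk-at-a-time behaviour can be approximated without read_json. The following is a sketch using only the standard library, not how pandas implements chunksize; `iter_json_line_chunks` and the sample data are hypothetical names chosen for illustration.

```python
import io
import json
from itertools import islice


def iter_json_line_chunks(lines, chunksize):
    """Yield lists of parsed records, at most `chunksize` records per list."""
    lines = iter(lines)
    while True:
        # islice pulls only `chunksize` lines, so the input is never
        # materialised in full.
        raw = list(islice(lines, chunksize))
        if not raw:
            return
        batch = [json.loads(line) for line in raw if line.strip()]
        if batch:
            yield batch


# Hypothetical in-memory stand-in for a large JSON Lines file; an open
# file object can be passed in the same way.
sample = io.StringIO(
    "\n".join(
        json.dumps({"bidbasket": "A" if i % 3 == 0 else "B", "bidciv": i})
        for i in range(7)
    )
)

chunk_sizes = []
matches = []
for batch in iter_json_line_chunks(sample, chunksize=3):
    chunk_sizes.append(len(batch))
    matches.extend(rec["bidciv"] for rec in batch if rec["bidbasket"] == "A")

print(chunk_sizes, matches)
```

Each batch could also be passed to pd.DataFrame(batch) if the per-chunk filtering is easier to express on a DataFrame.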

madolmo added Needs Triage Issue that has not been reviewed by a pandas team member Usage Question labels Oct 1, 2020

robertwb added a commit to robertwb/pandas that referenced this issue Dec 4, 2020

Fix pandas-dev#36791 and pandas-dev#34548 correctly respecting read_j…

0b477a6

…son chunksize.

robertwb mentioned this issue Dec 4, 2020

BUG: read_json does not respect chunksize #38293

Merged

6 tasks

jreback added this to the 1.2 milestone Dec 8, 2020

jreback added Bug IO JSON read_json, to_json, json_normalize and removed Needs Triage Issue that has not been reviewed by a pandas team member Usage Question labels Dec 8, 2020

jorisvandenbossche added Regression Functionality that used to work in a prior pandas version and removed Bug labels Dec 11, 2020

jreback modified the milestones: 1.2, Contributions Welcome Dec 13, 2020

jreback modified the milestones: Contributions Welcome, 1.2 Dec 22, 2020

jreback closed this as completed in #38293 Dec 23, 2020

simonm3 mentioned this issue Jun 9, 2021

BUG: read_json chunksize parameter does not work #41905

Closed
