
How to use nrows along with chunksize in read_json() #36791


Closed
2 tasks done
madolmo opened this issue Oct 1, 2020 · 2 comments · Fixed by #38293
Labels
IO JSON read_json, to_json, json_normalize Regression Functionality that used to work in a prior pandas version

Comments

@madolmo

madolmo commented Oct 1, 2020

  • I have searched the [pandas] tag on StackOverflow for similar questions.

  • I have asked my usage related question on StackOverflow.


Question about pandas

Why do I need to use nrows when reading large JSON Lines files with the chunksize option?
Since version 1.1 I have been having trouble with read_json(): even if I specify chunksize with the correct value (the value that used to work with pandas 1.0.5), the file seems to be read all at once, which causes a memory error in my case. If I add the nrows option this doesn't happen, but why? And what value do I have to specify for nrows in order to load the entire file? Do I have to know the maximum number of rows in advance? Is there a special value for "all rows", like -1 or 0?

Thanks

import pandas as pd

# This raises a MemoryError (with a 4 GB file); it worked on version 1.0.5.
reader = pd.read_json(f"{path}map_records.json", orient="records", lines=True, chunksize=100000)
chunks = [chunk[(chunk.bidbasket == "BSKGEOALL00000000001") & (chunk.tipomappa == "AULTIPMPS_GIT")][["bidsubzona", "idoriginale", "bidciv", "bidbasket", "tipomappa"]] for chunk in reader]

# This works, but it loads at most <nrows> rows, so I have to know the maximum number of rows in advance.
reader = pd.read_json(f"{path}map_records.json", orient="records", lines=True, chunksize=100000, nrows=20000000)
chunks = [chunk[(chunk.bidbasket == "BSKGEOALL00000000001") & (chunk.tipomappa == "AULTIPMPS_GIT")][["bidsubzona", "idoriginale", "bidciv", "bidbasket", "tipomappa"]] for chunk in reader]

@madolmo madolmo added Needs Triage Issue that has not been reviewed by a pandas team member Usage Question labels Oct 1, 2020
@alex-warmath

alex-warmath commented Nov 29, 2020

I also ran into this problem, and it seems like a true bug. According to the documentation, the answer to your question should be that no nrows argument is necessary. Since version 1.1, however, you have to specify nrows or else you get a MemoryError. The current workaround seems to be passing a massive value for nrows, since anything higher than the true row count makes no difference. You don't have to know the number of rows in advance, just an upper bound.

I attached images of where I had the same problem -- note that the error disappears when you specify a massive value for nrows.

[Image: memory error]

[Image: no memory error]
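
For readers who want to try the workaround described above, here is a minimal sketch. It uses a tiny in-memory JSON Lines sample instead of a real file, and the column names (`id`, `kind`) and the constant `NROWS_UPPER_BOUND` are hypothetical, chosen only for illustration. It assumes a pandas version that accepts `nrows` in `read_json()` (1.1 or later).

```python
import io

import pandas as pd

# Hypothetical stand-in for a large JSON Lines file: five small records.
raw = "\n".join(
    '{"id": %d, "kind": "%s"}' % (i, "a" if i % 2 == 0 else "b") for i in range(5)
)

# Assumption: any value above the real row count acts as "all rows".
NROWS_UPPER_BOUND = 10**9

reader = pd.read_json(
    io.StringIO(raw),
    orient="records",
    lines=True,
    chunksize=2,              # read two lines per chunk
    nrows=NROWS_UPPER_BOUND,  # workaround: an upper bound, not the exact count
)

# Filter each chunk as it arrives, then combine the surviving rows.
filtered = [chunk[chunk["kind"] == "a"] for chunk in reader]
result = pd.concat(filtered, ignore_index=True)
print(result)
```

Only the filtered rows of each chunk are kept, so peak memory should depend on the chunk size rather than on the size of the whole file, provided chunking itself works as intended.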

@robertwb
Contributor

robertwb commented Dec 4, 2020

This is a duplicate of #34548, which was caused by a bug introduced when nrows was implemented. (The workaround is to set nrows to a very large number.)
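
If the same code has to run on several pandas versions, the workaround can be applied only where it is needed. The sketch below is an assumption-laden illustration, not part of pandas: the helper name `chunked_read_json_kwargs` is hypothetical, and it assumes the affected releases are the 1.1.x series, which is what the reports in this thread and its 1.2 milestone suggest.

```python
def chunked_read_json_kwargs(pandas_version: str, chunksize: int) -> dict:
    """Build keyword arguments for pd.read_json(..., lines=True).

    Hypothetical helper: adds a very large nrows only on pandas 1.1.x,
    which this thread reports as affected.
    """
    major, minor = (int(part) for part in pandas_version.split(".")[:2])
    kwargs = {"lines": True, "chunksize": chunksize}
    if (major, minor) == (1, 1):
        # Assumption: any value above the real row count means "all rows".
        kwargs["nrows"] = 10**9
    return kwargs


print(chunked_read_json_kwargs("1.1.3", 100000))
print(chunked_read_json_kwargs("1.0.5", 100000))
```

In practice the version string would come from `pd.__version__`, and the result would be passed as `pd.read_json(path, **kwargs)`.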

robertwb added a commit to robertwb/pandas that referenced this issue Dec 4, 2020
@jreback jreback added this to the 1.2 milestone Dec 8, 2020
@jreback jreback added Bug IO JSON read_json, to_json, json_normalize and removed Needs Triage Issue that has not been reviewed by a pandas team member Usage Question labels Dec 8, 2020
@jorisvandenbossche jorisvandenbossche added Regression Functionality that used to work in a prior pandas version and removed Bug labels Dec 11, 2020
@jreback jreback modified the milestones: 1.2, Contributions Welcome Dec 13, 2020
@jreback jreback modified the milestones: Contributions Welcome, 1.2 Dec 22, 2020
5 participants