BUG: read_json chunksize parameter does not work #41905

simonm3 · 2021-06-09T17:57:44Z

If I set chunksize=100 without nrows, it crashes with a memory error. If I set both chunksize and nrows, it returns whichever of the two row counts is larger. The workaround of setting nrows to a high number does not work: it still crashes with a memory error, even with a low chunksize.

I am using ubuntu20 wsl2/windows10, python=3.8, pandas=1.2.4

This has been reported in other closed issues but does not seem to be fixed.

#34548
#36791

aacosta13 · 2021-06-10T05:29:11Z

Is there any way you could post the code and/or dataset that you attempted to run with? Just tested it myself using the same Yelp dataset as one of the comments from the issue (#36791) you linked and it seems that chunksize worked with both a value of 100 and 1000.

simonm3 · 2021-06-10T11:00:28Z

This is the link: http://download.companieshouse.gov.uk/en_pscdata.html Below is an attempt to read the full 6GB file with chunksize=10. I tried this on Windows and on WSL/Ubuntu. For testing, try one of the smaller files from the link; it fits in memory but takes a very long time and returns 500K rows. [image: image.png]


simonm3 · 2021-06-10T11:03:05Z

This is the code:

import pandas as pd
#path = r"C:\Users\simon\Downloads\raw\persons-with-significant-control-snapshot-2021-06-05.txt"
path = r"C:\Users\simon\Downloads\raw\psc-snapshot-2021-06-10_1of19.txt"
df = pd.read_json(path, lines=True, chunksize=10)
df.read()

simonm3 · 2021-06-10T11:06:07Z

Oh. I thought you called read to read a chunk but actually it is an iterator. Sorry my mistake!
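For anyone landing here with the same confusion, here is a minimal sketch of iterating over the chunks instead of calling read(). With lines=True and a chunksize, pd.read_json returns a reader object rather than a DataFrame, and as far as I can tell calling read() on it loads the whole file, which would explain the memory error above. The temporary file and its id/name fields are made up for illustration and stand in for the Companies House snapshot; the with-statement form assumes pandas 1.2 or later.

```python
import json
import os
import tempfile

import pandas as pd

# Build a small line-delimited JSON file (hypothetical stand-in for the real data).
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "sample.jsonl")
with open(path, "w") as f:
    for i in range(25):
        f.write(json.dumps({"id": i, "name": f"row{i}"}) + "\n")

chunk_sizes = []
# The reader yields one DataFrame of up to `chunksize` rows per iteration,
# so only one chunk needs to be held in memory at a time.
with pd.read_json(path, lines=True, chunksize=10) as reader:
    for chunk in reader:
        chunk_sizes.append(len(chunk))

print(chunk_sizes)
```

Each chunk is an ordinary DataFrame, so any per-chunk filtering or aggregation can go inside the loop.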

simonm3 added the Bug and Needs Triage labels Jun 9, 2021

simonm3 closed this as completed Jun 10, 2021
