ENH: support "nrows" and "chunksize" together #15756


Closed
toobaz wants to merge 1 commit into master from nrows_chunksize

Conversation

Member

@toobaz toobaz commented Mar 21, 2017

Contributor

@jreback jreback left a comment


minor comment. lgtm.

can you also add some tests for nrows/chunksize that they must be positive integers?

@@ -291,6 +291,7 @@ Other enhancements
- ``Series`` provides a ``to_excel`` method to output Excel files (:issue:`8825`)
- The ``usecols`` argument in ``pd.read_csv`` now accepts a callable function as a value (:issue:`14154`)
- The ``skiprows`` argument in ``pd.read_csv`` now accepts a callable function as a value (:issue:`10882`)
- The ``nrows`` and ``chunksize`` arguments in ``pd.read_csv`` are no more incompatible (:issue:`15755`)
Contributor

no more incompatible -> are supported if both are passed.
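For context, the behavior this whatsnew entry describes can be sketched as follows (the data is illustrative, not from the PR's test suite):

```python
# Minimal illustration of the enhancement: pd.read_csv now accepts
# nrows and chunksize together instead of raising. The example data
# here is made up for illustration.
from io import StringIO
import pandas as pd

data = "a,b\n1,2\n3,4\n5,6\n7,8\n"

# Read at most 3 rows, in chunks of 2: yields a 2-row and a 1-row chunk.
chunks = list(pd.read_csv(StringIO(data), nrows=3, chunksize=2))
print([len(c) for c in chunks])  # [2, 1]
```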

@jreback jreback added Enhancement IO CSV read_csv, to_csv labels Mar 21, 2017
Contributor

jreback commented Mar 21, 2017

actually I think we need some tests if chunksize > nrows.

and with get_chunk with nrows & chunksize specified. (mostly what happens at the edge point).
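The edge case being asked about can be sketched roughly like this (a hypothetical test, not the one ultimately added in the PR):

```python
# Hypothetical sketch of the requested edge-case test: chunksize larger
# than nrows, with an explicit get_chunk() call. The data is illustrative.
from io import StringIO
import pandas as pd

data = "a,b\n" + "\n".join("%d,%d" % (i, i * 2) for i in range(10)) + "\n"

# chunksize > nrows: the first chunk must be truncated to nrows rows,
# since the reader may never yield more than nrows rows in total.
reader = pd.read_csv(StringIO(data), chunksize=8, nrows=5)
chunk = reader.get_chunk()
print(len(chunk))  # 5
```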


codecov bot commented Mar 21, 2017

Codecov Report

Merging #15756 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #15756      +/-   ##
==========================================
- Coverage   91.01%      91%   -0.02%     
==========================================
  Files         143      143              
  Lines       49371    49370       -1     
==========================================
- Hits        44937    44928       -9     
- Misses       4434     4442       +8
Impacted Files Coverage Δ
pandas/io/parsers.py 95.51% <100%> (-0.01%) ⬇️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.86% <0%> (-0.1%) ⬇️
pandas/core/common.py 91.3% <0%> (+0.33%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1e753d7...d0288e3.

Member Author

toobaz commented Mar 21, 2017

can you also add some tests for nrows/chunksize that they must be positive integers?

shall we cast floats (current behavior, as of #10476), or raise?

@jorisvandenbossche
Member

can you also add some tests for nrows/chunksize that they must be positive integers?

shall we cast floats (current behavior, as of #10476), or raise?

nrows already has a validation function (which casts to float if possible), we can probably do the same for chunksize

@jorisvandenbossche
Member

There is currently actually a memory leak/infinite loop when doing e.g. chunksize=0.5, so it would be good to validate chunksize the same way as nrows.

Negative values for nrows/chunksize currently return an empty frame, which is kind of valid. If we want to change that, I would keep that for another PR/issue.
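A rough sketch of the kind of validation being proposed (the function name and signature here are illustrative, not the actual pandas helper):

```python
# Illustrative validator modeled on the discussion above: accept positive
# integers (and integer-valued floats such as 5.0, matching the existing
# nrows behavior), and reject values like 0.5 that would otherwise send
# chunked reading into an infinite loop.
def validate_chunksize(name, value, min_val=1):
    if value is None:
        return None
    if isinstance(value, float):
        if not value.is_integer():
            raise ValueError("'%s' must be an integer >= %d" % (name, min_val))
        value = int(value)
    if not isinstance(value, int) or value < min_val:
        raise ValueError("'%s' must be an integer >= %d" % (name, min_val))
    return value

# e.g. validate_chunksize("chunksize", 5.0) returns 5,
# while validate_chunksize("chunksize", 0.5) raises ValueError.
```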

Contributor

jreback commented Mar 21, 2017

nrows already has a validation function (which casts to float if possible), we can probably do the same for chunksize

yep that's a good idea

Member Author

toobaz commented Mar 21, 2017

nrows already has a validation function (which casts to float if possible), we can probably do the same for chunksize

Agree, but since chunksize=0 easily results, as @jorisvandenbossche mentioned, in infinite loops, I would impose chunksize >= 1. And maybe, for coherence (partial coherence, since asking for nrows=0 is fine), we want to impose nrows >= 0?

(I would like this, because negative chunksize/nrows could be misinterpreted as parsing starting from the end...)

In the meantime, I have pushed the requested tests for the original edit, in case we want to move the rest of the discussion to another PR.

Contributor

jreback commented Mar 21, 2017

Negative values for nrows/chunksize currently return an empty frame, which is kind of valid. If we want to change that, I would keep that for another PR/issue.

I think this is kind of wonky. We don't actually support indexing from the end of the file. I would raise on this. (can do in another independent issue).

@@ -1009,6 +999,10 @@ def _create_index(self, ret):
def get_chunk(self, size=None):
if size is None:
size = self.chunksize
if self.nrows is not None:
if self._currow >= self.nrows:

@jreback jreback Mar 21, 2017


why isn't this just

if self.nrows is not None:
    if self._currow >= self.nrows:
        size = 0
    size = min(....)
return self.read(...)

?

Contributor

Raising StopIteration directly here seems odd, as self.read(nrows=0) correctly returns the result (an empty DataFrame).

Member Author

@toobaz toobaz Mar 21, 2017


Because a parser's .read() raises (or at least, can legitimately raise) StopIteration only when the current row increases above self.nrows, and this never happens if you keep asking for 0 rows at a time. This is not hypothetical - the "# With nrows" step of test_read_chunksize in pandas/tests/io/parser/common.py hangs for the PythonParser (and possibly others) if I change the code as you suggest.
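The termination argument above can be reproduced with a toy reader (a simplified model, not the pandas parser itself): a parser whose read() only advances a row counter never terminates if get_chunk keeps requesting 0 rows, whereas raising StopIteration once the counter reaches nrows ends iteration immediately.

```python
# Simplified model of the get_chunk logic under discussion. Raising
# StopIteration once _currow reaches nrows is what terminates iteration;
# returning read(0) forever would loop indefinitely for a parser that
# only signals exhaustion via the row counter.
class ToyChunkReader:
    def __init__(self, nrows, chunksize):
        self.nrows = nrows
        self.chunksize = chunksize
        self._currow = 0

    def read(self, nrows):
        # Stand-in for the real parser's read(): return `nrows` rows
        # and advance the current-row counter.
        rows = list(range(self._currow, self._currow + nrows))
        self._currow += nrows
        return rows

    def get_chunk(self, size=None):
        if size is None:
            size = self.chunksize
        if self.nrows is not None:
            if self._currow >= self.nrows:
                raise StopIteration
            # Truncate the last chunk so we never exceed nrows in total.
            size = min(size, self.nrows - self._currow)
        return self.read(size)

reader = ToyChunkReader(nrows=5, chunksize=2)
chunks = []
while True:
    try:
        chunks.append(reader.get_chunk())
    except StopIteration:
        break
print(chunks)  # [[0, 1], [2, 3], [4]]
```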

Contributor

hmm, ok

Contributor

jreback commented Mar 21, 2017

cc @gfyoung if you can have a look.

@jreback jreback added this to the 0.20.0 milestone Mar 21, 2017
Member

gfyoung commented Mar 21, 2017

LGTM!

@jreback jreback closed this in 163d18e Mar 21, 2017
Contributor

jreback commented Mar 21, 2017

thanks @toobaz

Member Author

toobaz commented Mar 21, 2017

You're welcome!

@toobaz toobaz deleted the nrows_chunksize branch March 21, 2017 18:54
mattip pushed a commit to mattip/pandas that referenced this pull request Apr 3, 2017
closes pandas-dev#15755

Author: Pietro Battiston <[email protected]>

Closes pandas-dev#15756 from toobaz/nrows_chunksize and squashes the following commits:

d0288e3 [Pietro Battiston] ENH: support "nrows" and "chunksize" together
Labels
Enhancement IO CSV read_csv, to_csv

Successfully merging this pull request may close these issues.

nrows incompatible with chunksize in read_csv
4 participants