ENH: support "nrows" and "chunksize" together #15756


Closed
toobaz wants to merge 1 commit into master from nrows_chunksize

Conversation

Member

@toobaz toobaz commented Mar 21, 2017

Contributor

@jreback jreback left a comment


minor comment. lgtm.

can you also add some tests for nrows/chunksize that they must be positive integers?

@@ -291,6 +291,7 @@ Other enhancements
- ``Series`` provides a ``to_excel`` method to output Excel files (:issue:`8825`)
- The ``usecols`` argument in ``pd.read_csv`` now accepts a callable function as a value (:issue:`14154`)
- The ``skiprows`` argument in ``pd.read_csv`` now accepts a callable function as a value (:issue:`10882`)
- The ``nrows`` and ``chunksize`` arguments in ``pd.read_csv`` are no more incompatible (:issue:`15755`)
Contributor

no more incompatible -> are supported if both are passed.
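For context, the behavior this whatsnew entry describes can be sketched as follows (the data is illustrative, not from the PR's test suite):

```python
# Minimal illustration of the enhancement: pd.read_csv now accepts
# nrows and chunksize together instead of raising. The example data
# here is made up for illustration.
from io import StringIO
import pandas as pd

data = "a,b\n1,2\n3,4\n5,6\n7,8\n"

# Read at most 3 rows, in chunks of 2: yields a 2-row and a 1-row chunk.
chunks = list(pd.read_csv(StringIO(data), nrows=3, chunksize=2))
print([len(c) for c in chunks])  # [2, 1]
```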

@jreback jreback added Enhancement IO CSV read_csv, to_csv labels Mar 21, 2017
Contributor

jreback commented Mar 21, 2017

actually I think we need some tests if chunksize > nrows.

and with get_chunk with nrows & chunksize specified. (mostly what happens at the edge point).
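The edge case being asked about can be sketched roughly like this (a hypothetical test, not the one ultimately added in the PR):

```python
# Hypothetical sketch of the requested edge-case test: chunksize larger
# than nrows, with an explicit get_chunk() call. The data is illustrative.
from io import StringIO
import pandas as pd

data = "a,b\n" + "\n".join("%d,%d" % (i, i * 2) for i in range(10)) + "\n"

# chunksize > nrows: the first chunk must be truncated to nrows rows,
# since the reader may never yield more than nrows rows in total.
reader = pd.read_csv(StringIO(data), chunksize=8, nrows=5)
chunk = reader.get_chunk()
print(len(chunk))  # 5
```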


codecov bot commented Mar 21, 2017

Codecov Report

Merging #15756 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #15756      +/-   ##
==========================================
- Coverage   91.01%      91%   -0.02%     
==========================================
  Files         143      143              
  Lines       49371    49370       -1     
==========================================
- Hits        44937    44928       -9     
- Misses       4434     4442       +8
Impacted Files Coverage Δ
pandas/io/parsers.py 95.51% <100%> (-0.01%) ⬇️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.86% <0%> (-0.1%) ⬇️
pandas/core/common.py 91.3% <0%> (+0.33%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1e753d7...d0288e3.

Member Author

toobaz commented Mar 21, 2017

can you also add some tests for nrows/chunksize that they must be positive integers?

shall we cast floats (current behavior, as of #10476), or raise?

@jorisvandenbossche
Member

can you also add some tests for nrows/chunksize that they must be positive integers?

shall we cast floats (current behavior, as of #10476), or raise?

nrows already has a validation function (which casts to float if possible), we can probably do the same for chunksize

@jorisvandenbossche
Member

There is currently actually a memory leak/infinite loop when doing e.g. chunksize=0.5, so it would be good to validate chunksize the same way as nrows.

Negative values for nrows/chunksize currently return an empty frame, which is kind of valid. If we want to change that, I would keep that for another PR/issue.
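A rough sketch of the kind of validation being proposed (the function name and signature here are illustrative, not the actual pandas helper):

```python
# Illustrative validator modeled on the discussion above: accept positive
# integers (and integer-valued floats such as 5.0, matching the existing
# nrows behavior), and reject values like 0.5 that would otherwise send
# chunked reading into an infinite loop.
def validate_chunksize(name, value, min_val=1):
    if value is None:
        return None
    if isinstance(value, float):
        if not value.is_integer():
            raise ValueError("'%s' must be an integer >= %d" % (name, min_val))
        value = int(value)
    if not isinstance(value, int) or value < min_val:
        raise ValueError("'%s' must be an integer >= %d" % (name, min_val))
    return value

# e.g. validate_chunksize("chunksize", 5.0) returns 5,
# while validate_chunksize("chunksize", 0.5) raises ValueError.
```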

Contributor

jreback commented Mar 21, 2017

nrows already has a validation function (which casts to float if possible), we can probably do the same for chunksize

yep that's a good idea

Member Author

toobaz commented Mar 21, 2017

nrows already has a validation function (which casts to float if possible), we can probably do the same for chunksize

Agree, but since chunksize=0 easily results, as @jorisvandenbossche mentioned, in infinite loops, I would impose chunksize >= 1. And maybe, for coherence (partial coherence, since asking for nrows=0 is fine), we want to impose nrows >= 0?

(I would like this, because negative chunksize/nrows could be misinterpreted as parsing starting from the end...)

In the meantime, I have pushed the requested tests for the original edit, in case we want to move the rest of the discussion to another PR.

Contributor

jreback commented Mar 21, 2017

Negative values for nrows/chunksize currently return an empty frame, which is kind of valid. If we want to change that, I would keep that for another PR/issue.

I think this is kind of wonky. We don't actually support indexing from the end of the file. I would raise on this. (can do in another independent issue).

@@ -1009,6 +999,10 @@ def _create_index(self, ret):
def get_chunk(self, size=None):
if size is None:
size = self.chunksize
if self.nrows is not None:
if self._currow >= self.nrows:

@jreback jreback Mar 21, 2017


why isn't this just

if self.nrows is not None:
    if self._currow >= self.nrows:
        size = 0
    size = min(....)
return self.read(...)

?

Contributor

Raising StopIteration directly here seems odd, as self.read(nrows=0) correctly returns the result (an empty DataFrame).

Member Author

@toobaz toobaz Mar 21, 2017


Because a parser's .read() raises (or at least, can legitimately raise) StopIteration only when the current row increases above self.nrows, and this never happens if you keep asking for 0 rows at a time. This is not hypothetical - the "# With nrows" step of test_read_chunksize in pandas/tests/io/parser/common.py hangs for the PythonParser (and possibly others) if I change the code as you suggest.
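The termination argument above can be reproduced with a toy reader (a simplified model, not the pandas parser itself): a parser whose read() only advances a row counter never terminates if get_chunk keeps requesting 0 rows, whereas raising StopIteration once the counter reaches nrows ends iteration immediately.

```python
# Simplified model of the get_chunk logic under discussion. Raising
# StopIteration once _currow reaches nrows is what terminates iteration;
# returning read(0) forever would loop indefinitely for a parser that
# only signals exhaustion via the row counter.
class ToyChunkReader:
    def __init__(self, nrows, chunksize):
        self.nrows = nrows
        self.chunksize = chunksize
        self._currow = 0

    def read(self, nrows):
        # Stand-in for the real parser's read(): return `nrows` rows
        # and advance the current-row counter.
        rows = list(range(self._currow, self._currow + nrows))
        self._currow += nrows
        return rows

    def get_chunk(self, size=None):
        if size is None:
            size = self.chunksize
        if self.nrows is not None:
            if self._currow >= self.nrows:
                raise StopIteration
            # Truncate the last chunk so we never exceed nrows in total.
            size = min(size, self.nrows - self._currow)
        return self.read(size)

reader = ToyChunkReader(nrows=5, chunksize=2)
chunks = []
while True:
    try:
        chunks.append(reader.get_chunk())
    except StopIteration:
        break
print(chunks)  # [[0, 1], [2, 3], [4]]
```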

Contributor

hmm, ok

Contributor

jreback commented Mar 21, 2017

cc @gfyoung if you can have a look.

@jreback jreback added this to the 0.20.0 milestone Mar 21, 2017
Member

gfyoung commented Mar 21, 2017

LGTM!

@jreback jreback closed this in 163d18e Mar 21, 2017
Contributor

jreback commented Mar 21, 2017

thanks @toobaz

Member Author

toobaz commented Mar 21, 2017

You're welcome!

@toobaz toobaz deleted the nrows_chunksize branch March 21, 2017 18:54
mattip pushed a commit to mattip/pandas that referenced this pull request Apr 3, 2017
closes pandas-dev#15755

Author: Pietro Battiston <[email protected]>

Closes pandas-dev#15756 from toobaz/nrows_chunksize and squashes the following commits:

d0288e3 [Pietro Battiston] ENH: support "nrows" and "chunksize" together
Labels
Enhancement IO CSV read_csv, to_csv

Successfully merging this pull request may close these issues.

nrows incompatible with chunksize in read_csv
4 participants