ENH: support "nrows" and "chunksize" together #15756
Conversation
minor comment. lgtm.
can you also add some tests for nrows/chunksize that they must be positive integers?
doc/source/whatsnew/v0.20.0.txt
Outdated
@@ -291,6 +291,7 @@ Other enhancements
- ``Series`` provides a ``to_excel`` method to output Excel files (:issue:`8825`)
- The ``usecols`` argument in ``pd.read_csv`` now accepts a callable function as a value (:issue:`14154`)
- The ``skiprows`` argument in ``pd.read_csv`` now accepts a callable function as a value (:issue:`10882`)
- The ``nrows`` and ``chunksize`` arguments in ``pd.read_csv`` are no more incompatible (:issue:`15755`)
no more incompatible -> are supported if both are passed.
actually I think we need some tests if chunksize > nrows, and with get_chunk with nrows & chunksize specified (mostly what happens at the edge point).
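For illustration, a minimal sketch of the combined behaviour and of the edge cases mentioned above (the data and the expected chunk counts are illustrative assumptions, not the PR's actual tests):

import pandas as pd
from io import StringIO

data = "a,b\n1,2\n3,4\n5,6\n7,8\n9,10\n"

# chunksize larger than nrows: a single chunk, capped at nrows
reader = pd.read_csv(StringIO(data), nrows=3, chunksize=10)
chunks = list(reader)
assert len(chunks) == 1 and len(chunks[0]) == 3

# get_chunk with both nrows and chunksize: chunks never cross the
# nrows boundary, and iteration stops once nrows rows have been returned
reader = pd.read_csv(StringIO(data), nrows=3, chunksize=2)
assert len(reader.get_chunk()) == 2
assert len(reader.get_chunk()) == 1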
Codecov Report
@@            Coverage Diff            @@
##           master   #15756      +/-  ##
==========================================
- Coverage   91.01%      91%     -0.02%
==========================================
  Files         143      143
  Lines       49371    49370         -1
==========================================
- Hits        44937    44928         -9
- Misses       4434     4442         +8
Continue to review full report at Codecov.
shall we cast floats (current behavior, as of #10476), or raise?
nrows already has a validation function (which casts floats to int if possible); we can probably do the same for chunksize.
There is actually currently a memory leak/infinite loop when doing e.g. chunksize=0. Negative values for nrows/chunksize currently return an empty frame, which is kind of valid. If we want to change that, I would keep that for another PR/issue.
yep that's a good idea
Agree, but since chunksize=0 easily results, as @jorisvandenbossche mentioned, in infinite loops, I would impose chunksize >= 1. And maybe for (partial - asking nrows=0 is fine) coherence we want to impose nrows >= 0? (I would like this, because negative chunksize/nrows could be misinterpreted as parsing starting from the end...) In the meantime, I have pushed the requested tests for the original edit, in case we want to move the rest of the discussion to another PR.
I think this is kind of wonky. We don't actually support indexing from the end of the file. I would raise on this (can do in another independent issue).
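A sketch of the kind of shared validation being discussed; the helper name, the bounds and the error messages are hypothetical, not the code as merged:

def _validate_positive_integer(name, value, min_val):
    # Hypothetical helper: cast integral floats (e.g. 5.0) to int and
    # enforce a lower bound, raising instead of silently accepting
    # negative or fractional values.
    if value is None:
        return None
    if isinstance(value, float):
        if not value.is_integer():
            raise ValueError("'%s' must be an integer >= %d" % (name, min_val))
        value = int(value)
    if not isinstance(value, int) or value < min_val:
        raise ValueError("'%s' must be an integer >= %d" % (name, min_val))
    return value

# Following the proposal above: nrows=0 is fine, chunksize=0 is not.
# nrows = _validate_positive_integer('nrows', nrows, 0)
# chunksize = _validate_positive_integer('chunksize', chunksize, 1)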
@@ -1009,6 +999,10 @@ def _create_index(self, ret):
    def get_chunk(self, size=None):
        if size is None:
            size = self.chunksize
        if self.nrows is not None:
            if self._currow >= self.nrows:
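For context, a hedged reconstruction of the full method as the discussion below describes it (the diff above is truncated; the remaining added lines are inferred, not quoted):

def get_chunk(self, size=None):
    if size is None:
        size = self.chunksize
    if self.nrows is not None:
        # Stop explicitly once nrows has been consumed; capping size at 0
        # would make read(nrows=0) return empty frames forever (see below).
        if self._currow >= self.nrows:
            raise StopIteration
        # Never read past the nrows boundary.
        size = min(size, self.nrows - self._currow)
    return self.read(nrows=size)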
why isn't this just

if self.nrows is not None:
    if self._currow >= self.nrows:
        size = 0
    size = min(....)
return self.read(...)

?
raising StopIteration directly here seems odd, as self.read(nrows=0) correctly returns the result (as an empty DataFrame)
Because a parser's .read() raises (or at least, can legitimately raise) StopIteration only when the current row increases above self.nrows, and this never happens if you keep asking for 0 rows at a time. This is not hypothetical - the "# With nrows" step of test_read_chunksize in pandas/tests/io/parser/common.py hangs for the PythonParser (and possibly others) if I change the code as you suggest.
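A compact sketch of the failure mode described above (illustrative only, not verbatim parser internals):

# Iteration over a chunked reader ends only when read() raises
# StopIteration. If get_chunk() capped size at 0 instead of raising,
# read(nrows=0) would keep returning empty DataFrames and this loop
# would never terminate.
def iter_chunks(reader):
    while True:
        try:
            yield reader.get_chunk()  # with size == 0 this always "succeeds"
        except StopIteration:
            break  # never reached in the size-0 scenario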
hmm, ok
cc @gfyoung if you can have a look.
LGTM!
thanks @toobaz
You're welcome!
closes pandas-dev#15755

Author: Pietro Battiston <[email protected]>

Closes pandas-dev#15756 from toobaz/nrows_chunksize and squashes the following commits:

d0288e3 [Pietro Battiston] ENH: support "nrows" and "chunksize" together
git diff master | flake8 --diff
(except for whatsnew/v0.20.0.txt, I don't know why)