-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
Read excel nrows #16672
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Read excel nrows #16672
Conversation
doc/source/whatsnew/v0.21.0.txt
Outdated
@@ -38,6 +38,7 @@ Other Enhancements | |||
- :func:`read_feather` has gained the ``nthreads`` parameter for multi-threaded operations (:issue:`16359`) | |||
- :func:`DataFrame.clip()` and :func: `Series.cip()` have gained an inplace argument. (:issue: `15388`) | |||
- :func:`crosstab` has gained a ``margins_name`` parameter to define the name of the row / column that will contain the totals when margins=True. (:issue:`15972`) | |||
- ``pd.read_excel()`` has a ``nrows`` parameter (:issue:`16645`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about:
pd.read_excel()
has gained the nrows
parameter (...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can make another entry (with this PR number), in api_breaking section that the kwargs are re-aranged to match pd.read_csv
use :func:`read_excel`
pandas/io/excel.py
Outdated
true_values=None, false_values=None, engine=None, | ||
squeeze=False, **kwds): | ||
def read_excel(io, sheet_name=0, header=0, skiprows=None, nrows=None, | ||
skip_footer=0, index_col=None, names=None, parse_cols=None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we normally don't like to shuffle parameters around in kwargs. Please align the ordering of these params as much as possible with how read_csv does it (obviously only include the current parameters).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, not a problem. I was using read_csv as a guide and saw that nrows
was directly after skiprows
, hence the change.
I will go ahead and line up the read_excel kwargs as close as possible to the read_csv kwargs. I see 3or4 out of order with just a quick glance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok great. prob need a slightly expanded whatsnew note to tell about this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I rearranged the kwargs and also updated docstrings to match parameter order (one of the things that always bugs me). Should I also update all internal function kwargs in excel.py
to match the external API?
I'm adding the following note to the Other Enhancements section (just wanted to make sure it was the right spot...):
- Rearranged the order of keyword arguments in :func:`read_excel()` to align with :func:`read_csv()`
Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes I would have the internal API match the external
pandas/tests/io/test_excel.py
Outdated
expected = expected[:num_rows_to_pull] | ||
tm.assert_frame_equal(actual, expected) | ||
|
||
with pytest.raises(ValueError): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gfyoung don't we have an issue about nrows
validation in the parser?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do, but I think you can circumvent the check via non read_csv
and read_table
calls (the check occurs in _read
, which does not get hit on read_excel
), which is annoying.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alysivji : Put this as a separate test, and use tm.assert_raises_regex
to check the specific error message raised in the ValueError
(we want it to be that validation failed).
expected = pd.read_excel(os.path.join(self.dirpath, | ||
'test1' + self.ext)) | ||
expected = expected[:num_rows_to_pull] | ||
tm.assert_frame_equal(actual, expected) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
try to pull more rows than exist in the file as well.
pandas/io/parsers.py
Outdated
@@ -999,6 +999,8 @@ def _failover_to_python(self): | |||
|
|||
def read(self, nrows=None): | |||
if nrows is not None: | |||
nrows = _validate_integer('nrows', nrows) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch (see my comment here. However, instead of littering our code with duplicate checks, here's what I think is best:
In your modified parsers.py
, there should now be two places where nrows = _validate_integer(...)
is called. Delete both of those.
Locate the read
method for the TextFileReader
in that same file, and add this validation check right before the if nrows is not None
. That should work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I originally didn't have this line there, but then I was getting the following error:
TypeError: '>=' not supported between instances of 'int' and 'str'
vs
ValueError: 'nrows' must be an integer >=0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes this change should be done in the parser itself. See if you can come up with an example that ONLY used pd.read_csv directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where should I put this test? There are a bunch in `tests/io/parser', but nothing for read_csv directly.
pandas/io/excel.py
Outdated
@@ -82,6 +82,8 @@ | |||
Rows to skip at the beginning (0-indexed) | |||
skip_footer : int, default 0 | |||
Rows at the end to skip (0-indexed) | |||
nrows : int, default None | |||
Number of rows to parse |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
versionadded 0.21.0 tag
pandas/io/excel.py
Outdated
true_values=None, false_values=None, engine=None, | ||
squeeze=False, **kwds): | ||
def read_excel(io, sheet_name=0, header=0, skiprows=None, nrows=None, | ||
skip_footer=0, index_col=None, names=None, parse_cols=None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes I would have the internal API match the external
doc/source/whatsnew/v0.21.0.txt
Outdated
@@ -38,6 +38,7 @@ Other Enhancements | |||
- :func:`read_feather` has gained the ``nthreads`` parameter for multi-threaded operations (:issue:`16359`) | |||
- :func:`DataFrame.clip()` and :func: `Series.cip()` have gained an inplace argument. (:issue: `15388`) | |||
- :func:`crosstab` has gained a ``margins_name`` parameter to define the name of the row / column that will contain the totals when margins=True. (:issue:`15972`) | |||
- ``pd.read_excel()`` has a ``nrows`` parameter (:issue:`16645`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can make another entry (with this PR number), in api_breaking section that the kwargs are re-aranged to match pd.read_csv
use :func:`read_excel`
pandas/io/parsers.py
Outdated
@@ -999,6 +999,8 @@ def _failover_to_python(self): | |||
|
|||
def read(self, nrows=None): | |||
if nrows is not None: | |||
nrows = _validate_integer('nrows', nrows) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes this change should be done in the parser itself. See if you can come up with an example that ONLY used pd.read_csv directly.
can you rebase/update |
Hello @alysivji! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on July 19, 2017 at 12:56 Hours UTC |
Codecov Report
@@ Coverage Diff @@
## master #16672 +/- ##
==========================================
+ Coverage 90.93% 90.93% +<.01%
==========================================
Files 161 161
Lines 49269 49270 +1
==========================================
+ Hits 44802 44803 +1
Misses 4467 4467
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #16672 +/- ##
==========================================
+ Coverage 90.93% 90.93% +<.01%
==========================================
Files 161 161
Lines 49269 49270 +1
==========================================
+ Hits 44802 44803 +1
Misses 4467 4467
Continue to review full report at Codecov.
|
can you rebase / update |
Sure, I got some time this week. Will take a look at it. |
pls rebase |
closing as stale. if you'd like to continue, pls ping. |
I got some time. Picked this back up. Sent another PR (18507) Thanks. |
git diff upstream/master --name-only -- '*.py' | flake8 --diff