Read excel nrows #16672

Closed · wants to merge 5 commits
Conversation

alysivji
Contributor

@@ -38,6 +38,7 @@ Other Enhancements
- :func:`read_feather` has gained the ``nthreads`` parameter for multi-threaded operations (:issue:`16359`)
- :func:`DataFrame.clip()` and :func: `Series.cip()` have gained an inplace argument. (:issue: `15388`)
- :func:`crosstab` has gained a ``margins_name`` parameter to define the name of the row / column that will contain the totals when margins=True. (:issue:`15972`)
- ``pd.read_excel()`` has a ``nrows`` parameter (:issue:`16645`)
Member

How about:

pd.read_excel() has gained the nrows parameter (...)
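For context, a minimal sketch of what the new parameter enables (illustrative, not from the PR). Since ``nrows`` on ``read_excel`` is meant to mirror ``read_csv``, the CSV reader (where ``nrows`` already exists) shows the intended behaviour without needing an Excel engine installed:

```python
from io import StringIO
import pandas as pd

# nrows limits parsing to the first N data rows of the input,
# which is the behaviour read_excel is gaining here.
data = StringIO("a,b\n1,2\n3,4\n5,6\n7,8\n")
df = pd.read_csv(data, nrows=2)
print(len(df))  # 2
```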

Contributor

you can make another entry (with this PR number) in the api_breaking section noting that the kwargs are re-arranged to match pd.read_csv

use :func:`read_excel`

def read_excel(io, sheet_name=0, header=0, skiprows=None, nrows=None,
               skip_footer=0, index_col=None, names=None, parse_cols=None,
               true_values=None, false_values=None, engine=None,
               squeeze=False, **kwds):
Contributor

we normally don't like to shuffle parameters around in kwargs. Please align the ordering of these params as much as possible with how read_csv does it (obviously only include the current parameters).

Contributor Author

Sure, not a problem. I was using read_csv as a guide and saw that nrows was directly after skiprows, hence the change.

I will go ahead and line up the read_excel kwargs as closely as possible with the read_csv kwargs. I see 3 or 4 out of order with just a quick glance.
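One quick way to eyeball the ordering (a reviewer aid, not part of the PR) is to compare the two public signatures with inspect:

```python
import inspect
import pandas as pd

# Parameter names in declared order for each reader.
csv_order = list(inspect.signature(pd.read_csv).parameters)
excel_order = list(inspect.signature(pd.read_excel).parameters)

# Parameters the two readers share, listed in read_csv's order;
# any read_excel kwarg that appears out of this order stands out.
shared = [p for p in csv_order if p in excel_order]
print(shared)
```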

Contributor

ok great. prob need a slightly expanded whatsnew note to tell about this

Contributor Author

I rearranged the kwargs and also updated docstrings to match parameter order (one of the things that always bugs me). Should I also update all internal function kwargs in excel.py to match the external API?

I'm adding the following note to the Other Enhancements section (just wanted to make sure it was the right spot...):
- Rearranged the order of keyword arguments in :func:`read_excel()` to align with :func:`read_csv()`

Thanks!

Contributor

yes I would have the internal API match the external

expected = expected[:num_rows_to_pull]
tm.assert_frame_equal(actual, expected)

with pytest.raises(ValueError):
Contributor

@gfyoung don't we have an issue about nrows validation in the parser?

Member
@gfyoung Jun 11, 2017

We do, but I think you can circumvent the check via non read_csv and read_table calls (the check occurs in _read, which does not get hit on read_excel), which is annoying.
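The check being discussed is visible from the public API: read_csv goes through _read, so a non-integer ``nrows`` is rejected up front (sketch, assuming current pandas behaviour):

```python
from io import StringIO
import pandas as pd

# A non-integer nrows should fail validation with a ValueError
# before any parsing happens, rather than deep in the read loop.
caught = None
try:
    pd.read_csv(StringIO("a\n1\n2\n"), nrows="foo")
except ValueError as err:
    caught = err
print("rejected:", caught)
```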

Member

@alysivji : Put this as a separate test, and use tm.assert_raises_regex to check the specific error message raised in the ValueError (we want it to be that validation failed).

expected = pd.read_excel(os.path.join(self.dirpath,
'test1' + self.ext))
expected = expected[:num_rows_to_pull]
tm.assert_frame_equal(actual, expected)
Contributor

try to pull more rows than exist in the file as well.
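That edge case can be sketched against read_csv, which shares the ``nrows`` semantics: asking for more rows than the file holds should not raise, it should just return everything.

```python
from io import StringIO
import pandas as pd

# Only 3 data rows exist; nrows=100 simply returns all of them.
df = pd.read_csv(StringIO("a\n1\n2\n3\n"), nrows=100)
print(len(df))  # 3
```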

@jreback added the API Design and IO Excel (read_excel, to_excel) labels on Jun 11, 2017
@@ -999,6 +999,8 @@ def _failover_to_python(self):

def read(self, nrows=None):
if nrows is not None:
nrows = _validate_integer('nrows', nrows)
Member
@gfyoung Jun 11, 2017

Good catch (see my comment above). However, instead of littering our code with duplicate checks, here's what I think is best:

In your modified parsers.py, there should now be two places where nrows = _validate_integer(...) is called. Delete both of those.

Locate the read method for the TextFileReader in that same file, and add this validation check right before the if nrows is not None. That should work.

Contributor Author

I originally didn't have this line there, but then I was getting:
TypeError: '>=' not supported between instances of 'int' and 'str'
instead of the expected:
ValueError: 'nrows' must be an integer >=0
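The TypeError comes from Python 3 refusing ordered comparisons between int and str, which is what the read loop effectively hits when ``nrows`` arrives as an unvalidated string (minimal illustration):

```python
# Python 3: int vs str comparison raises TypeError, so an unvalidated
# nrows="foo" blows up deep in the parser instead of failing cleanly
# at the API boundary with a ValueError.
error = None
try:
    0 >= "foo"
except TypeError as err:
    error = err
print(type(error).__name__)  # TypeError
```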

Contributor

yes this change should be done in the parser itself. See if you can come up with an example that ONLY uses pd.read_csv directly.

Contributor Author

Where should I put this test? There are a bunch in `tests/io/parser`, but nothing for read_csv directly.

@@ -82,6 +82,8 @@
Rows to skip at the beginning (0-indexed)
skip_footer : int, default 0
Rows at the end to skip (0-indexed)
nrows : int, default None
Number of rows to parse
Contributor

versionadded 0.21.0 tag
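The requested tag would make the docstring entry look something like this (numpydoc style; exact wording illustrative):

```rst
nrows : int, default None
    Number of rows to parse.

    .. versionadded:: 0.21.0
```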

@TomAugspurger added this to the 0.21.0 milestone on Jun 30, 2017
@jreback (Contributor) commented Jul 19, 2017

can you rebase/update

@pep8speaks commented Jul 19, 2017

Hello @alysivji! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on July 19, 2017 at 12:56 UTC

@codecov (bot) commented Jul 19, 2017

Codecov Report

Merging #16672 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #16672      +/-   ##
==========================================
+ Coverage   90.93%   90.93%   +<.01%     
==========================================
  Files         161      161              
  Lines       49269    49270       +1     
==========================================
+ Hits        44802    44803       +1     
  Misses       4467     4467
Flag       Coverage Δ
#multiple  88.69% <100%> (ø) ⬆️
#single    40.22% <60%> (-0.01%) ⬇️

Impacted Files         Coverage Δ
pandas/io/excel.py     80.55% <100%> (ø) ⬆️
pandas/io/parsers.py   95.43% <100%> (ø) ⬆️

Last update 18a428d...f1a6740.

@codecov (bot) commented Jul 19, 2017

Merging #16672 into master will increase coverage by <.01%. The diff coverage is 100%. Last update 18a428d...ef52114.

@jreback (Contributor) commented Sep 10, 2017

can you rebase / update

@alysivji (Contributor Author)

Sure, I got some time this week. Will take a look at it.

@jreback removed this from the 0.21.0 milestone on Sep 23, 2017
@jreback (Contributor) commented Sep 23, 2017

pls rebase

@jreback (Contributor) commented Nov 10, 2017

closing as stale. if you'd like to continue, pls ping.

@alysivji (Contributor Author)

I got some time and picked this back up. Sent another PR (#18507).

Thanks.

Labels: API Design · IO Excel (read_excel, to_excel)
Linked issue: Pandas read_excel: only read first few lines
5 participants