Read excel nrows #16672

Closed · wants to merge 5 commits
Conversation

alysivji
Contributor

@@ -38,6 +38,7 @@ Other Enhancements
- :func:`read_feather` has gained the ``nthreads`` parameter for multi-threaded operations (:issue:`16359`)
- :func:`DataFrame.clip()` and :func: `Series.cip()` have gained an inplace argument. (:issue: `15388`)
- :func:`crosstab` has gained a ``margins_name`` parameter to define the name of the row / column that will contain the totals when margins=True. (:issue:`15972`)
- ``pd.read_excel()`` has a ``nrows`` parameter (:issue:`16645`)
Member

How about:

pd.read_excel() has gained the nrows parameter (...)
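For context, a minimal sketch of what the new parameter enables (illustrative, not from the PR). Since ``nrows`` on ``read_excel`` is meant to mirror ``read_csv``, the CSV reader (where ``nrows`` already exists) shows the intended behaviour without needing an Excel engine installed:

```python
from io import StringIO
import pandas as pd

# nrows limits parsing to the first N data rows of the input,
# which is the behaviour read_excel is gaining here.
data = StringIO("a,b\n1,2\n3,4\n5,6\n7,8\n")
df = pd.read_csv(data, nrows=2)
print(len(df))  # 2
```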

Contributor

you can make another entry (with this PR number) in the api_breaking section noting that the kwargs are re-arranged to match pd.read_csv

use :func:`read_excel`

def read_excel(io, sheet_name=0, header=0, skiprows=None, nrows=None,
               skip_footer=0, index_col=None, names=None, parse_cols=None,
               true_values=None, false_values=None, engine=None,
               squeeze=False, **kwds):
Contributor

we normally don't like to shuffle parameters around in kwargs. Please align the ordering of these params as much as possible with how read_csv does it (obviously only include the current parameters).

Contributor Author

Sure, not a problem. I was using read_csv as a guide and saw that nrows was directly after skiprows, hence the change.

I will go ahead and line up the read_excel kwargs as closely as possible with the read_csv kwargs. I see 3 or 4 out of order with just a quick glance.
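One quick way to eyeball the ordering (a reviewer aid, not part of the PR) is to compare the two public signatures with inspect:

```python
import inspect
import pandas as pd

# Parameter names in declared order for each reader.
csv_order = list(inspect.signature(pd.read_csv).parameters)
excel_order = list(inspect.signature(pd.read_excel).parameters)

# Parameters the two readers share, listed in read_csv's order;
# any read_excel kwarg that appears out of this order stands out.
shared = [p for p in csv_order if p in excel_order]
print(shared)
```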

Contributor

ok great. prob need a slightly expanded whatsnew note to tell about this

Contributor Author

I rearranged the kwargs and also updated docstrings to match parameter order (one of the things that always bugs me). Should I also update all internal function kwargs in excel.py to match the external API?

I'm adding the following note to the Other Enhancements section (just wanted to make sure it was the right spot...):
- Rearranged the order of keyword arguments in :func:`read_excel()` to align with :func:`read_csv()`

Thanks!

Contributor

yes I would have the internal API match the external

expected = expected[:num_rows_to_pull]
tm.assert_frame_equal(actual, expected)

with pytest.raises(ValueError):
Contributor

@gfyoung don't we have an issue about nrows validation in the parser?

Member
@gfyoung Jun 11, 2017

We do, but I think you can circumvent the check via non read_csv and read_table calls (the check occurs in _read, which does not get hit on read_excel), which is annoying.
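The check being discussed is visible from the public API: read_csv goes through _read, so a non-integer ``nrows`` is rejected up front (sketch, assuming current pandas behaviour):

```python
from io import StringIO
import pandas as pd

# A non-integer nrows should fail validation with a ValueError
# before any parsing happens, rather than deep in the read loop.
caught = None
try:
    pd.read_csv(StringIO("a\n1\n2\n"), nrows="foo")
except ValueError as err:
    caught = err
print("rejected:", caught)
```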

Member

@alysivji : Put this as a separate test, and use tm.assert_raises_regex to check the specific error message raised in the ValueError (we want it to be that validation failed).

expected = pd.read_excel(os.path.join(self.dirpath,
'test1' + self.ext))
expected = expected[:num_rows_to_pull]
tm.assert_frame_equal(actual, expected)
Contributor

try to pull more rows than exist in the file as well.
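That edge case can be sketched against read_csv, which shares the ``nrows`` semantics: asking for more rows than the file holds should not raise, it should just return everything.

```python
from io import StringIO
import pandas as pd

# Only 3 data rows exist; nrows=100 simply returns all of them.
df = pd.read_csv(StringIO("a\n1\n2\n3\n"), nrows=100)
print(len(df))  # 3
```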

@jreback added the API Design and IO Excel (read_excel, to_excel) labels on Jun 11, 2017
@@ -999,6 +999,8 @@ def _failover_to_python(self):

def read(self, nrows=None):
if nrows is not None:
nrows = _validate_integer('nrows', nrows)
Member
@gfyoung Jun 11, 2017

Good catch (see my comment above). However, instead of littering our code with duplicate checks, here's what I think is best:

In your modified parsers.py, there should now be two places where nrows = _validate_integer(...) is called. Delete both of those.

Locate the read method for the TextFileReader in that same file, and add this validation check right before the if nrows is not None. That should work.

Contributor Author

I originally didn't have this line there, but then I was getting:
TypeError: '>=' not supported between instances of 'int' and 'str'
instead of the expected:
ValueError: 'nrows' must be an integer >=0
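The TypeError comes from Python 3 refusing ordered comparisons between int and str, which is what the read loop effectively hits when ``nrows`` arrives as an unvalidated string (minimal illustration):

```python
# Python 3: int vs str comparison raises TypeError, so an unvalidated
# nrows="foo" blows up deep in the parser instead of failing cleanly
# at the API boundary with a ValueError.
error = None
try:
    0 >= "foo"
except TypeError as err:
    error = err
print(type(error).__name__)  # TypeError
```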

Contributor

yes this change should be done in the parser itself. See if you can come up with an example that ONLY uses pd.read_csv directly.

Contributor Author

Where should I put this test? There are a bunch in `tests/io/parser`, but nothing for read_csv directly.

@@ -82,6 +82,8 @@
Rows to skip at the beginning (0-indexed)
skip_footer : int, default 0
Rows at the end to skip (0-indexed)
nrows : int, default None
Number of rows to parse
Contributor

versionadded 0.21.0 tag
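The requested tag would make the docstring entry look something like this (numpydoc style; exact wording illustrative):

```rst
nrows : int, default None
    Number of rows to parse.

    .. versionadded:: 0.21.0
```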

@TomAugspurger added this to the 0.21.0 milestone on Jun 30, 2017
@jreback (Contributor) commented Jul 19, 2017

can you rebase/update

@pep8speaks commented Jul 19, 2017

Hello @alysivji! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on July 19, 2017 at 12:56 UTC

@codecov (bot) commented Jul 19, 2017

Codecov Report

Merging #16672 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #16672      +/-   ##
==========================================
+ Coverage   90.93%   90.93%   +<.01%     
==========================================
  Files         161      161              
  Lines       49269    49270       +1     
==========================================
+ Hits        44802    44803       +1     
  Misses       4467     4467
Flag       Coverage Δ
#multiple  88.69% <100%> (ø) ⬆️
#single    40.22% <60%> (-0.01%) ⬇️

Impacted Files         Coverage Δ
pandas/io/excel.py     80.55% <100%> (ø) ⬆️
pandas/io/parsers.py   95.43% <100%> (ø) ⬆️

Last update 18a428d...f1a6740.

@codecov (bot) commented Jul 19, 2017

Merging #16672 into master will increase coverage by <.01%. The diff coverage is 100%. Last update 18a428d...ef52114.

@jreback (Contributor) commented Sep 10, 2017

can you rebase / update

@alysivji (Contributor Author)

Sure, I got some time this week. Will take a look at it.

@jreback removed this from the 0.21.0 milestone on Sep 23, 2017
@jreback (Contributor) commented Sep 23, 2017

pls rebase

@jreback (Contributor) commented Nov 10, 2017

closing as stale. if you'd like to continue, pls ping.

@alysivji (Contributor Author)

I got some time and picked this back up. Sent another PR (#18507).

Thanks.

Labels: API Design · IO Excel (read_excel, to_excel)
Linked issue: Pandas read_excel: only read first few lines
5 participants