-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
PERF: Optimize read_excel nrows #46894
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice gain, especially in xlsx (I think that's pretty common). Do you know why you don't see similar gains in old/xls?
pandas/io/excel/_base.py
Outdated
header_rows += 1 | ||
if skiprows is None: | ||
return header_rows + nrows | ||
if isinstance(skiprows, int): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is_integer again
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls try to reuse as much of the parser validation code as possible
Thanks! In the case of xls, xlrd parses all of the data on first read, so there's no saving in I/O time, just a small saving inside get_sheet_data. On the other hand, I was surprised that ods showed so little improvement, especially since ods and xlsx are both zip files containing xml, as far as I understand. I'm also not sure why ods is so much slower than xlsx in the first place. Thanks for the great recommendations on this PR. I'll take a look at them in the coming days. |
@rhshadrach I've made the suggested changes and they were all huge improvements except maybe the type hints. I have one failing mypy error that I'm not sure how to fix. Also, in my last commit I added a bunch of |
pandas/io/excel/_base.py
Outdated
if skiprows is None: | ||
return header_rows + nrows | ||
if is_integer(skiprows): | ||
assert isinstance(skiprows, int) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the future we'll be able to have is_integer
be a typeguard to narrow down automatically, but for now you can either use # type: ignore[reason]
comments or var = cast(type, var)
to satisfy mypy. I'd recommend the latter here: skiprows = cast(int, skiprows)
. Similarly for the other blocks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the changes - looks really good. I like the use of is_list_like
, but I think we should be excluding sets here; can you pass allow_sets=False
.
if isinstance(self.header, (list, tuple, np.ndarray)): | ||
if not all(map(is_integer, self.header)): | ||
raise ValueError("header must be integer or list of integers") | ||
if any(i < 0 for i in self.header): | ||
raise ValueError( | ||
"cannot specify multi-index header with negative integers" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might be missing it, is validate_header_arg called somewhere instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but it took me a while to find it. It's called earlier in TextFileReader:
pandas/io/parsers/readers.py(1847)TextParser()
-> return TextFileReader(*args, **kwds)
pandas/io/parsers/readers.py(1412)__init__()
-> self.options, self.engine = self._clean_options(options, engine)
pandas/io/parsers/readers.py(1607)_clean_options()
-> validate_header_arg(options["header"])
Sets for skiprows currently work, so I've left that as is.
@rhshadrach I think this is ready for review. It didn't pass all tests, but failures were different the last two times, so I suspect that these failures aren't from this PR. Of course, let me know if I'm wrong about that. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few more allow_sets=False, otherwise looks good.
pandas/io/parsers/base_parser.py
Outdated
if not ( | ||
is_sequence | ||
is_list_like(self.index_col) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
allow_sets=False
pandas/io/parsers/base_parser.py
Outdated
raise ValueError( | ||
"cannot specify multi-index header with negative integers" | ||
) | ||
if is_list_like(self.header): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
allow_sets=False
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, @jreback for review
thanks @ahawryluk |
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.I timed read_excel on a file with 10 columns and 1000 rows, and report the best of 10 repeats (ms) below. This change can make a modest improvement on xls and ods files, and a significant improvement on xlsx and xlsb files. When
nrows=None
this has no measurable impact on the run time.Here are the results of
asv run -e -E existing --bench ReadExcel
showing similar results (the benchmark spreadsheet is different than the one above).