ENH: Using nrows option while processing xlsb files #39518
Labels
Duplicate Report
Duplicate issue or pull request
IO Excel
read_excel, to_excel
Performance
Memory or execution speed performance
Is your feature request related to a problem?
I wish pandas could apply nrows eagerly while processing xlsx or xlsb files.
https://github.com/pandas-dev/pandas/blob/master/pandas/io/excel/_pyxlsb.py#L73
Looking at this code, it processes all of the sheet data and only applies nrows afterwards, which is very slow for large files. In one of my xlsb files there are around 3k records, but the reader looped over around 100k rows (possibly because of xlsb metadata or something similar). Even though I specified nrows as 3k, I had to wait for all 100k rows to be processed.
Describe the solution you'd like
If we could pass nrows to get_sheet_data and break out of the loop once it has read nrows rows, that would be better.
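A minimal sketch of the idea, not the actual pandas code: `get_sheet_data_sketch` and `fake_sheet_rows` are hypothetical names, and the row iterator only stands in for whatever the xlsb reader yields. It assumes rows arrive lazily, so stopping early avoids touching the rest of the sheet. A real change would probably also need to account for header and skipped rows when deciding how many rows to read.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List, Optional


def fake_sheet_rows(total: int, counter: Dict[str, int]) -> Iterator[List[object]]:
    """Hypothetical stand-in for a lazy row iterator over a sheet."""
    for i in range(total):
        counter["read"] += 1  # track how many rows were actually processed
        yield [i, f"value-{i}"]


def get_sheet_data_sketch(
    rows: Iterable[List[object]], nrows: Optional[int] = None
) -> List[List[object]]:
    """Collect rows, stopping as soon as nrows rows have been read."""
    if nrows is None:
        # No limit given: read everything, as happens today.
        return [list(row) for row in rows]
    # islice stops pulling from the iterator after nrows items.
    return [list(row) for row in islice(rows, nrows)]


counter = {"read": 0}
data = get_sheet_data_sketch(fake_sheet_rows(100_000, counter), nrows=3_000)
print(len(data), counter["read"])
```

With the limit applied inside the loop, only the requested rows are pulled from the iterator instead of the whole sheet.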
API breaking implications
I am not really sure.
Describe alternatives you've considered
I don't have any alternatives at the moment and would like to hear suggestions.
Additional context
We could give nrows a default value of infinity (no limit), so that the loop never breaks early for callers who don't pass it.