
ENH: Using nrows option while processing xlsb files #39518


Closed
code-R opened this issue Feb 1, 2021 · 3 comments
Labels
Duplicate Report (duplicate issue or pull request), IO Excel (read_excel, to_excel), Performance (memory or execution speed)

Comments

@code-R

code-R commented Feb 1, 2021

Is your feature request related to a problem?

I wish pandas could apply nrows eagerly while processing xlsx or xlsb files.

https://github.com/pandas-dev/pandas/blob/master/pandas/io/excel/_pyxlsb.py#L73
If we look at this code, it processes all the sheet data first and only applies nrows afterwards, which is very slow for large files. In one of my xlsb files, there are only around 3k records of actual data, but the reader looped over about 100k rows (presumably because of xlsb metadata or similar), so even though I specified nrows as 3k, it had to wait for all 100k rows to be processed.
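To make the cost difference concrete, here is an illustrative simulation (not the actual pandas/pyxlsb code path): a counting row iterator stands in for the sheet, and `itertools.islice` stands in for breaking the loop early. The `CountingRows` class and the row counts are hypothetical.

```python
from itertools import islice

class CountingRows:
    """Fake sheet-row iterator that counts how many rows were pulled."""
    def __init__(self, total):
        self.total = total
        self.consumed = 0

    def __iter__(self):
        for i in range(self.total):
            self.consumed += 1
            yield [i]

# Current behaviour: materialize all 100_000 rows, then keep the first 3_000.
eager = CountingRows(100_000)
head_eager = list(eager)[:3_000]

# Proposed behaviour: stop pulling rows once nrows rows have been read.
lazy = CountingRows(100_000)
head_lazy = list(islice(lazy, 3_000))

print(eager.consumed)  # 100000
print(lazy.consumed)   # 3000
```

Both approaches return the same 3,000 rows; only the number of rows actually pulled from the source differs.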

Describe the solution you'd like

If we could pass nrows to get_sheet_data and break the loop once it reaches nrows rows, that would be better.

API breaking implications

I am not really sure.

Describe alternatives you've considered

I don't have alternatives at the moment; I would like to hear suggestions about this.

Additional context


df = pd.read_excel("sample.xlsb", nrows=100, engine="pyxlsb")
def get_sheet_data(self, sheet, convert_float: bool, nrows: int) -> List[List[Scalar]]:
    res = []
    for index, r in enumerate(sheet.rows(sparse=False)):
        res.append([self._convert_cell(c, convert_float) for c in r])
        # index is 0-based, so index + 1 rows have been appended so far;
        # stop as soon as nrows rows are collected.
        if index + 1 >= nrows:
            break

    return res

We can set the default value of nrows to infinity (or None) in case we don't want to break early for other callers.
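A self-contained sketch of that default, assuming a pyxlsb-like sheet object exposing a rows() iterator; `get_sheet_data` and `FakeSheet` below are illustrative stand-ins, not the actual pandas implementation:

```python
from typing import Any, Iterator, List, Optional

class FakeSheet:
    """Minimal stand-in for a pyxlsb worksheet."""
    def __init__(self, data: List[List[Any]]):
        self._data = data

    def rows(self) -> Iterator[List[Any]]:
        return iter(self._data)

def get_sheet_data(sheet: FakeSheet, nrows: Optional[int] = None) -> List[List[Any]]:
    res: List[List[Any]] = []
    for index, row in enumerate(sheet.rows()):
        res.append(list(row))
        # index is 0-based, so index + 1 rows have been collected so far.
        if nrows is not None and index + 1 >= nrows:
            break
    return res

sheet = FakeSheet([[1], [2], [3], [4], [5]])
print(get_sheet_data(sheet, nrows=2))  # [[1], [2]]
print(get_sheet_data(sheet))           # [[1], [2], [3], [4], [5]]
```

Using None as the default (rather than a sentinel like infinity) keeps the no-limit path free of any comparison against a magic number.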

@code-R code-R added Enhancement and Needs Triage labels Feb 1, 2021
@code-R
Author

code-R commented Feb 1, 2021

This is my first time opening an issue in pandas; if I have made any mistakes or the issue is not clear, please let me know and I can make the necessary corrections.

@lithomas1 lithomas1 added IO Excel and Performance labels Feb 2, 2021
@lithomas1
Member

lithomas1 commented Feb 2, 2021

@code-R
Thanks for opening this feature request! Adding nrows to read_excel seems reasonable to me. Would you be interested in putting up a PR for this? (#35974 previously did this but was rolled back)
[Edit]: Looking at this again, it seems this issue is a duplicate of #32727. A PR to optimize nrows for all parsers is still welcome, though.

@lithomas1 lithomas1 removed the Needs Triage label Feb 2, 2021
@lithomas1 lithomas1 added this to the Contributions Welcome milestone Feb 2, 2021
@lithomas1 lithomas1 removed this from the Contributions Welcome milestone Feb 3, 2021
@lithomas1 lithomas1 added the Duplicate Report label Feb 3, 2021
@MarcoGorelli
Member

Thanks @lithomas1 for triaging - closing as duplicate then

No branches or pull requests

3 participants