
ENH: Using nrows option while processing xlsb files #39518


Closed
code-R opened this issue Feb 1, 2021 · 3 comments
Labels
Duplicate Report (duplicate issue or pull request), IO Excel (read_excel, to_excel), Performance (memory or execution speed)

Comments

@code-R

code-R commented Feb 1, 2021

Is your feature request related to a problem?

I wish pandas could apply nrows eagerly while processing xlsx or xlsb files.

https://github.com/pandas-dev/pandas/blob/master/pandas/io/excel/_pyxlsb.py#L73
If we look at this code, it processes all the sheet data first and only applies nrows afterwards, which is very slow for large files. In one of my xlsb files, there are only around 3k records of actual data, but the reader looped over about 100k rows (presumably because of xlsb metadata or similar), so even though I specified nrows as 3k, it had to wait for all 100k rows to be processed.
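To make the cost difference concrete, here is an illustrative simulation (not the actual pandas/pyxlsb code path): a counting row iterator stands in for the sheet, and `itertools.islice` stands in for breaking the loop early. The `CountingRows` class and the row counts are hypothetical.

```python
from itertools import islice

class CountingRows:
    """Fake sheet-row iterator that counts how many rows were pulled."""
    def __init__(self, total):
        self.total = total
        self.consumed = 0

    def __iter__(self):
        for i in range(self.total):
            self.consumed += 1
            yield [i]

# Current behaviour: materialize all 100_000 rows, then keep the first 3_000.
eager = CountingRows(100_000)
head_eager = list(eager)[:3_000]

# Proposed behaviour: stop pulling rows once nrows rows have been read.
lazy = CountingRows(100_000)
head_lazy = list(islice(lazy, 3_000))

print(eager.consumed)  # 100000
print(lazy.consumed)   # 3000
```

Both approaches return the same 3,000 rows; only the number of rows actually pulled from the source differs.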

Describe the solution you'd like

If we could pass nrows to get_sheet_data and break the loop once it reaches nrows rows, that would be better.

API breaking implications

I am not really sure.

Describe alternatives you've considered

I don't have alternatives at the moment; I would like to hear suggestions about this.

Additional context


df = pd.read_excel("sample.xlsb", nrows=100, engine="pyxlsb")
def get_sheet_data(self, sheet, convert_float: bool, nrows: int) -> List[List[Scalar]]:
    res = []
    for index, r in enumerate(sheet.rows(sparse=False)):
        res.append([self._convert_cell(c, convert_float) for c in r])
        # index is 0-based, so index + 1 rows have been appended so far;
        # stop as soon as nrows rows are collected.
        if index + 1 >= nrows:
            break

    return res

We can set the default value of nrows to infinity (or None) in case we don't want to break early for other callers.
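A self-contained sketch of that default, assuming a pyxlsb-like sheet object exposing a rows() iterator; `get_sheet_data` and `FakeSheet` below are illustrative stand-ins, not the actual pandas implementation:

```python
from typing import Any, Iterator, List, Optional

class FakeSheet:
    """Minimal stand-in for a pyxlsb worksheet."""
    def __init__(self, data: List[List[Any]]):
        self._data = data

    def rows(self) -> Iterator[List[Any]]:
        return iter(self._data)

def get_sheet_data(sheet: FakeSheet, nrows: Optional[int] = None) -> List[List[Any]]:
    res: List[List[Any]] = []
    for index, row in enumerate(sheet.rows()):
        res.append(list(row))
        # index is 0-based, so index + 1 rows have been collected so far.
        if nrows is not None and index + 1 >= nrows:
            break
    return res

sheet = FakeSheet([[1], [2], [3], [4], [5]])
print(get_sheet_data(sheet, nrows=2))  # [[1], [2]]
print(get_sheet_data(sheet))           # [[1], [2], [3], [4], [5]]
```

Using None as the default (rather than a sentinel like infinity) keeps the no-limit path free of any comparison against a magic number.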

@code-R code-R added Enhancement and Needs Triage labels Feb 1, 2021
@code-R
Author

code-R commented Feb 1, 2021

This is my first time opening an issue in pandas; if I have made any mistakes or the issue is not clear, please let me know and I can make the necessary corrections.

@lithomas1 lithomas1 added IO Excel and Performance labels Feb 2, 2021
@lithomas1
Member

lithomas1 commented Feb 2, 2021

@code-R
Thanks for opening this feature request! Adding nrows to read_excel seems reasonable to me. Would you be interested in putting up a PR for this? (#35974 previously did this but was rolled back)
[Edit]: Looking at this again, it seems this issue is a duplicate of #32727. A PR to optimize nrows for all parsers is still welcome, though.

@lithomas1 lithomas1 removed the Needs Triage label Feb 2, 2021
@lithomas1 lithomas1 added this to the Contributions Welcome milestone Feb 2, 2021
@lithomas1 lithomas1 removed this from the Contributions Welcome milestone Feb 3, 2021
@lithomas1 lithomas1 added the Duplicate Report label Feb 3, 2021
@MarcoGorelli
Member

Thanks @lithomas1 for triaging - closing as duplicate then

No branches or pull requests

3 participants