
BUG: read_excel blows the memory when using openpyxl engine #40569

Closed
@liyucheng09

Description



Code Sample, a copy-pastable example

Python 3.9.1 | packaged by conda-forge | (default, Dec  9 2020, 01:07:47) 
[Clang 11.0.0 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> pd.__version__
'1.2.0'
>>> df=pd.read_excel('full_data.xlsx')

Problem description

I am not quite sure how to describe the bug; the code simply hangs when I run pd.read_excel('full_data.xlsx'). I found that this line consumes a significant amount of memory (almost 14 GB, while the .xlsx file is only 9 MB).

I suspect this results from read_excel now using openpyxl as the default engine on Python 3.9. Loading this file on Python 3.8 works fine.

>>> from openpyxl import load_workbook
>>> wb=load_workbook('full_data.xlsx')
>>> df=pd.DataFrame(wb['Sheet1'].values)

The above code also leads to the same issue.
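As a possible workaround (a sketch only, not verified against the reporter's file), openpyxl can open workbooks in read-only streaming mode via load_workbook(..., read_only=True), which avoids materializing a cell object for every cell and keeps memory roughly proportional to a single row. The rows can then be fed to a DataFrame directly. The small sample workbook below is generated in the snippet so it is self-contained; in the report this would be 'full_data.xlsx':

```python
import pandas as pd
from openpyxl import Workbook, load_workbook

# Build a small sample file so the example is self-contained;
# in the report this would be 'full_data.xlsx'.
wb = Workbook()
ws = wb.active
ws.append(["a", "b"])
for i in range(5):
    ws.append([i, i * 2])
wb.save("sample.xlsx")

# read_only=True streams rows lazily instead of loading the whole
# worksheet into cell objects up front.
wb = load_workbook("sample.xlsx", read_only=True)
rows = wb["Sheet"].iter_rows(values_only=True)
header = next(rows)           # first row holds the column names
df = pd.DataFrame(rows, columns=header)
wb.close()                    # read-only workbooks should be closed explicitly
print(df.shape)               # (5, 2)
```

Whether this sidesteps the memory blow-up reported here is untested; it only changes how openpyxl reads the file, not what pandas does with the result.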


Labels: Bug, Needs Triage (issue that has not been reviewed by a pandas team member)
