BadZipFile error when using read_excel on .xlsx #26813

Closed
OD1995 opened this issue Jun 12, 2019 · 13 comments

Comments

@OD1995

OD1995 commented Jun 12, 2019

Code Sample, a copy-pastable example if possible

UKregions = pd.read_excel(r"K:\Sport\Sponsors\Lookups.xlsx",sheet_name="UKregions")

Problem description

Traceback (most recent call last):

  File "<ipython-input-3-acf52be7bb80>", line 1, in <module>
    UKregions0 = pd.read_excel(r"K:\Sport\Sponsors\Adidas\2018 - PSOV\Lookups.xlsx",sheet_name="UKregions")

  File "E:\ANACONDA\lib\site-packages\pandas\util\_decorators.py", line 188, in wrapper
    return func(*args, **kwargs)

  File "E:\ANACONDA\lib\site-packages\pandas\util\_decorators.py", line 188, in wrapper
    return func(*args, **kwargs)

  File "E:\ANACONDA\lib\site-packages\pandas\io\excel.py", line 350, in read_excel
    io = ExcelFile(io, engine=engine)

  File "E:\ANACONDA\lib\site-packages\pandas\io\excel.py", line 653, in __init__
    self._reader = self._engines[engine](self._io)

  File "E:\ANACONDA\lib\site-packages\pandas\io\excel.py", line 424, in __init__
    self.book = xlrd.open_workbook(filepath_or_buffer)

  File "E:\ANACONDA\lib\site-packages\xlrd\__init__.py", line 117, in open_workbook
    zf = zipfile.ZipFile(filename)

  File "E:\ANACONDA\lib\zipfile.py", line 1131, in __init__
    self._RealGetContents()

  File "E:\ANACONDA\lib\zipfile.py", line 1198, in _RealGetContents
    raise BadZipFile("File is not a zip file")

BadZipFile: File is not a zip file

Expected Output

A pandas DataFrame

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.8.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en
LOCALE: None.None

pandas: 0.24.2
pytest: 4.6.2
pip: 19.1.1
setuptools: 41.0.1
Cython: 0.29.10
numpy: 1.16.4
scipy: 1.2.1
pyarrow: None
xarray: None
IPython: 7.5.0
sphinx: 2.1.0
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: 1.2.1
tables: 3.5.2
numexpr: 2.6.9
feather: None
matplotlib: 3.1.0
openpyxl: 2.6.2
xlrd: 1.2.0
xlwt: 1.3.0
xlsxwriter: 1.1.8
lxml.etree: 4.3.3
bs4: 4.7.1
html5lib: 1.0.1
sqlalchemy: 1.3.4
pymysql: None
psycopg2: None
jinja2: 2.10.1
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@TomAugspurger
Contributor

TomAugspurger commented Jun 12, 2019 via email

@OD1995
Author

OD1995 commented Jun 13, 2019

@TomAugspurger No, code and error below:

Code:

A = xlrd.open_workbook(r"K:\Sport\Sponsors\Lookups.xlsx")

Error:

Traceback (most recent call last):

  File "<ipython-input-7-fd902e6b151e>", line 1, in <module>
    odf = xlrd.open_workbook(lookupsPath)

  File "E:\ANACONDA\lib\site-packages\xlrd\__init__.py", line 117, in open_workbook
    zf = zipfile.ZipFile(filename)

  File "E:\ANACONDA\lib\zipfile.py", line 1131, in __init__
    self._RealGetContents()

  File "E:\ANACONDA\lib\zipfile.py", line 1198, in _RealGetContents
    raise BadZipFile("File is not a zip file")

BadZipFile: File is not a zip file

@WillAyd
Member

WillAyd commented Jun 13, 2019

Unfortunately this isn’t a pandas issue - you would have to open an issue with xlrd.

@WillAyd WillAyd closed this as completed Jun 13, 2019
@cjw296
Contributor

cjw296 commented Jun 13, 2019

@WillAyd - please can you stop offloading your project's problems onto me?

I've made the position with xlrd pretty clear:
#26487 (the issue you closed)
#11499

Thanks.

@cjw296
Contributor

cjw296 commented Jun 13, 2019

@TomAugspurger / @WillAyd - more frustratingly, this is exactly the kind of non-issue that I'd like to avoid having dumped on the xlrd project: the error's pretty clear - if the file can't even be unzipped there is zero chance it's a valid xlsx, so why just say "well, this must be an xlrd problem, please go complain there"?
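The point above, that a file which cannot be unzipped cannot be a valid xlsx, can be checked before calling read_excel. The following is only a minimal sketch using the standard-library zipfile module; the helper name `looks_like_xlsx` is hypothetical, and the test for a "[Content_Types].xml" entry is an assumption about what a well-formed workbook package contains, not something stated in this thread.

```python
import zipfile


def looks_like_xlsx(path):
    """Return True if `path` is a ZIP archive that carries a content-types part.

    Hypothetical helper: a cheap pre-check, not a full workbook validation.
    """
    # An .xlsx workbook is a ZIP container, so a non-ZIP file cannot be one.
    if not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as archive:
        # Assumption: a well-formed workbook package has this entry at its root.
        return "[Content_Types].xml" in archive.namelist()
```

If this returns False for the file that fails in read_excel, the problem is with the file itself rather than with pandas or xlrd.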

@TomAugspurger
Contributor

TomAugspurger commented Jun 13, 2019 via email

@cjw296
Contributor

cjw296 commented Jun 13, 2019 via email

@TomAugspurger
Contributor

TomAugspurger commented Jun 13, 2019 via email

@WillAyd
Member

WillAyd commented Jun 13, 2019

Hey @cjw296! Just want to clarify my intention here - I'm not trying to "offload problems" or create more work. I'm just trying to route issues to where they can be definitively addressed. Sure, I had an inkling that this is probably just an issue with the user's file, but at the same time I don't know xlrd's code base, nor am I familiar enough with the structure of Excel files to say for sure.

With regard to your position on xlrd, there's been a lot of work recently to decouple that dependency and get openpyxl in as a valid reader (you can check the label IO Excel in our PR tracker if you care to see). A lot of the code here has been contributed by a variety of users over the course of years, so it's one of those things where it is probably more work than you would think or hope for to get openpyxl into place.

Getting something else besides xlrd for reading isn't a request we are ignoring, but like anything else it's just taking a little bit of time to get there. Your patience while we work through that is certainly appreciated (from the PR you linked you should see we are getting closer), and obviously if you have any particular contributions you'd like to make via PRs or reviews we would love that as well.

@cjw296
Contributor

cjw296 commented Jun 13, 2019

@WillAyd - the problem with xlrd, and it's one of the things that has burned me and John out, is dealing with careless users who can't be bothered to read exceptions or even check they have a valid excel file before complaining. After that, it becomes people who want it not to be their problem that their source data is corrupted and invalid, and because some other library or program happens to be able to deal with their corrupt data, they demand it be fixed at our expense in xlrd.

So, high level:

  • If people have xlsx, tell them to use openpyxl. Do not complain on the xlrd tracker about xlsx issues; many more and it will persuade me to actually rip out the xlsx support completely and do a new release.

  • If it's xls, assume it's a data issue. Even if it's not, the chances of it being fixed in xlrd without the OP supplying a PR with a simple, sanitized example file along with unit tests are pretty much zero.

  • Please try to dissuade people who complain on this tracker from opening issues against xlrd. If they're not clued up enough to go to the xlrd codebase in the first place, they're unlikely to get any joy by you directing them there.

I'm sorry to have to be so blunt about this, but myself and John have tried the subtle approach over the years and it hasn't worked.
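The routing described in the bullets above (openpyxl for xlsx, xlrd for xls) could be sketched as a small lookup. This is an illustration only: `pick_engine` and `ENGINE_BY_EXTENSION` are hypothetical names, and passing the result as `engine=` assumes a pandas release whose read_excel accepts "openpyxl", which the thread describes as still in progress at the time.

```python
import os

# Assumption: these strings match the values accepted by the `engine`
# parameter of pandas.read_excel in a release with openpyxl read support.
ENGINE_BY_EXTENSION = {
    ".xlsx": "openpyxl",
    ".xls": "xlrd",
}


def pick_engine(path):
    """Map a workbook path to a reader engine name based on its extension."""
    extension = os.path.splitext(path)[1].lower()
    try:
        return ENGINE_BY_EXTENSION[extension]
    except KeyError:
        raise ValueError(f"unsupported Excel extension: {extension!r}")
```

A call might then look like pd.read_excel(path, engine=pick_engine(path)), again assuming the installed pandas supports that engine.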

@goforaditya

For me, switching to the newest version of Python and Anaconda resolved the issue. I got the error while working on Python 2.7 and have now updated to 3.7.4.

@SharnamK

SharnamK commented Apr 27, 2021

Perform these quick sanity checks:

  1. Open the excel file. Is the data appearing correctly?
  2. Are you able to see the file size in the file's details in Windows Explorer?

In my case, I manually checked the Excel file's content and it turned out to be empty because I was not storing the file correctly. Once I fixed this, the "File is not a zip file" error was resolved.
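The sanity checks above can also be done from Python. The sketch below reports the file size and guesses the container type from the leading bytes; `describe_file` and `SIGNATURES` are hypothetical names, and the signature table is a small illustrative selection rather than a complete list.

```python
import os

# Leading bytes of a few container formats that can hide behind an Excel name.
SIGNATURES = {
    b"PK\x03\x04": "zip container (what an .xlsx should be)",
    b"\xd0\xcf\x11\xe0": "OLE2 compound file (e.g. legacy .xls)",
    b"\x1f\x8b": "gzip stream",
}


def describe_file(path):
    """Return a short description of the file's size and apparent format."""
    size = os.path.getsize(path)
    if size == 0:
        return "empty file"
    with open(path, "rb") as handle:
        head = handle.read(8)
    for magic, label in SIGNATURES.items():
        if head.startswith(magic):
            return f"{size} bytes, {label}"
    return f"{size} bytes, unrecognised signature"
```

An "empty file" result corresponds to the case described in this comment, where the workbook was never written correctly in the first place.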

@ggprod

ggprod commented Dec 23, 2021

I recently ran into a similar issue. I had uploaded an .xlsx file to Google Cloud Storage and used the pandas.read_excel method to read it, passing the Google Cloud Storage location. This works fine when the file is uploaded normally to GCS but fails if the same file is uploaded to GCS gzip-compressed.
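One possible workaround for the gzip case, assuming the bytes reach Python still compressed (the comment does not confirm the exact cause), is to decompress them before handing them to the reader. The helper name `as_workbook_buffer` is hypothetical; only the standard-library gzip and io modules are used.

```python
import gzip
import io

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def as_workbook_buffer(raw):
    """Return a BytesIO over `raw`, gunzipping it first when it looks gzipped."""
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    return io.BytesIO(raw)
```

The returned buffer could then be passed to pandas.read_excel, which accepts file-like objects as well as paths. How the raw bytes are downloaded from the bucket is left out here on purpose.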


7 participants