ENH read_excel error when accessing AWS S3 URL #11447

Closed
kingishb opened this issue Oct 27, 2015 · 2 comments · Fixed by #11714
Labels
IO Data (IO issues that don't fit into a more specific label), IO Excel (read_excel, to_excel)
Milestone

Comments

@kingishb

Summary: read_excel cannot read a file from an S3 URL, even though read_csv accepts the same URL syntax. read_excel should support accessing S3 data in the same manner as read_csv.

read_excel fails with the following error:

>>> import pandas as pd
>>> df = pd.read_excel("s3://my-bucket/my_file.xlsx")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib64/python2.6/site-packages/pandas/io/excel.py", line 163, in read_excel
    io = ExcelFile(io, engine=engine)
  File "/usr/local/lib64/python2.6/site-packages/pandas/io/excel.py", line 206, in __init__
    self.book = xlrd.open_workbook(io)
  File "/usr/local/lib/python2.6/site-packages/xlrd/__init__.py", line 394, in open_workbook
    f = open(filename, "rb")
IOError: [Errno 2] No such file or directory: 's3://my-bucket/my_file.xlsx'
>>> 

read_csv on the other hand is able to successfully read a csv file in the same S3 bucket using the same URL syntax:

>>> import pandas as pd
>>> df = pd.read_csv("s3://my-bucket/my_file.csv")
>>> len(df.index)
1187
>>>

For the record, read_csv can also see the xlsx file but returns parse errors when attempting to tokenize the data.

>>> import pandas as pd
>>> df = pd.read_csv("s3://my-bucket/my_file.xlsx")
Exception pandas.parser.CParserError: CParserError('Error tokenizing data. C error: Expected 9 fields in line 210, saw 10\n',) in 'pandas.parser.TextReader._tokenize_rows' ignored
>>> 

read_excel successfully reads and parses a local copy of the xlsx file:

>>> import pandas as pd
>>> df = pd.read_excel("my_file.xlsx")
>>> len(df.index)
221
>>> 
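
Since the local read works while the S3 URL is handed to `open()` as if it were a file path, one possible workaround is to fetch the object's bytes first and pass an in-memory buffer instead. The sketch below is only an illustration under stated assumptions: `excel_source` and `fetch_bytes` are hypothetical names that do not appear in pandas, `fetch_bytes` stands for whatever S3 client call you use to download an object, and the code is written for Python 3 rather than the Python 2.6 shown in this report.

```python
import io


def excel_source(path, fetch_bytes):
    """Return an input suitable for an Excel reader.

    For s3:// URLs, download the object with the caller-supplied
    `fetch_bytes(url) -> bytes` (hypothetical) and wrap the bytes in a
    BytesIO buffer. Any other input is returned unchanged.
    """
    if isinstance(path, str) and path.startswith("s3://"):
        return io.BytesIO(fetch_bytes(path))
    return path
```

Assuming read_excel accepts file-like objects (its `io` argument is documented as taking them), the result could then be passed along as `pd.read_excel(excel_source("s3://my-bucket/my_file.xlsx", fetch_bytes))`.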

Pandas version string and dependencies:

>>> pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.6.9.final.0
python-bits: 64
OS: Linux
OS-release: 3.14.48-33.39.amzn1.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.17.0
nose: 1.3.4
pip: 6.1.1
setuptools: 12.2
Cython: None
numpy: 1.10.1
scipy: 0.16.0
statsmodels: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: 0.9.4
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
>>> 
@jreback changed the title from "BugReport: read_excel error when accessing AWS S3 URL" to "ENH read_excel error when accessing AWS S3 URL" on Oct 27, 2015
@jreback added the IO Data, Difficulty Novice, and IO Excel labels on Oct 27, 2015
@jreback added this to the Next Major Release milestone on Oct 27, 2015
@jreback
Contributor

jreback commented Oct 27, 2015

This is almost a trivial enhancement: just add an _is_s3_url check here.

Post a file link as an example and I'll put it on our test S3 bucket.
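
For readers unfamiliar with what such a check involves, here is a minimal sketch of a scheme test for S3 URLs. It is not the pandas helper itself: `is_s3_url` and `split_s3_url` are hypothetical names written for illustration with the Python 3 standard library, and the actual `_is_s3_url` in pandas may differ.

```python
from urllib.parse import urlparse


def is_s3_url(path):
    """Return True when `path` is a string using the s3:// scheme."""
    if not isinstance(path, str):
        return False
    return urlparse(path).scheme == "s3"


def split_s3_url(url):
    """Split 's3://bucket/key' into a (bucket, key) pair."""
    parsed = urlparse(url)
    # The bucket is the network location; the key is the path
    # without its leading slash.
    return parsed.netloc, parsed.path.lstrip("/")
```

A reader function could call a check like this before opening its input, and route S3 URLs through a download step instead of treating them as local paths.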

@parsleyt
Contributor

@kingishb opened up this issue on my behalf. My recent pull request #11714 should resolve the issue.

The attached sample Excel file can be used for testing purposes: sample.xlsx. See the pull request for details on the test files.

@jreback modified the milestones: 0.18.0, Next Major Release on Nov 29, 2015
parsleyt added a commit to parsleyt/pandas that referenced this issue Dec 1, 2015
jreback added a commit that referenced this issue Dec 1, 2015
ENH: Add support for s3 in read_excel #11447
3 participants