Skip to content

COMPAT: make read_excel accept path objects for the filepath (#12655) #12717

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

onesandzeroes
Copy link
Contributor

Fairly simple fix overall, since get_filepath_or_buffer is already set up to just convert the path objs to string, we can do that first and then just let the existing logic handle it.

Since the libraries that contain the path objects aren't hard dependencies, I'm not sure how to typecheck against them without first using the _IS_LIBRARY_INSTALLED checks and then importing if they're installed. Let me know if this should be done differently.

io = get_filepath_or_buffer(io)[0]
if _PY_PATH_INSTALLED:
from py.path import local as LocalPath
if isinstance(io, LocalPath):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need any of this
just call get_filepath_or_buffer (actually can prob eliminate almost all of the s3 checking in this function - this is all handled in that function)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling get_filepath_or_buffer() on io no matter what almost works, except for URLs where it grabs the URL and returns the data as StringIO, the data needs to be kept as bytes I think. Is that what you were thinking or am I missing something? I'm struggling to find a way to run io through get_filepath_or_buffer() without messing up the URL case or losing track of what we originally had.

The existing logic obviously works and we can keep it as long as we convert path objects to string before hitting that logic.

@onesandzeroes
Copy link
Contributor Author

OK, what if we just use the _stringify_path() function that already exists to avoid the ugly imports? I'm pretty sure ExcelFile needs to do slightly different things with URLs than what is done in get_filepath_or_buffer, in terms of getting the data as string vs bytes.

@@ -207,6 +210,10 @@ def __init__(self, io, **kwds):
if engine is not None and engine != 'xlrd':
raise ValueError("Unknown engine: %s" % engine)

# io can be a path object, just convert these to string (doesn't alter
# other objects)
io = _stringify_path(io)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I actually really like to make this work with get_filepath_or_buffer directly, rather than special casing (and using this here). IIRC you put a comment why this wasn't going to work. But is there a way to modify things so that we can get it to work?

Because this is the workhorse function and its a shame if anyhing has to deviate as things inevitably get missed.

@jreback jreback added Compat pandas objects compatability with Numpy or Python functions IO Excel read_excel, to_excel labels Mar 29, 2016
@jreback jreback added this to the 0.18.1 milestone Mar 29, 2016
@@ -169,3 +169,4 @@ Bug Fixes

- Bug in ``pivot_table`` when ``margins=True`` and ``dropna=True`` where nulls still contributed to margin count (:issue:`12577`)
- Bug in ``Series.name`` when ``name`` attribute can be a hashable type (:issue:`12610`)
- ``read_excel`` now accepts path objects for the file path, in line with other ``read_*`` functions (:issue:`12655)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need another backtick after the issue number

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also change path to pathlib.Path to be a bit clearer.

elif engine == 'xlrd' and isinstance(io, xlrd.Book):
# If io is a url, want to keep the data as bytes so can't pass
# to get_filepath_or_buffer()
if _is_url(io):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice. can we do this internally in get_filepath_or_buffer? (looks like it should just work)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like other cases where get_filepath_or_buffer() is used on URLs, getting the data back as StringIO is OK, so at the moment it grabs the data and converts to unicode automatically. We don't want that here, so I guess we would need a flag to tell it not to convert or something?

Couldn't figure out how to do it myself, so left it as a special case in ExcelFile.__init__.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it should return an encoded stream as BytesIO if needed (or StringIO) otherwise. These are hard cases and I believe they all work. If you cann't find a place where you have a failure don't make this a special case (as it would then need special tests and so on).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I don't special case URLs here, I get failures in URL related tests, get_filepath_or_buffer() returns the data as StringIO and reading from urls fails with errors like:

Traceback (most recent call last):
  File "/home/me/src/pandas/reading_tests.py", line 8, in <module>
    url_table = pd.read_excel(url)
  File "/home/me/src/pandas/pandas/io/excel.py", line 170, in read_excel
    io = ExcelFile(io, engine=engine)
  File "/home/me/src/pandas/pandas/io/excel.py", line 225, in __init__
    self.book = xlrd.open_workbook(file_contents=data)
  File "/home/me/anaconda3/lib/python3.5/site-packages/xlrd/__init__.py", line 441, in open_workbook
    ragged_rows=ragged_rows,
  File "/home/me/anaconda3/lib/python3.5/site-packages/xlrd/book.py", line 91, in open_workbook_xls
    biff_version = bk.getbof(XL_WORKBOOK_GLOBALS)
  File "/home/me/anaconda3/lib/python3.5/site-packages/xlrd/book.py", line 1226, in getbof
    opcode = self.get2bytes()
  File "/home/me/anaconda3/lib/python3.5/site-packages/xlrd/book.py", line 631, in get2bytes
    return (BYTES_ORD(hi) << 8) | BYTES_ORD(lo)
TypeError: unsupported operand type(s) for <<: 'str' and 'int'

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't looked closely at the changes in this PR (will followup later), but I'm partway through refactoring our S3 handling to use s3fs, so I might be changing this soon anyway. My branch is here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@onesandzeroes ok. well you did clean things up a bit. pls rebase / squash. ping when green. @TomAugspurger can address if necessary later.

Convert path objects to string before attempting to read

Update docstrings read_excel and ExcelFile

Add fix to release notes

pep8 formatting

Just stringify path objects using existing function

Condense logic for different file/path types
@onesandzeroes
Copy link
Contributor Author

@jreback Squashed and tests are passing, do I need to worry about the coverage checker? Haven't seen that before.

@jreback
Copy link
Contributor

jreback commented Mar 31, 2016

@onesandzeroes I think its complaining (the cov checker) about not testing the is_url calls. Can you review the tests and make sure we are testing pulling from url and s3 ok? (I think we should be). ping if ok.

@onesandzeroes
Copy link
Contributor Author

@jreback yeah there are tests for both http and file url's, and for s3 as well, those paths should be covered.

@jreback jreback closed this in e61241c Apr 1, 2016
@jreback
Copy link
Contributor

jreback commented Apr 1, 2016

@onesandzeroes thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Compat pandas objects compatability with Numpy or Python functions IO Excel read_excel, to_excel
Projects
None yet
Development

Successfully merging this pull request may close these issues.

COMPAT: read_excel not supporting pathlib.Path
3 participants