-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
COMPAT: make read_excel accept path objects for the filepath (#12655) #12717
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
io = get_filepath_or_buffer(io)[0] | ||
if _PY_PATH_INSTALLED: | ||
from py.path import local as LocalPath | ||
if isinstance(io, LocalPath): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you don't need any of this
just call get_filepath_or_buffer (actually can prob eliminate almost all of the s3 checking in this function - this is all handled in that function)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Calling get_filepath_or_buffer()
on io
no matter what almost works, except for URLs where it grabs the URL and returns the data as StringIO, the data needs to be kept as bytes I think. Is that what you were thinking or am I missing something? I'm struggling to find a way to run io
through get_filepath_or_buffer()
without messing up the URL case or losing track of what we originally had.
The existing logic obviously works and we can keep it as long as we convert path objects to string before hitting that logic.
OK, what if we just use the |
@@ -207,6 +210,10 @@ def __init__(self, io, **kwds): | |||
if engine is not None and engine != 'xlrd': | |||
raise ValueError("Unknown engine: %s" % engine) | |||
|
|||
# io can be a path object, just convert these to string (doesn't alter | |||
# other objects) | |||
io = _stringify_path(io) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, I actually really like to make this work with get_filepath_or_buffer
directly, rather than special casing (and using this here). IIRC you put a comment why this wasn't going to work. But is there a way to modify things so that we can get it to work?
Because this is the workhorse function and its a shame if anyhing has to deviate as things inevitably get missed.
@@ -169,3 +169,4 @@ Bug Fixes | |||
|
|||
- Bug in ``pivot_table`` when ``margins=True`` and ``dropna=True`` where nulls still contributed to margin count (:issue:`12577`) | |||
- Bug in ``Series.name`` when ``name`` attribute can be a hashable type (:issue:`12610`) | |||
- ``read_excel`` now accepts path objects for the file path, in line with other ``read_*`` functions (:issue:`12655) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need another backtick after the issue number
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also change path to pathlib.Path
to be a bit clearer.
d196854
to
9716337
Compare
elif engine == 'xlrd' and isinstance(io, xlrd.Book): | ||
# If io is a url, want to keep the data as bytes so can't pass | ||
# to get_filepath_or_buffer() | ||
if _is_url(io): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice. can we do this internally in get_filepath_or_buffer
? (looks like it should just work)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like other cases where get_filepath_or_buffer()
is used on URLs, getting the data back as StringIO
is OK, so at the moment it grabs the data and converts to unicode automatically. We don't want that here, so I guess we would need a flag to tell it not to convert or something?
Couldn't figure out how to do it myself, so left it as a special case in ExcelFile.__init__
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it should return an encoded stream as BytesIO
if needed (or StringIO
) otherwise. These are hard cases and I believe they all work. If you cann't find a place where you have a failure don't make this a special case (as it would then need special tests and so on).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I don't special case URLs here, I get failures in URL related tests, get_filepath_or_buffer()
returns the data as StringIO
and reading from urls fails with errors like:
Traceback (most recent call last):
File "/home/me/src/pandas/reading_tests.py", line 8, in <module>
url_table = pd.read_excel(url)
File "/home/me/src/pandas/pandas/io/excel.py", line 170, in read_excel
io = ExcelFile(io, engine=engine)
File "/home/me/src/pandas/pandas/io/excel.py", line 225, in __init__
self.book = xlrd.open_workbook(file_contents=data)
File "/home/me/anaconda3/lib/python3.5/site-packages/xlrd/__init__.py", line 441, in open_workbook
ragged_rows=ragged_rows,
File "/home/me/anaconda3/lib/python3.5/site-packages/xlrd/book.py", line 91, in open_workbook_xls
biff_version = bk.getbof(XL_WORKBOOK_GLOBALS)
File "/home/me/anaconda3/lib/python3.5/site-packages/xlrd/book.py", line 1226, in getbof
opcode = self.get2bytes()
File "/home/me/anaconda3/lib/python3.5/site-packages/xlrd/book.py", line 631, in get2bytes
return (BYTES_ORD(hi) << 8) | BYTES_ORD(lo)
TypeError: unsupported operand type(s) for <<: 'str' and 'int'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@onesandzeroes ok. well you did clean things up a bit. pls rebase / squash. ping when green. @TomAugspurger can address if necessary later.
Convert path objects to string before attempting to read Update docstrings read_excel and ExcelFile Add fix to release notes pep8 formatting Just stringify path objects using existing function Condense logic for different file/path types
9716337
to
ad8a189
Compare
@jreback Squashed and tests are passing, do I need to worry about the coverage checker? Haven't seen that before. |
@onesandzeroes I think its complaining (the cov checker) about not testing the |
@jreback yeah there are tests for both http and file url's, and for s3 as well, those paths should be covered. |
@onesandzeroes thanks! |
git diff upstream/master | flake8 --diff
Fairly simple fix overall, since
get_filepath_or_buffer
is already set up to just convert the path objs to string, we can do that first and then just let the existing logic handle it.Since the libraries that contain the path objects aren't hard dependencies, I'm not sure how to typecheck against them without first using the
_IS_LIBRARY_INSTALLED
checks and then importing if they're installed. Let me know if this should be done differently.