COMPAT: make read_excel accept path objects for the filepath (#12655) #12717

onesandzeroes · 2016-03-26T11:35:05Z

closes COMPAT: read_excel not supporting pathlib.Path #12655
tests added / passed
passes git diff upstream/master | flake8 --diff
whatsnew entry

Fairly simple fix overall, since get_filepath_or_buffer is already set up to just convert the path objs to string, we can do that first and then just let the existing logic handle it.

Since the libraries that contain the path objects aren't hard dependencies, I'm not sure how to typecheck against them without first using the _IS_LIBRARY_INSTALLED checks and then importing if they're installed. Let me know if this should be done differently.

jreback · 2016-03-26T12:11:37Z

pandas/io/excel.py

+                io = get_filepath_or_buffer(io)[0]
+        if _PY_PATH_INSTALLED:
+            from py.path import local as LocalPath
+            if isinstance(io, LocalPath):


you don't need any of this
just call get_filepath_or_buffer (actually can prob eliminate almost all of the s3 checking in this function - this is all handled in that function)

Calling get_filepath_or_buffer() on io no matter what almost works, except for URLs where it grabs the URL and returns the data as StringIO, the data needs to be kept as bytes I think. Is that what you were thinking or am I missing something? I'm struggling to find a way to run io through get_filepath_or_buffer() without messing up the URL case or losing track of what we originally had.

The existing logic obviously works and we can keep it as long as we convert path objects to string before hitting that logic.

onesandzeroes · 2016-03-27T02:33:18Z

OK, what if we just use the _stringify_path() function that already exists to avoid the ugly imports? I'm pretty sure ExcelFile needs to do slightly different things with URLs than what is done in get_filepath_or_buffer, in terms of getting the data as string vs bytes.

jreback · 2016-03-29T13:34:32Z

pandas/io/excel.py

@@ -207,6 +210,10 @@ def __init__(self, io, **kwds):
        if engine is not None and engine != 'xlrd':
            raise ValueError("Unknown engine: %s" % engine)

+        # io can be a path object, just convert these to string (doesn't alter
+        # other objects)
+        io = _stringify_path(io)


yeah, I actually really like to make this work with get_filepath_or_buffer directly, rather than special casing (and using this here). IIRC you put a comment why this wasn't going to work. But is there a way to modify things so that we can get it to work?

Because this is the workhorse function and its a shame if anyhing has to deviate as things inevitably get missed.

jreback · 2016-03-29T13:35:22Z

doc/source/whatsnew/v0.18.1.txt

@@ -169,3 +169,4 @@ Bug Fixes

 - Bug in ``pivot_table`` when ``margins=True`` and ``dropna=True`` where nulls still contributed to margin count (:issue:`12577`)
 - Bug in ``Series.name`` when ``name`` attribute can be a hashable type (:issue:`12610`)
+- ``read_excel`` now accepts path objects for the file path, in line with other ``read_*`` functions (:issue:`12655)


need another backtick after the issue number

Also change path to pathlib.Path to be a bit clearer.

jreback · 2016-03-30T12:57:41Z

pandas/io/excel.py

-        elif engine == 'xlrd' and isinstance(io, xlrd.Book):
+        # If io is a url, want to keep the data as bytes so can't pass
+        # to get_filepath_or_buffer()
+        if _is_url(io):


nice. can we do this internally in get_filepath_or_buffer? (looks like it should just work)

It looks like other cases where get_filepath_or_buffer() is used on URLs, getting the data back as StringIO is OK, so at the moment it grabs the data and converts to unicode automatically. We don't want that here, so I guess we would need a flag to tell it not to convert or something?

Couldn't figure out how to do it myself, so left it as a special case in ExcelFile.__init__.

it should return an encoded stream as BytesIO if needed (or StringIO) otherwise. These are hard cases and I believe they all work. If you cann't find a place where you have a failure don't make this a special case (as it would then need special tests and so on).

If I don't special case URLs here, I get failures in URL related tests, get_filepath_or_buffer() returns the data as StringIO and reading from urls fails with errors like:

Traceback (most recent call last): File "/home/me/src/pandas/reading_tests.py", line 8, in <module> url_table = pd.read_excel(url) File "/home/me/src/pandas/pandas/io/excel.py", line 170, in read_excel io = ExcelFile(io, engine=engine) File "/home/me/src/pandas/pandas/io/excel.py", line 225, in __init__ self.book = xlrd.open_workbook(file_contents=data) File "/home/me/anaconda3/lib/python3.5/site-packages/xlrd/__init__.py", line 441, in open_workbook ragged_rows=ragged_rows, File "/home/me/anaconda3/lib/python3.5/site-packages/xlrd/book.py", line 91, in open_workbook_xls biff_version = bk.getbof(XL_WORKBOOK_GLOBALS) File "/home/me/anaconda3/lib/python3.5/site-packages/xlrd/book.py", line 1226, in getbof opcode = self.get2bytes() File "/home/me/anaconda3/lib/python3.5/site-packages/xlrd/book.py", line 631, in get2bytes return (BYTES_ORD(hi) << 8) | BYTES_ORD(lo) TypeError: unsupported operand type(s) for <<: 'str' and 'int'

I haven't looked closely at the changes in this PR (will followup later), but I'm partway through refactoring our S3 handling to use s3fs, so I might be changing this soon anyway. My branch is here.

@onesandzeroes ok. well you did clean things up a bit. pls rebase / squash. ping when green. @TomAugspurger can address if necessary later.

Convert path objects to string before attempting to read Update docstrings read_excel and ExcelFile Add fix to release notes pep8 formatting Just stringify path objects using existing function Condense logic for different file/path types

onesandzeroes · 2016-03-31T00:17:45Z

@jreback Squashed and tests are passing, do I need to worry about the coverage checker? Haven't seen that before.

jreback · 2016-03-31T13:41:30Z

@onesandzeroes I think its complaining (the cov checker) about not testing the is_url calls. Can you review the tests and make sure we are testing pulling from url and s3 ok? (I think we should be). ping if ok.

onesandzeroes · 2016-03-31T22:07:34Z

@jreback yeah there are tests for both http and file url's, and for s3 as well, those paths should be covered.

jreback · 2016-04-01T13:03:09Z

@onesandzeroes thanks!

jreback reviewed Mar 26, 2016
View reviewed changes

jreback reviewed Mar 29, 2016
View reviewed changes

jreback added Compat pandas objects compatability with Numpy or Python functions IO Excel read_excel, to_excel labels Mar 29, 2016

jreback added this to the 0.18.1 milestone Mar 29, 2016

jreback reviewed Mar 29, 2016
View reviewed changes

onesandzeroes force-pushed the excel-read branch from d196854 to 9716337 Compare March 30, 2016 12:23

jreback reviewed Mar 30, 2016
View reviewed changes

onesandzeroes force-pushed the excel-read branch from 9716337 to ad8a189 Compare March 30, 2016 22:01

jreback closed this in e61241c Apr 1, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

COMPAT: make read_excel accept path objects for the filepath (#12655) #12717

COMPAT: make read_excel accept path objects for the filepath (#12655) #12717

onesandzeroes commented Mar 26, 2016

jreback Mar 26, 2016

onesandzeroes Mar 26, 2016

onesandzeroes commented Mar 27, 2016

jreback Mar 29, 2016

jreback Mar 29, 2016

TomAugspurger Mar 30, 2016

jreback Mar 30, 2016

onesandzeroes Mar 30, 2016

jreback Mar 30, 2016

onesandzeroes Mar 30, 2016

TomAugspurger Mar 30, 2016

jreback Mar 30, 2016

onesandzeroes commented Mar 31, 2016

jreback commented Mar 31, 2016

onesandzeroes commented Mar 31, 2016

jreback commented Apr 1, 2016

COMPAT: make read_excel accept path objects for the filepath (#12655) #12717

COMPAT: make read_excel accept path objects for the filepath (#12655) #12717

Conversation

onesandzeroes commented Mar 26, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

onesandzeroes commented Mar 27, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

onesandzeroes commented Mar 31, 2016

jreback commented Mar 31, 2016

onesandzeroes commented Mar 31, 2016

jreback commented Apr 1, 2016