ENH Pass kwds from ExcelFile ctr to xlrd.open_workbook. For example, thi... #4439

cancan101 · 2013-08-02T04:14:51Z

...s allows setting formatting_info=True (GH4438)

…this allows setting formatting_info=True (GH4438)

jreback · 2013-08-02T12:15:12Z

pandas/io/excel.py

@@ -78,10 +78,10 @@ def __init__(self, path_or_buf, kind=None, **kwds):
        self.tmpfile = None

        if isinstance(path_or_buf, compat.string_types):
-            self.book = xlrd.open_workbook(path_or_buf)


should these be named kwds? or can open_workbook deal with any?

xlrd.open_workbook accepts a number of parameters: http://www.lexicon.net/sjmachin/xlrd.html#xlrd.open_workbook-function

That being said I saw the kwd argument in the ExcelFile constructor. Currently that kwd is not used which means if the user tries to pass any additional parameters, or misspells them, they are silently dropped.

This change not only allows passing more arguments down to xlrd but also adds an additional runtime check.

cpcloud · 2013-08-03T16:50:41Z

@cancan101 Please write a test! 😄 I find it very helpful to write a test before I make a change (a hard habit to get into especially if you think the issue/change is small, but it will help you in the long run)

Also please add some documentation showing users that you can now pass keyword arguments when reading excel files.

cancan101 · 2013-08-04T21:05:31Z

@cpcloud that make sense. I will do. That being said, should I rename the parameter on ExcelFile to be something more informative and less specific to xlrd? perhaps "find_merged_cells" as opposed to formatting info ? That way parameter can be kept the same if underlying library changes.

cancan101 · 2013-08-16T23:29:19Z

@cpcloud Unit test pushed

so that merged cells may be read (GH4438)

cancan101 · 2013-08-26T00:58:31Z

@cpcloud @jreback Other than release notes, anything else I should do to get this merged?

jtratner · 2013-08-27T18:37:14Z

Add a note to release.rst and squash this into one commit and I'll try to review in the interim.

cancan101 · 2013-08-27T18:51:17Z

This goes back to the question about xlrd dependence. I think my commit is still valid, but do we want to add an additional parameter to ExcelFile, something like load_merged_cells that is not specific to xlrd, which formatting_info=True is.

That way if xlrd is dropped as the Excel file reader, load_merged_cells can be mapped to the new library?

cancan101 · 2013-08-27T18:55:17Z

Or should formatting_info=True be on by default given the plans to support MultiIndex?

cancan101 · 2013-08-28T03:21:48Z

To be honest, we could probably hide this entire parameter from the user. If we see a need to load merged cells, such as when header is set to a be an iterable of length > 1, we could pass formatting_info=True to xlrd and then use get_effective_cell (see: #4673) to read the cells.

Thoughts?

jtratner · 2013-08-28T04:13:17Z

Couldn't the test case for this just check that merged cells are produced? Does the content matter (seems like it's just what xlrd does)?

cancan101 · 2013-08-28T04:14:20Z

@jtratner I am not sure what you mean.

jtratner · 2013-08-28T04:19:13Z

It's fine to pass keywords through (and that's reasonable), but 'getting the value of merged cells' is probably not something pandas should be set to do. At that point, you could just use xlrd directly.

jtratner · 2013-08-28T04:21:25Z

To be clear I mean: converting merged cells into something reasonable in the outputted dataframe makes sense. It doesn't make sense to add something to the public api targeted specifically at merged cells.

cancan101 · 2013-08-28T04:21:50Z

@jtratner I am still not sure what you mean. Maybe I was unclear with my comment: I was just saying there might be a smart way of allowing us to parse merged cells in the ExcelParser without the user having to specify formatting_info=True

cancan101 · 2013-08-28T04:22:23Z

@jtratner I think we are in agreement. When I said "hide the parameter" what I should have said was not "not add the parameter to the public API"

jtratner · 2013-08-30T21:50:20Z

pandas/io/tests/test_excel.py

+        merged_cells = sheet.merged_cells
+
+        self.assertEquals(len(merged_cells), 1)
+        rlo, rhi, clo, chi = merged_cells[0]


is it necessary to test the actual parameters? merged_cells won't even be on the object if you don't pass formatting_info=True

It is an object, the length is just =0

let's close this PR. Just create a new one that tests that you can pass an xlrd workbook to the xlrd reader and that it all works. Then none of these tests will be necessary.

jtratner · 2013-08-30T21:55:58Z

@cancan101 a few final things to get this accepted:

please revise your test case to be shorter (i.e., just check that merged_cells is there and has length 1 - the rest is just an unnecessary smoke test for xlrd).
Please add a note to release.rst documenting what you've changed
Add a note to the docstring for ExcelFile that specifies that kwds are passed directly to the xlrd constructor. I think you should also note that it's not guaranteed that pandas will continue to use xlrd in the future so this is discouraged. - not sure - what do you think @jreback @cpcloud ?
Please squash this down to 1 commit.

jreback · 2013-08-30T22:00:34Z

do u think have a flavor (or maybe engine?) argument is useful here (to specify the backend) , eg would be 'xlrd' in this case (so we can offer back compat at some point if needed), or overkill ?

jtratner · 2013-08-30T22:08:10Z

@jreback I think there are already PRs for using other readers/writers for Excel, so not necessarily a bad thing to have. I just don't want to be tied to supporting any keyword arguments that aren't part of the public API.

jreback · 2013-09-22T20:23:25Z

@jtratner is this obsolete? (as in your new framework already does this)?

jtratner · 2013-09-22T20:24:09Z

no, still up for grabs as to what to do (didn't refactor ExcelFile)

jreback · 2013-09-22T20:25:25Z

gr8, then is PR look ok?

cancan101 · 2013-09-24T02:38:46Z

@jtratner What is the best way to make this future proof to changes in the Excel parser (ie moving away from xlrd)?

jtratner · 2013-09-24T03:09:45Z

@cancan101 I think it would be better to just allow read_excel to accept workbooks too. That way, you can do whatever you want to the table first and there's no requirement that read_excel itself expose things that aren't directly necessary for creating pandas objects.

cancan101 · 2013-09-24T03:33:21Z

@jtratner As a slight modification to your suggestion:

Change ExcelFile to accept a workbook for the path_or_buf argument. That will also mean that read_excel accepts workbooks too.

jtratner · 2013-09-24T03:33:58Z

yep, that's what I was thinking.

jtratner · 2013-09-24T03:43:05Z

Closing in favor of adding an option to let ExcelFile take an existing workbook.

cancan101 · 2013-09-24T03:44:02Z

@jtratner Can you leave this PR open? I was about to push fix / change to allow passing in sheet.

jtratner · 2013-09-24T03:45:15Z

you should create a new PR for that, the discussion on this was about passing kwargs...confusing to do both. Plus, there was the additional discussion about the test cases.

jtratner · 2013-09-24T03:45:38Z

btw - this is why it's good to do topic branches instead of directly on your own master. Lets you have multiple PRs open.

cancan101 · 2013-09-24T03:46:08Z

Do you mind if I leave a test case that tests both:

    def test_excel_read_merged_cells(self):
        _skip_if_no_xlrd()

        import xlrd

        pth = os.path.join(self.dirpath, 'merged.xls')
        xls = ExcelFile(xlrd.open_workbook(pth, formatting_info=True))        
        book = xls.book
        sheet = book.sheet_by_index(0)
        merged_cells = sheet.merged_cells

        self.assertEquals(len(merged_cells), 1)

cancan101 · 2013-09-24T03:46:48Z

@jtratner Yea, using master was a mistake. I usually use topic branches. Once pushed, github does not allow changing branch for PR.

jtratner · 2013-09-24T03:50:05Z

I don't think that's a good test case - that doesn't really tell you anything different than the following:

pth = os.path.join(self.dirpath, 'merged.xls')
book = xlrd.open_workbook(pth)
xls = ExcelFile(book)        
self.assert_(book is xls.book)

The merged cells are available because the user created the workbook instance. You should add a parse check to make sure everything works (as well as sheet name lookup, etc.)

cancan101 · 2013-09-24T03:52:31Z

Okay. Given that the user is now using xlrd directly to create the work book, I guess it does not make sense to bother testing the merged cell functionality; we can assume that xlrd handles that correctly.

cancan101 · 2013-09-24T03:57:34Z

@jtratner Is there one of the current Excel tests that you think would be best to duplicate and have run using an ExcelFile opened from a book?

jtratner · 2013-09-24T03:59:34Z

can't just do something like:

with ensure_clean(pth) as pth:
    df.to_excel(pth, "SheetA")
    book = xlrd.open_workbook(pth)
    xl = ExcelFile(book)
    result = xl.parse("SheetA")
    assert_frame_equal(df, result)

Where, for now, df is something with floats...

cancan101 · 2013-09-24T05:01:36Z

Created #4961

ENH Pass kwds from ExcelFile ctr to xlrd.open_workbook. For example, …

3360951

…this allows setting formatting_info=True (GH4438)

jreback reviewed Aug 2, 2013
View reviewed changes

ENH Unit test for setting formatting_info=True when creating ExcelFile

820bc5d

so that merged cells may be read (GH4438)

cancan101 mentioned this pull request Aug 27, 2013

ENH: Excel - allow for multiple rows to be treated as hierarchical columns #4679

Closed

ghost assigned jtratner Aug 27, 2013

ghost assigned ghost and jtratner Aug 28, 2013

jtratner reviewed Aug 30, 2013
View reviewed changes

jtratner closed this Sep 24, 2013

ENH Pass kwds from ExcelFile ctr to xlrd.open_workbook. For example, thi... #4439

ENH Pass kwds from ExcelFile ctr to xlrd.open_workbook. For example, thi... #4439

Conversation

cancan101 commented Aug 2, 2013

jreback Aug 2, 2013

Choose a reason for hiding this comment

cancan101 Aug 2, 2013

Choose a reason for hiding this comment

cpcloud commented Aug 3, 2013

cancan101 commented Aug 4, 2013

cancan101 commented Aug 16, 2013

cancan101 commented Aug 26, 2013

jtratner commented Aug 27, 2013

cancan101 commented Aug 27, 2013

cancan101 commented Aug 27, 2013

cancan101 commented Aug 28, 2013

jtratner commented Aug 28, 2013

cancan101 commented Aug 28, 2013

jtratner commented Aug 28, 2013

jtratner commented Aug 28, 2013

cancan101 commented Aug 28, 2013

cancan101 commented Aug 28, 2013

jtratner Aug 30, 2013

Choose a reason for hiding this comment

cancan101 Sep 24, 2013

Choose a reason for hiding this comment

jtratner Sep 24, 2013

Choose a reason for hiding this comment

jtratner commented Aug 30, 2013

jreback commented Aug 30, 2013

jtratner commented Aug 30, 2013

jreback commented Sep 22, 2013

jtratner commented Sep 22, 2013

jreback commented Sep 22, 2013

cancan101 commented Sep 24, 2013

jtratner commented Sep 24, 2013

cancan101 commented Sep 24, 2013

jtratner commented Sep 24, 2013

jtratner commented Sep 24, 2013

cancan101 commented Sep 24, 2013

jtratner commented Sep 24, 2013

jtratner commented Sep 24, 2013

cancan101 commented Sep 24, 2013

cancan101 commented Sep 24, 2013

jtratner commented Sep 24, 2013

cancan101 commented Sep 24, 2013

cancan101 commented Sep 24, 2013

jtratner commented Sep 24, 2013

cancan101 commented Sep 24, 2013