Skip to content

ENH Pass kwds from ExcelFile ctr to xlrd.open_workbook. For example, thi... #4439

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

cancan101
Copy link
Contributor

...s allows setting formatting_info=True (GH4438)

closes #4438

…this allows setting formatting_info=True (GH4438)
@@ -78,10 +78,10 @@ def __init__(self, path_or_buf, kind=None, **kwds):
self.tmpfile = None

if isinstance(path_or_buf, compat.string_types):
self.book = xlrd.open_workbook(path_or_buf)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should these be named kwds? or can open_workbook deal with any?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

xlrd.open_workbook accepts a number of parameters: http://www.lexicon.net/sjmachin/xlrd.html#xlrd.open_workbook-function

That being said I saw the kwd argument in the ExcelFile constructor. Currently that kwd is not used which means if the user tries to pass any additional parameters, or misspells them, they are silently dropped.

This change not only allows passing more arguments down to xlrd but also adds an additional runtime check.

@cpcloud
Copy link
Member

cpcloud commented Aug 3, 2013

@cancan101 Please write a test! 😄 I find it very helpful to write a test before I make a change (a hard habit to get into especially if you think the issue/change is small, but it will help you in the long run)

Also please add some documentation showing users that you can now pass keyword arguments when reading excel files.

@cancan101
Copy link
Contributor Author

@cpcloud that make sense. I will do. That being said, should I rename the parameter on ExcelFile to be something more informative and less specific to xlrd? perhaps "find_merged_cells" as opposed to formatting info ? That way parameter can be kept the same if underlying library changes.

@cancan101
Copy link
Contributor Author

@cpcloud Unit test pushed

@cancan101
Copy link
Contributor Author

@cpcloud @jreback Other than release notes, anything else I should do to get this merged?

@jtratner
Copy link
Contributor

Add a note to release.rst and squash this into one commit and I'll try to review in the interim.

@ghost ghost assigned jtratner Aug 27, 2013
@cancan101
Copy link
Contributor Author

This goes back to the question about xlrd dependence. I think my commit is still valid, but do we want to add an additional parameter to ExcelFile, something like load_merged_cells that is not specific to xlrd, which formatting_info=True is.

That way if xlrd is dropped as the Excel file reader, load_merged_cells can be mapped to the new library?

@cancan101
Copy link
Contributor Author

Or should formatting_info=True be on by default given the plans to support MultiIndex?

@cancan101
Copy link
Contributor Author

To be honest, we could probably hide this entire parameter from the user. If we see a need to load merged cells, such as when header is set to a be an iterable of length > 1, we could pass formatting_info=True to xlrd and then use get_effective_cell (see: #4673) to read the cells.

Thoughts?

@ghost ghost assigned ghost and jtratner Aug 28, 2013
@jtratner
Copy link
Contributor

Couldn't the test case for this just check that merged cells are produced? Does the content matter (seems like it's just what xlrd does)?

@cancan101
Copy link
Contributor Author

@jtratner I am not sure what you mean.

@jtratner
Copy link
Contributor

It's fine to pass keywords through (and that's reasonable), but 'getting the value of merged cells' is probably not something pandas should be set to do. At that point, you could just use xlrd directly.

@jtratner
Copy link
Contributor

To be clear I mean: converting merged cells into something reasonable in the outputted dataframe makes sense. It doesn't make sense to add something to the public api targeted specifically at merged cells.

@cancan101
Copy link
Contributor Author

@jtratner I am still not sure what you mean. Maybe I was unclear with my comment: I was just saying there might be a smart way of allowing us to parse merged cells in the ExcelParser without the user having to specify formatting_info=True

@cancan101
Copy link
Contributor Author

@jtratner I think we are in agreement. When I said "hide the parameter" what I should have said was not "not add the parameter to the public API"

merged_cells = sheet.merged_cells

self.assertEquals(len(merged_cells), 1)
rlo, rhi, clo, chi = merged_cells[0]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it necessary to test the actual parameters? merged_cells won't even be on the object if you don't pass formatting_info=True

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is an object, the length is just =0

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's close this PR. Just create a new one that tests that you can pass an xlrd workbook to the xlrd reader and that it all works. Then none of these tests will be necessary.

@jtratner
Copy link
Contributor

@cancan101 a few final things to get this accepted:

  1. please revise your test case to be shorter (i.e., just check that merged_cells is there and has length 1 - the rest is just an unnecessary smoke test for xlrd).
  2. Please add a note to release.rst documenting what you've changed
  3. Add a note to the docstring for ExcelFile that specifies that kwds are passed directly to the xlrd constructor. I think you should also note that it's not guaranteed that pandas will continue to use xlrd in the future so this is discouraged. - not sure - what do you think @jreback @cpcloud ?
  4. Please squash this down to 1 commit.

@jreback
Copy link
Contributor

jreback commented Aug 30, 2013

do u think have a flavor (or maybe engine?) argument is useful here (to specify the backend) , eg would be 'xlrd' in this case (so we can offer back compat at some point if needed), or overkill ?

@jtratner
Copy link
Contributor

@jreback I think there are already PRs for using other readers/writers for Excel, so not necessarily a bad thing to have. I just don't want to be tied to supporting any keyword arguments that aren't part of the public API.

@jreback
Copy link
Contributor

jreback commented Sep 22, 2013

@jtratner is this obsolete? (as in your new framework already does this)?

@jtratner
Copy link
Contributor

no, still up for grabs as to what to do (didn't refactor ExcelFile)

@jreback
Copy link
Contributor

jreback commented Sep 22, 2013

gr8, then is PR look ok?

@cancan101
Copy link
Contributor Author

@jtratner What is the best way to make this future proof to changes in the Excel parser (ie moving away from xlrd)?

@jtratner
Copy link
Contributor

@cancan101 I think it would be better to just allow read_excel to accept workbooks too. That way, you can do whatever you want to the table first and there's no requirement that read_excel itself expose things that aren't directly necessary for creating pandas objects.

@cancan101
Copy link
Contributor Author

@jtratner As a slight modification to your suggestion:

Change ExcelFile to accept a workbook for the path_or_buf argument. That will also mean that read_excel accepts workbooks too.

@jtratner
Copy link
Contributor

yep, that's what I was thinking.

@jtratner
Copy link
Contributor

Closing in favor of adding an option to let ExcelFile take an existing workbook.

@jtratner jtratner closed this Sep 24, 2013
@cancan101
Copy link
Contributor Author

@jtratner Can you leave this PR open? I was about to push fix / change to allow passing in sheet.

@jtratner
Copy link
Contributor

you should create a new PR for that, the discussion on this was about passing kwargs...confusing to do both. Plus, there was the additional discussion about the test cases.

@jtratner
Copy link
Contributor

btw - this is why it's good to do topic branches instead of directly on your own master. Lets you have multiple PRs open.

@cancan101
Copy link
Contributor Author

Do you mind if I leave a test case that tests both:

    def test_excel_read_merged_cells(self):
        _skip_if_no_xlrd()

        import xlrd

        pth = os.path.join(self.dirpath, 'merged.xls')
        xls = ExcelFile(xlrd.open_workbook(pth, formatting_info=True))        
        book = xls.book
        sheet = book.sheet_by_index(0)
        merged_cells = sheet.merged_cells

        self.assertEquals(len(merged_cells), 1)

@cancan101
Copy link
Contributor Author

@jtratner Yea, using master was a mistake. I usually use topic branches. Once pushed, github does not allow changing branch for PR.

@jtratner
Copy link
Contributor

I don't think that's a good test case - that doesn't really tell you anything different than the following:

pth = os.path.join(self.dirpath, 'merged.xls')
book = xlrd.open_workbook(pth)
xls = ExcelFile(book)        
self.assert_(book is xls.book)

The merged cells are available because the user created the workbook instance. You should add a parse check to make sure everything works (as well as sheet name lookup, etc.)

@cancan101
Copy link
Contributor Author

Okay. Given that the user is now using xlrd directly to create the work book, I guess it does not make sense to bother testing the merged cell functionality; we can assume that xlrd handles that correctly.

@cancan101
Copy link
Contributor Author

@jtratner Is there one of the current Excel tests that you think would be best to duplicate and have run using an ExcelFile opened from a book?

@jtratner
Copy link
Contributor

can't just do something like:

with ensure_clean(pth) as pth:
    df.to_excel(pth, "SheetA")
    book = xlrd.open_workbook(pth)
    xl = ExcelFile(book)
    result = xl.parse("SheetA")
    assert_frame_equal(df, result)

Where, for now, df is something with floats...

@cancan101
Copy link
Contributor Author

Created #4961

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

formatting_info is not passed to xlrd.open_workbook.when using ExcelFile
4 participants