
ENH Factored out excel_value_to_python_value from ExcelFile::_parse_excel #4590


Merged: 1 commit merged into pandas-dev:master on Aug 26, 2013

Conversation

cancan101
Contributor

Factored out excel_value_to_python_value from ExcelFile::_parse_excel (GH4589)

closes #4589

@cancan101
Contributor Author

@jreback Should be good to merge

@jtratner
Contributor

Can you add to the description text above a short rationale for why you are doing this? The issue you link doesn't explain it either.

@jreback jreback merged commit 9fcd30b into pandas-dev:master Aug 26, 2013
@jreback
Contributor

jreback commented Aug 26, 2013

thank you!

@jreback
Contributor

jreback commented Aug 26, 2013

@jtratner I was too gung ho on this... should we take this out?

@cancan101
Contributor Author

The rationale is that some Excel files cannot be handled by ExcelFile.parse (for example, files with an odd structure), so I have to parse them by hand, iterating over the file myself. In those cases I still want to be able to reuse the logic in the excel_value_to_python_value function.
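Roughly, the kind of by-hand loop I have in mind is sketched below; the import path and signature for the factored-out helper are placeholders, as is the skipped-row condition:

```python
import xlrd

# placeholder import path/signature for the factored-out helper
from pandas.io.excel import excel_value_to_python_value

book = xlrd.open_workbook("oddly_structured.xls")
sheet = book.sheet_by_index(0)

records = []
for i in range(sheet.nrows):
    # skip the rows that ExcelFile.parse cannot handle in this file
    if sheet.cell_value(i, 0) == "SECTION TOTAL":
        continue
    records.append([
        excel_value_to_python_value(cell.value, cell.ctype, book.datemode)
        for cell in sheet.row(i)
    ])
```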

@jtratner
Contributor

@jreback well, it might have been a bit premature. It doesn't have any documentation right now and it's not labeled as private (i.e., with an _ starting it), but was factored out from a private function (_parse_excel). I don't really think it makes sense to support this as part of the pandas excel-parsing API, given that we might want to cythonize it, make it vector-based, move it elsewhere, separate out date, boolean and empty-cell handling, etc. Plus, it only works if you're using xlrd, so it's misnamed.

@cancan101 at the very least, this should have a docstring that follows the numpy conventions for labelling, plus a couple of short examples (that, preferably, would pass if run as doctests with xlrd installed).
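Something along these lines would do; the signature and the example values are illustrative only, not the actual code from this PR:

```python
def excel_value_to_python_value(value, cell_type, datemode):
    """
    Convert a raw xlrd cell value to the corresponding Python object.

    Parameters
    ----------
    value : scalar
        The raw value stored in the cell, as reported by xlrd.
    cell_type : int
        One of the xlrd ``XL_CELL_*`` type constants.
    datemode : int
        The workbook's ``datemode``, needed to interpret date cells.

    Returns
    -------
    object
        ``datetime.datetime`` for dates, ``bool`` for booleans,
        ``numpy.nan`` for empty/error cells, and the value otherwise.

    Examples
    --------
    >>> import xlrd
    >>> excel_value_to_python_value('', xlrd.XL_CELL_EMPTY, 0)
    nan
    >>> excel_value_to_python_value(1, xlrd.XL_CELL_BOOLEAN, 0)
    True
    """
```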

This also should have some kind of performance check (since, by default, it's adding a function call for every single individual cell in an imported sheet, as opposed to the previous inlined version). My guess is that overhead would be non-trivial for an Excel sheet with, for example, 500K rows.
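Even a crude timeit comparison of inlined per-cell work against the same work behind a function call would give an upper-bound feel for the call overhead; this doesn't touch a real workbook, so treat it as a rough indicator only:

```python
import timeit

# the conversion work inlined in the loop body...
inlined = timeit.timeit(
    "out = [v * 1.0 for v in values]",
    setup="values = range(500000)",
    number=10,
)

# ...versus the same work behind one function call per cell
called = timeit.timeit(
    "out = [convert(v) for v in values]",
    setup="def convert(v):\n    return v * 1.0\nvalues = range(500000)",
    number=10,
)

print("inlined: %.3fs   per-cell call: %.3fs" % (inlined, called))
```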

@cancan101
Contributor Author

@jtratner I am new to the pandas project, but in general I like to break functions up into pieces small enough that they are:

  1. More easily testable. The current _parse_excel has a lot going on, which makes it hard to test (see the test sketch after this list).
  2. More easily reusable elsewhere, either in pandas itself or by others who may want the functionality buried within the _parse_excel method.
    2a) With others using the function, there is more incentive to (1) fix bugs, (2) improve performance, and (3) add features where appropriate.
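For example, a small converter can be exercised directly, without building an ExcelFile or a fixture workbook. The import path and signature below are assumptions about the factored-out function, not its actual definition:

```python
import numpy as np
import xlrd

# placeholder import path/signature for the factored-out helper
from pandas.io.excel import excel_value_to_python_value


def test_empty_cell_is_nan():
    assert np.isnan(excel_value_to_python_value('', xlrd.XL_CELL_EMPTY, 0))


def test_boolean_cell_is_bool():
    assert excel_value_to_python_value(1, xlrd.XL_CELL_BOOLEAN, 0) is True
```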

That being said, I do understand the point of avoiding extra function calls. However, I would hate to optimize prematurely at the expense of the above.

I also understand not wanting to tie us to xlrd; I can certainly make that clearer by adding xlrd to the method name. I don't buy the Cython argument, though: an API is an API, and we can make it work even with Cython.

An observation from a couple of places within pandas: there is often useful logic buried within a larger function or class. If I am not able to use the larger function (as in this case), I have to choose between factoring out the code and copying and pasting it. I find the former preferable.

@cancan101
Contributor Author

@jtratner @jreback Are there guidelines for "private" vs "non private" methods?

@jreback
Contributor

jreback commented Aug 26, 2013

@cancan101 well, exposing something as an API is a bit non-trivial, because we have to make sure it's 'good', since pandas has to live with it. So it's a judgement call. In general we don't change something unless it's to fix a bug, add an enhancement, or make an API change; yes, occasionally we do cleanup/refactoring.

Keep in mind that your style may differ from pandas's.

@jtratner
Contributor

@cancan101 I understand your rationale - that said, some of this will boil down to a performance test (and Excel file loading is already really slow). Can you create a big Excel workbook (load it in but don't parse it) and then time the difference in parsing with the additional function call vs. without it? It would be good to test one sheet with no empty cells or dates and one with empty cells/dates (make sure parse_cols is enabled).
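Something like the sketch below would be enough; the path, sizes, and the NaN/date sprinkling are arbitrary, and you'd run it once with this PR and once without:

```python
import time

import numpy as np
import pandas as pd

# build a workbook big enough that per-cell overhead shows up
n_rows, n_cols = 50000, 20
df = pd.DataFrame(np.random.randn(n_rows, n_cols),
                  columns=["c%d" % i for i in range(n_cols)])
df.iloc[::7, 0] = np.nan                    # sprinkle in empty cells
df["when"] = pd.Timestamp("2013-08-26")     # and a date column
df.to_excel("benchmark.xlsx", index=False)

xls = pd.ExcelFile("benchmark.xlsx")        # load, but don't parse yet
start = time.time()
frame = xls.parse("Sheet1")
print("parse: %.2fs for %d cells"
      % (time.time() - start, frame.shape[0] * frame.shape[1]))
```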

On (1) - even with the excel value parsing, the method is less than 50 lines long (not including the split-out lines on the TextReader call). It's also factoring out the key thing that the function does (converting cells to appropriate python objects).

On (2a) - pandas is focused on providing methods for PandasObjects and I/O for serializing/deserializing those objects - this doesn't really make sense in that context, because it's not going to be used anywhere else (it's only for xlrd and only for parsing input data).

On public API - the more things we export, the more we have to keep track of if we want to change something later. The more minimal we can make the external API, the better (every public method we add is another method we are committing to keeping through at least the next version). Adding something to the public API also means we can't change its type signature (for example, if we wanted to pass a vector instead of a single cell). Right now, io/excel has only read_excel, ExcelFile, and ExcelWriter as its public API, with only a few public methods on the two classes.

If we were to refactor out this method, it should be a private method (i.e., named _excel_value_to_python) and not something we commit to maintaining from version to version. Also, this should probably be a method of ExcelFile instead of a separate function.
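For concreteness, roughly the shape I mean is sketched below; the cell-type handling is reconstructed from xlrd's documented XL_CELL_* constants and is illustrative, not the code from this PR:

```python
import datetime

import numpy as np
import xlrd


class ExcelFile(object):
    # only the relevant piece is sketched; the real class also holds
    # self.book, the xlrd Book opened in __init__

    def _excel_value_to_python(self, value, cell_type):
        """Convert one raw xlrd cell value to a Python object (private)."""
        if cell_type == xlrd.XL_CELL_DATE:
            # NOTE: time-only cells (year == 0) would need extra handling
            return datetime.datetime(
                *xlrd.xldate_as_tuple(value, self.book.datemode))
        if cell_type == xlrd.XL_CELL_BOOLEAN:
            return bool(value)
        if cell_type in (xlrd.XL_CELL_EMPTY, xlrd.XL_CELL_ERROR):
            return np.nan
        return value
```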

@cancan101
Contributor Author

@jreback @jtratner

That makes sense about maintaining the public API, especially since we do not want to be tied to xlrd.
I can see what I can do about testing on a large file. Do you have any sense of where the hot spot is in loading large xls files right now? Is it in I/O or CPU? If it is CPU, is it the pandas code or the xlrd library?

As far as methods for PandasObjects: I am loading the data into a DataFrame, and I do hope to add enough features (see my other PRs on merged cells) to pandas to handle the specific files I have in mind, but for the time being the current XLS loader isn't quite flexible enough to do so. That being said, it would be a shame if I had to copy and paste all of the useful code in the meantime.

As far as making the new method an instance method: I thought of doing so, but since the method did not use any instance variables I chose to make it a separate function. I can make it either an instance or class method if you think that makes sense. I can also make it private.

@jtratner
Copy link
Contributor

@cancan101 well, you can profile loading an excel file and see where the bottleneck is. That said, my experience is that the excel readers are slow and memory-heavy in and of themselves, and pandas just adds on top of that.
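e.g. with cProfile (the file name is a placeholder):

```python
import cProfile
import pstats

import pandas as pd

cProfile.run("pd.ExcelFile('big_file.xls').parse('Sheet1')", "excel.prof")
pstats.Stats("excel.prof").sort_stats("cumulative").print_stats(20)
```

If most of the cumulative time lands in xlrd internals, shaving a Python-level function call on the pandas side won't buy much; if it lands in _parse_excel, it might.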


Now that I understand you're having issues with the pandas Excel file reader, I think a better approach than these separate PRs with refactored functions (like #4673) is for you to post/identify Excel files that currently don't load well, determine what they should end up as, and then work to change pandas' Excel reader to accommodate those cases. (So, instead of adding a get_effective_cell method that you would use separately, you could contribute a PR that intelligently handles merged cells in Excel file input, e.g. by making them into a MultiIndex.) In short, instead of this PR, can you create an issue with the Excel file that isn't working right, and then work to make that load in?

At this point I think it makes sense for @jreback to revert this change, which can be incorporated into pandas later as part of a more focused PR that fixes the issues with the excel file reader that you are experiencing. You might even figure out a way to handle your files without needing to use this piece of code manually at all.

@jreback
Contributor

jreback commented Aug 26, 2013

reverted via 4f46814

@cancan101
Contributor Author

Starting to outline some of the limitations of the Excel parser: #4679
