ENH Factored out excel_value_to_python_value from ExcelFile::_parse_excel (GH4589) #4590
Conversation
@jreback Should be good to merge.
Can you add a short rationale for why you are doing this to the description text above? The issue you link doesn't explain it either.
thank you!
@jtratner was too gung-ho on this... should we take this out?
The rationale is that there are some …
@jreback well, it might have been a bit premature. It doesn't have any documentation right now and it's not labeled as private (i.e., with an underscore).

@cancan101 At the very least, this should have a docstring that follows the numpy conventions, plus a couple of short examples (that, preferably, would pass if run as doctests with xlrd installed). This also needs some kind of performance check, since by default it adds a function call for every single cell in an imported sheet, as opposed to the previous inlined version. My guess is that the overhead would be non-trivial for an Excel sheet with, for example, 500K rows.
@jtratner I am new to the pandas project, but in general I like to break up functions into pieces that are small enough to be:
That being said, I do understand the purpose of avoiding function calls. However, I would hate to be prematurely optimizing at the expense of the above. I also understand not wanting to tie us to … .

An observation I have made in a couple of places within pandas: there is some useful logic buried within a larger function or class. If I am not able to use the larger function (as in this case), I have to choose between factoring out the code or copying and pasting it. I find the former preferable.
@cancan101 well, exposing something as an API is a bit non-trivial, because we have to make sure it's 'good', since pandas has to live with it. So it's a judgement call. In general we don't change something unless it's to fix a bug, an enhancement, or an API change; yes, occasionally we do cleaning/refactoring. Keep in mind that your style may differ from pandas'.
@cancan101 I understand your rationale; that said, some of this will boil down to a performance test (and Excel file loading is already really slow). Can you create a big Excel workbook (load it in, but don't parse it yet) and then time the difference in parsing time with the additional function call vs. without it? It would be good to test one sheet that has no empty cells or dates and one with empty cells/dates (make sure parse_cols is enabled).

On (1): even with the excel value parsing, the method is less than 50 lines long (not including the split-out lines on the TextReader call). It's also factoring out the key thing that the function does (converting cells to appropriate Python objects).

On (2a): pandas is focused on providing methods for … .

On the public API: the more things we export, the more we have to keep track of if we want to change something later. The more minimal we can make the external API, the better (every public method we add is another method we are committing to keeping through at least the next version). Adding something to the public API also means we can't change its type signature (for example, if we wanted to pass a vector instead of a single cell). Right now, io/excel has only … .

If we were to refactor out this method, it should be a private method (i.e., named …).
That makes sense about maintaining the public API, especially since we do not want to be tied to … .

As far as methods for … .

As far as making the new method an instance method: I thought of doing so, but since the method did not use any instance variables I chose to make it a separate function. I can make it either an instance or class method if you think that makes sense. I can also make it private.
@cancan101 well, you can profile loading an Excel file and see where the bottleneck is. That said, my experience is that the Excel readers are slow and memory-heavy in and of themselves, and pandas just adds on top of that.

Now I understand that you're having issues with the pandas Excel file reader. I think a better approach than these separate PRs (like #4673) with refactored functions is for you to post/identify Excel files that currently don't load well, determine what they should end up as, and then work to change pandas' Excel reader to accommodate those cases (so, instead of adding a …).

At this point I think it makes sense for @jreback to revert this change, which can be incorporated into pandas later as part of a more focused PR that fixes the issues with the Excel file reader that you are experiencing. You might even figure out a way to handle your files without needing to use this piece of code manually at all.
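A minimal profiling sketch along those lines, using the stdlib `cProfile`/`pstats`; the workbook-loading call here is a cheap hypothetical stand-in (e.g. for `pd.ExcelFile("big.xls").parse(...)`) so the snippet runs without an actual .xls file:

```python
import cProfile
import io
import pstats

def load_workbook():
    # Stand-in for the real call, e.g. pd.ExcelFile("big.xls").parse("Sheet1");
    # replaced with cheap work so the sketch runs without an Excel file.
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
load_workbook()
profiler.disable()

stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(10)  # top 10 by cumulative time
print(stream.getvalue())
```

With the real reader in place of the stand-in, the top entries show whether the time is spent in xlrd itself or in pandas' per-cell conversion loop.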
reverted via 4f46814 |
Starting to outline some of the limitations of the Excel parser: #4679 |
closes #4589