-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: Open Document Format ODS support in read_excel() #9070
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
As long as the tests are not implemented, you can quickly get into the action as follows: import pandas as pd
pd.read_excel('pandas/io/tests/data/test.ods')
pd.read_excel('pandas/io/tests/data/test2.ods')
pd.read_excel('pandas/io/tests/data/test3.ods')
pd.read_excel('pandas/io/tests/data/test_converters.ods')
pd.read_excel('pandas/io/tests/data/test_types.ods')
pd.read_excel('pandas/io/tests/data/test_types2.ods') |
@davidovitch Is there a good reason why you need this functionality in
For similar but not quite identical tests, my usual approach is to use multiple inheritance: http://stackoverflow.com/a/1323554/809705 |
@davidovitch well, this is using a lot of duplicated code. You can refactor to have the I don't think adding to the namespace is useful (e.g. |
@jreback, I've refactored it as suggested: the The entire logic to derive at run-time the file type is extended and lives in Only the last reader test fails at this stage, and I have some work left in figuring out why it goes wrong. The test that currently still fails also revealed a small typo in The ods writer is still on the todo list. |
elif ext in ['xls', 'xlsx', 'xlsm']: | ||
self.engine = 'xlrd' | ||
|
||
# required imports for the respective engine |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, lots of if/then here. Can you base on the engine return a class which does this (e.g. you create a class hierarchy of a BaseFile, EZDFFile, and XLRDFile, these last 2 are sub-classes of BaseFile). Which you then instantiate based on the engine). will be much cleaner.
also need to add ezodf to the build process. This is not available on conda. But can pip install it. I can fix that so this will tests correctly. Also would need an update to the install.rst docs (as this is now an alternative to xlrd). |
So, after a long pause I am trying to get this going again. I just did a rebase with upstream and pushed my latest changes to my remote clone here. However, I think I did something not entirely correct because now I can see some other commits appearing in this pull request... @jreback, i have made an attempt to make it less messy, but I have to admit I wasn't completely sure I understood your suggestions correctly. I guess you'll let me know if this is not exactly right :-) There is now another test failing, so this is not over yet. Anyway, thanks for reviewing this. It has been an interesting and learnfull ride so far. |
@davidovitch ok, first order of biz is to rebase this. Then can see what the changes. |
@jreback, I think git history is fixed now. All but the most recent commit were doubled previously. I have removed those commits with a rebase, followed by a push origin master -f to here. |
self.engine = 'xlrd' | ||
self.extensions = ['xls', 'xlsx', 'xlsm'] | ||
self.io_class = type(None) | ||
self.open_workbook = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would pass all of these arguments to the BaseFile.__init__
and have it do the assignments
378e4a5
to
5da115e
Compare
@jreback, is this looking better? Regarding the tests for the readers (ExcelReaderTests), I added a method to determine which readers are installed, and only test for those. This is different then before where a test would skip if only one out of 3 readers was not installed. I think it looks less compact in the code, but should result in better test coverage when not all the spreadsheet readers are installed. |
Sure, let us know when you're ready On Wed, Apr 15, 2015 at 9:32 AM, David Verelst [email protected]
|
It is not for the lack of trying (you will probably not belief the amount of time I already spent on this PR), or for ignoring you comments, it is that I am failing to properly translate the comments into clean, Pythonic code that clears the quality checks of the review procedure. I agree with your assessment, but I can only conclude that I fail to boil it down into simple and clean logic. I have some more time today and tomorrow to work on this PR, lets see where we stand after that. |
@davidovitch no I believe you are trying hard! let's squash down to a single commit and rebase. I won't have time to play with this for a bit though. |
@jreback thanks for your patience. Too bad I couldn't make it for the 0.17 release, but at least I tried and learned something in the process :-) Before squashing everything into a single commit, what I think makes sense to do is the following:
If you think this it too much whatever, please say so and I will just squash into one commit and forgot about the whole thing, no hard feelings :-) @andresmrm, if you really want this badly, you can always give this fork a spin. The current implementation actually works, and passes all the tests (if you have a lot of ODS files hanging around you can test it more extensively and report back any bugs in this PR ;-) ). However, the implementation is not on par with the code quality everyone expects from a library as Pandas. |
@davidovitch Good luck and perseverance. |
…files, more consistant naming of files
@davidovitch ok on a refactored test pr (though don't include the ods files; just the new naming scheme) |
@jreback, the refactored tests (excluding ODS files and tests as requested) are in PR #10964. |
I am lost in all related PRs. What is the status of this feature (why is it closed)? I am willing to contribute somehow, but really don't know where to start. |
well, you can start with this PR and create a new one that addresses the comments. |
contributing guide is here |
@hnykda I would start with rebasing the fork again on top of the current pandas master. I've been planning to do this for a while, but I haven't got around it. Let me know if you manage/plan to do it so I don't need to that again ;-) Please feel free to dig in or ask any questions you have regarding the current implementation. |
Thanks. I unfortunately don't see that in the near future, probably after the new year. |
@hnykda I've rebased the ezodf_reader branch in my pandas fork with the pandas master in case you like to have a look at it. I might try to improve the current implementation in the near future. |
@davidovitch any plans of working on this or can I take it up? |
For what its worth I wrote a ODS reader that produces a pandas DataFrame with tests for the various ways my spreadsheets became hard to parse. I tried to stay close to the pandas _XlrdReader class |
@detrout that looks pretty good - seems way enough to adapt if you’d like to do a PR |
@jreback I hadn't done a PR because I hadn't bothered to implement a writer. Is it worth going ahead and submitting just the reader class? |
@detrout that is totally fine |
@detrout, that's really clean code. I was also thinking of going the |
@TrigonaMinima Ok I'll try to put together over the weekend. |
Quick question. odfpy uses ISO 8601 durations and one of my ods test cases has a timeduration, of PT03H45M00S. I had my own handrolled parser, because I wrote this a while ago. But pandas now has a iso8601 duration parser. Unfortunately it assumes that there will always be a day specifier and fails with PT03H45M00S, but does work with P0DT03H45M00S. So I filed #25422 Below is the quick workaround I came up with.
Is that acceptable for the moment? |
yes that’s ok but if u can do as a separate PR first |
My rough draft PR with an alternate implementation is here #25427 |
This work should eventually lead to a reader and writer for ods spreadsheets (LibreOffice / OpenOffice Open Document Format). I am doing this work in the master of my clone of pandas. From the wiki I understood it is fine for me to commit to my master and rebase with upstream from time to time. Please let me know if I should follow another git workflow (this is my first PR).
This is what I have so far:
I am not sure how to approach the following issues:
For the record, there is one message on the mailing list, and this PR will fix issue #2311.
ods support depends on ezodf, and issue #2311 mentions some of the alternatives.