-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
DEPR: Adjust read excel behavior for xlrd >= 2.0 #38571
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Windows failures are related, will be investigating. |
pandas/io/excel/_base.py
Outdated
else: | ||
peek = buf | ||
except FileNotFoundError: | ||
# File may be a url, return the extension |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why provide degraded inference just because the source is a URL?
It appears this code goes out of its way to avoid using seek, whereas the ODS inference code that was there before, and xlrd which is being hit in many of the existing code paths already does seek without issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previous ODS code read at most 84 bytes; current code needs the entire file. I'll do a partial revert here and utilize the previous ODS code, but I have reservations about downloading the entire file here. Would like to hear others' thoughts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah; I think I have falsely assumed we need to get the entire contents of the file. I think it should be possible to get BufferedIOBase/RawIOBase into the proper form for ZipFile without reading.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did put quite a lot of effort into #38424; the code in that PR is the way it is from, as carefully as I could, following through the various code paths and ensuring the behaviour was as simple and robust as it could be, in spite of less automated testing than I was expecting. It's tough to see these kind of comments which sort of imply that I hadn't thought any of this through...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In no way shape or form did I have any intention of implying such a thing. As my comment above says, I was mistaken.
I believe this would also replace #38456. I find the desire to support xlrd 1.2 at a time when pandas users are already upgrading a package to be both surprising and disappointing, given the potential security issues and poor parsing experience associated with sticking with xlrd 1.2. This PR has code that is more complex than #38522, even ignoring that which is in place to support the convoluted deprecation process it advocates for. I haven't seen code coverage metrics as part of the pandas PR process, but are you sure all the new conditional branches you've introduced are covered by sufficient tests? In case it wasn't clear in #38522, I believe the most robust approach in this area is to:
|
Thanks for the comments @cjw296, I was able to simplify/improve a lot from them. |
@rhshadrach - I'm confused as to why you didn't just start with #38522 and add the fallback logic you're advocating for. (corrected) |
2 linux tests failed to start, all other checks passed. Rerunning. |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
Update: Failure if from one of the builds using zh_CN.utf8 Failure is related, though I don't understand it:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regardless of which PR was started first, it's also an option to update #38522 to use the behaviour we want (fallback to xlrd if openpyxl is not installed, for xlrd < 2.0, see also #38522 (review), I think it should be fairly straightforward to add that to that PR) |
ok sure. let's try to get this in today and cut the release. i view better inference as good but shouldn't hold this up. |
@jreback @jorisvandenbossche Is there room for improving the inference done here? As far as I can tell, the inference done here is that of #38522 but also handles bytes and urls (which I believe is causing failures in #38522). |
Two of the travis builds failed to start, the third one passed. Restarted manually. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
very minor comment. pls commetn / update
@jreback - Agreed, it is, but I don't understand it. The warning is emitted when openpyxl is not installed but xlrd < 2.0 is and a non-xls file is used. Replicating this in my environment, the test passes with the warning emitted with the right stacklevel (checked manually). So maybe this is a Windows vs Linux issue? What really doesn't make sense is that Attempting to debug via CI now. |
If it only fails on windows, I think it is also fine to add a |
Will do, is this a known issue with Windows? Couldn't find an issue on github, will add one if it is. |
Actually, it's a test asserting warnings about positional arguments, not about the engine. So that might interfere with the test, because it now also raises another warning than the one it is testing? |
@jorisvandenbossche Yes, that appears to me to be the reason why it's failing now. However, it passes locally for me and stepping through the assert code, I don't see any reason why it might not work on Windows. But most likely I'm missing some peculiarity about windows.. |
I think it is certainly fine to skip the check for stacklevel here (or mayble only skip it on windows, so it's still tested generally) |
@rhshadrach sorry, I merged #38456, which will give some merge conflict pains. I can also deal with it if you like |
…rd_warnings � Conflicts: � doc/source/whatsnew/v1.2.0.rst � pandas/io/excel/_base.py
@jorisvandenbossche - Not a problem. The merge was fairly straightforward, but your review would be appreciated to make sure I didn't mess any of it up. (I am also double checking it now) The stacklevel issue was a bug in one of my previous PRs where the file path was hardcoded using '/'. Should be fixed now. Also your requested test has been added. |
Doc changes look good after the merge! |
|
||
def test_corrupt_bytes_raises(self, read_ext, engine): | ||
bad_stream = b"foo" | ||
with pytest.raises(BadZipFile, match="File is not a zip file"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could also be left for a follow-up (time to get this merged ;-)), but this is not a super clear error message in case you accidentally pass a non-excel file to read_excel
We should probably check in the inspect function if it is a zip file (like is also done here: https://github.com/pandas-dev/pandas/pull/38522/files#diff-63200ddb7f5656b8ee868a28d9cb7720ffe50689b0e3fb0b4e15cc5c0ae80dd7R942), and if not raise an error like "file is not recognized as an excel file" or so.
@rhshadrach unfortunately a couple of legit looking failures. note happy to xfail those tests for those versions is fine. |
@jreback - the test suggested by @jorisvandenbossche detected that we were not defaulting to pyxlsb for xlsb files when engine is None; opened #38667 and xfailed the test. |
excellent, ping on green-ish |
thanks @rhshadrach amazing job here. and thank you @cjw296 for your PR and the input. |
@meeseeksdev backport 1.2.x |
…38670) Co-authored-by: Richard Shadrach <[email protected]>
Thanks @rhshadrach ! |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
Alternative to #38522. I've been testing this locally using both xlrd 1.2.0 and 2.0.1.
One test fails because we used to default to xlrd but now default to openpyxl, it's not clear to me if this test should be passing with openpyxl.
cc @cjw296, @jreback, @jorisvandenbossche