Suggestions for html table parsing #7220
Comments
@klonuo Thanks for the comments. For 2, I think that's a fairly simple fix, and if you want to submit a PR that would be most welcome. Otherwise I can put something up for v0.14.1. For 1, we chose
@cpcloud Thanks for your prompt reply. I think it's best if you could provide the fix, as I'm not familiar with pandas coding rules or internals. I just browsed the html module because I was getting errors, while parsing manually with lxml didn't have problems. As for 1, I reported what I find unfortunate. Again, I don't know libxml2 internals, but from what I read, recovery should be a pretty advanced feature. IMHO disabling it shouldn't slow the parsing, nor should enabling it drop valuable data, but let's assume you did tests before applying it. Still, a missing closing paragraph tag won't let me see the table ;)
Yep, we test everything, and that is in fact how I discovered data were being dropped. I compared it with the results of the html5lib parser, and the latter handles invalid HTML in a saner way than lxml. It was a conscious, data-driven decision to have it behave this way.
OK, your call. Maybe having it optional, defaulting to False, won't introduce too many arguments, while allowing users an option to parse non-strict HTML instead of failing.
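A minimal sketch of what such an opt-in could look like when calling lxml directly, not how pandas implements or exposes it. The helper names `find_tables` and `find_tables_with_fallback` are hypothetical. Which malformed inputs actually make the strict parser raise depends on the libxml2 version, so the sketch only reports which mode succeeded:

```python
from lxml import etree

# Deliberately sloppy markup: an unclosed <p> and a stray </div>.
BROKEN_HTML = b"""<html><body>
<p>Intro paragraph with no closing tag
<table><tr><td>1</td><td>2</td></tr></table>
</div>
</body></html>"""


def find_tables(data, recover):
    """Parse HTML bytes with lxml and return all <table> elements."""
    # recover=True is lxml's default; recover=False raises on parse errors.
    parser = etree.HTMLParser(recover=recover)
    root = etree.fromstring(data, parser)
    return root.xpath("//table")


def find_tables_with_fallback(data):
    """Try strict parsing first; fall back to the recovering parser."""
    try:
        return find_tables(data, recover=False), "strict"
    except etree.XMLSyntaxError:
        return find_tables(data, recover=True), "recovered"


tables, mode = find_tables_with_fallback(BROKEN_HTML)
cells = tables[0].xpath(".//td/text()")
print(mode, cells)
```

The fallback keeps strict parsing as the default, so any data-dropping by the recovering parser only happens when the strict pass has already failed.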
@klonuo Fair enough, simple tables don't work. Thanks for the report.
Hi,
I have lxml, and when trying to parse html with tables through pandas I found some problems:
1. The parser is created with recover=False, contrary to lxml's default value. This is unfortunate because the parser will fail on any problem in the HTML page, like a missing closing tag or similar, which IMHO happens too often to be justifiable. If such strict rules should be considered, I'd suggest parsing the HTML document with default parser values, and then applying this restriction only on table fragments.
2. pandas.io.html.read_html() [...] and lxml doesn't do magic unless the http-equiv attribute is correctly declared in the HTML document, or an encoding argument is passed to HTMLParser.
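To illustrate the encoding point, here is a small sketch that calls lxml directly (it does not go through pandas). The sample text and the helper name `first_cell` are made up for illustration. The page has no http-equiv declaration, so the parser has no hint about the byte encoding unless one is passed to HTMLParser:

```python
from lxml import etree

TEXT = "Łódź"
# ISO-8859-2 bytes with no <meta http-equiv=...> declaration in the markup.
RAW = (
    "<html><body><table><tr><td>" + TEXT + "</td></tr></table></body></html>"
).encode("iso-8859-2")


def first_cell(data, encoding=None):
    """Return the text of the first <td>, optionally forcing an encoding."""
    parser = etree.HTMLParser(encoding=encoding)
    root = etree.fromstring(data, parser)
    return root.xpath("string(//td)")


guessed = first_cell(RAW)                          # parser has to guess
explicit = first_cell(RAW, encoding="iso-8859-2")  # encoding stated up front
print(repr(guessed), repr(explicit))
```

With the explicit encoding the cell text round-trips; what the parser produces without it depends on libxml2's fallback behaviour, so it should not be relied on.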