read_html() performance when HTML malformed #14312
Comments
The HTML parser defaults to the Python parser. You could try switching it to the C engine and see if that works. I believe you can specify it to use ONLY the lxml engine (rather than falling back). It might be acceptable to make that the default. If you could provide some stats on these types of changes, that would be great.
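For context, `pd.read_html` takes a `flavor` argument that restricts parsing to a single backend, which is one way to gather such stats. A minimal sketch, with a made-up HTML snippet (not code from this thread):

```python
from io import StringIO

import pandas as pd

# Made-up example table; any HTML containing a <table> would do.
html = """
<table>
  <tr><th>name</th><th>value</th></tr>
  <tr><td>a</td><td>1</td></tr>
  <tr><td>b</td><td>2</td></tr>
</table>
"""

# flavor pins read_html to one backend instead of the default lxml -> bs4 fallback.
tables_lxml = pd.read_html(StringIO(html), flavor="lxml")  # requires lxml
tables_bs4 = pd.read_html(StringIO(html), flavor="bs4")    # requires beautifulsoup4 + html5lib

# Check whether the two backends produce the same DataFrame.
print(tables_lxml[0].equals(tables_bs4[0]))
```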
Just to clarify - this is not about Python vs. C(ython). I'm happy to provide any stats necessary. One way could be to try various HTML files from the interwebs and see if non-strict lxml produces the same DataFrame as bs4. (Unless there's already a rich test suite for this.)
Take a look at this: http://pandas.pydata.org/pandas-docs/stable/gotchas.html#html-gotchas
This was actually changed in #20293 to give lxml more power on the recovery side of things, so I don't think this is relevant any longer.
I've tested this, and where we saw a 30x slowdown ("pure" lxml vs. pandas), there is now a 2-5x slowdown, which can be attributed to constructing DataFrames and other housekeeping. Thanks!
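For reference, one rough way such a comparison can be timed - a sketch only, with a hypothetical local file name, not the benchmark from the gist linked in the issue text below:

```python
import time
from io import StringIO

import lxml.html
import pandas as pd

path = "some_table.html"  # hypothetical local file containing at least one <table>
with open(path, encoding="utf-8") as fh:
    html = fh.read()

# "Bare" lxml: parse and pull out cell text, no DataFrame construction.
t0 = time.perf_counter()
doc = lxml.html.fromstring(html)
raw_rows = [
    [cell.text_content().strip() for cell in row.xpath("./th|./td")]
    for row in doc.xpath("(//table)[1]//tr")
]
t_lxml = time.perf_counter() - t0

# pandas: full read_html, including DataFrame construction and type inference.
t0 = time.perf_counter()
df = pd.read_html(StringIO(html), flavor="lxml")[0]
t_pandas = time.perf_counter() - t0

print(f"bare lxml: {t_lxml:.4f}s, read_html: {t_pandas:.4f}s, ratio: {t_pandas / t_lxml:.1f}x")
```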
When parsing thousands of HTML files, I noticed a rather large gap in performance between using `pd.read_html` and parsing by hand with `lxml` - one that could not be attributed to the `pandas` overhead of handling edge cases.

I looked into it a little bit and found out that `pandas` uses `lxml`, but only when there are no formatting issues (`recover=False` in the parser settings - let's call it strict mode). When there are issues, it falls back on `BeautifulSoup`, which is an order of magnitude slower. I have recreated this performance test using two public websites and a dummy example of a generated table: https://gist.github.com/kokes/b97c8324ba664400714a78f5561340fc

(My code in no way replicates `pd.read_html`, but the gap is too big to be explained by edge case detection and proper column naming.)

I would like to find out how `BeautifulSoup` improves upon `lxml` for malformed HTML files, to justify the performance gap. This may as well be a non-issue if `lxml` is known to produce incorrect outputs - you tell me. For example, in the Wikipedia example (see the gist above), `lxml` fails in strict mode because it reaches a `<bdi>` element in one of the cells, which it does not recognise. There are no formatting errors there, just an unknown element.

(Or maybe there's a C/Cython implementation of `bs4` that could be used - I haven't explored that option; I'm still trying to understand these basics.)

Thank you!
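A minimal sketch of the strict vs. recovering distinction described above, using `lxml` directly. The snippet is made up, and the claim that strict mode rejects `<bdi>` is the behaviour reported in this issue, not something the sketch guarantees on every libxml2 version:

```python
import lxml.etree as etree

# Stand-in for the Wikipedia cell mentioned above.
snippet = "<table><tr><td><bdi>42</bdi></td></tr></table>"

strict = etree.HTMLParser(recover=False)   # "strict mode": parse errors abort the parse
lenient = etree.HTMLParser(recover=True)   # lxml's default: keep going and repair

try:
    etree.fromstring(snippet, parser=strict)
    print("strict parse succeeded")
except etree.XMLSyntaxError as exc:
    print("strict parse failed:", exc)

# The recovering parser does not raise on the same input.
tree = etree.fromstring(snippet, parser=lenient)
print(etree.tostring(tree))
```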