read_html() performance when HTML malformed #14312

Closed
kokes opened this issue Sep 28, 2016 · 5 comments
Labels
IO HTML (read_html, to_html, Styler.apply, Styler.applymap) · Performance (memory or execution speed performance)

Comments

@kokes
Contributor

kokes commented Sep 28, 2016

When parsing thousands of HTML files, I noticed there was a rather large gap in performance when using pd.read_html and when parsing by hand using lxml - one that could not be attributed to the pandas overhead in handling edge cases.

I looked into it a little and found that pandas uses lxml, but only when there are no formatting issues (recover=False in the parser settings; let's call it strict mode). When strict parsing fails, it falls back on BeautifulSoup, which is an order of magnitude slower. I have recreated this performance test using two public websites and a dummy example of a generated table.

https://gist.github.com/kokes/b97c8324ba664400714a78f5561340fc

(My code in no way replicates pd.read_html, but the gap is too big to be explained by edge case detection and proper column naming.)

I would like to find out how BeautifulSoup improves upon lxml in malformed HTML files, to justify the performance gap. This may well be a non-issue if lxml is known to produce incorrect outputs - you tell me. For example, in the Wikipedia example (see the gist above), lxml fails to parse in strict mode, because it reaches a <bdi> element in one of the cells, which it does not recognise. There aren't any formatting errors, just an unknown element.
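To make the strict vs. recovering distinction concrete, here is a minimal sketch using lxml directly. The snippet and the helper name `extract_cells` are made up for illustration, and this is not how pandas structures its own parser. Whether strict mode actually rejects `<bdi>` depends on the libxml2 version lxml is built against, so the code handles both outcomes rather than assuming one.

```python
from lxml import etree


def extract_cells(html, recover):
    """Return the text of every <td>, or None if the parser rejects the input."""
    # recover=False is the "strict mode" discussed above; lxml's own default is recover=True.
    parser = etree.HTMLParser(recover=recover)
    try:
        root = etree.fromstring(html, parser)
    except etree.XMLSyntaxError:
        return None
    return ["".join(td.itertext()).strip() for td in root.iter("td")]


# Hypothetical snippet: well-formed markup containing a <bdi> element.
SNIPPET = "<table><tr><td><bdi>Prague</bdi></td><td>1.3</td></tr></table>"

strict = extract_cells(SNIPPET, recover=False)
lenient = extract_cells(SNIPPET, recover=True)
print("strict: ", strict)
print("lenient:", lenient)
```

If `strict` comes back as `None` while `lenient` contains the cell text, the input was rejected only because of parser strictness, not because the table could not be read.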

(Or maybe there's a C/Cython implementation of bs4 that could be used, I haven't explored that option, I'm still trying to understand these basics.)

Thank you!

@jreback
Contributor

jreback commented Sep 28, 2016

The HTML parser defaults to the Python parser. You could try switching it to the C engine and see if that works. I believe you can tell it to use ONLY the lxml engine (rather than falling back). It might be acceptable to make that the default. If you can provide some stats on these types of changes, that would be great.
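For reference, restricting `read_html` to a single engine is done through its documented `flavor` parameter. The sketch below uses a made-up table and assumes pandas and lxml are installed; it only shows the call shape and says nothing about how a malformed page would behave.

```python
from io import StringIO

import pandas as pd

# Hypothetical, well-formed table used only to demonstrate the call.
HTML = """
<table>
  <thead><tr><th>city</th><th>population_m</th></tr></thead>
  <tbody>
    <tr><td>Prague</td><td>1.3</td></tr>
    <tr><td>Brno</td><td>0.4</td></tr>
  </tbody>
</table>
"""

# flavor="lxml" asks pandas to use lxml only, with no fallback to bs4/html5lib.
tables = pd.read_html(StringIO(HTML), flavor="lxml")
df = tables[0]
print(df)
```

With `flavor="lxml"`, a page that lxml cannot handle raises an error instead of silently taking the slower path, which makes it easier to see how often the fallback would have been used.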

@jreback jreback added Performance Memory or execution speed performance IO HTML read_html, to_html, Styler.apply, Styler.applymap labels Sep 28, 2016
@kokes
Contributor Author

kokes commented Sep 28, 2016

Just to clarify - this is not about Python vs. C(ython) in lxml, this is about lxml vs. bs4, where the latter is an order of magnitude slower, but gets used very frequently due to the strict mode setting in lxml (which is not the default for lxml, it was opted for).

I'm happy to provide any stats necessary. One way could be to try various HTML files from the web and see whether non-strict lxml produces the same DataFrame as bs4. (Unless there's already a rich test suite for this.)
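A comparison like that could be driven by a small harness along these lines. This is only a sketch: `find_disagreements` is a hypothetical helper, and the two parsers passed in the demo are toy stand-ins, not lxml or bs4. In a real run each callable would wrap one engine and return a plain comparable value such as nested lists of cell strings (DataFrames themselves would need `DataFrame.equals` rather than `!=`).

```python
def find_disagreements(documents, parse_a, parse_b):
    """Return the names of documents on which the two parsers disagree.

    documents: mapping of name -> HTML text
    parse_a, parse_b: callables taking HTML text and returning a plain,
        comparable result; a raised exception is recorded as an outcome too.
    """
    def safe(parse, html):
        try:
            return ("ok", parse(html))
        except Exception as exc:  # a parser rejecting the input is itself a result
            return ("error", type(exc).__name__)

    return [
        name
        for name, html in documents.items()
        if safe(parse_a, html) != safe(parse_b, html)
    ]


# Toy stand-ins for two engines: they agree on "clean" input but not on "messy".
docs = {"clean": "a b", "messy": "a  b"}
mismatches = find_disagreements(docs, str.split, lambda s: s.split(" "))
print(mismatches)
```

The list of mismatching file names is the interesting output: each entry is a candidate test case showing where lenient lxml and the fallback parser diverge.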

@Amaelb

Amaelb commented Nov 9, 2016

Take a look at this:
#7220
It is a bit dated; would you say that lxml fares better now?

http://pandas.pydata.org/pandas-docs/stable/gotchas.html#html-gotchas

@WillAyd
Member

WillAyd commented Dec 11, 2018

This was actually changed in #20293 to give lxml more power on the recovery side of things, so I don't think this is relevant any longer.

@WillAyd WillAyd closed this as completed Dec 11, 2018
@kokes
Contributor Author

kokes commented Dec 12, 2018

I've tested this and where we saw a 30x slowdown ("pure" lxml vs. pandas), there is now a 2-5x slowdown, which can be attributed to constructing dataframes and other housekeeping.

Thanks!
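A slowdown factor like the ones quoted above can be measured with a best-of-N timing loop. The sketch below is not the script from the gist; `best_time` and `slowdown` are hypothetical helpers, and the two workloads are stand-ins for "parse with pandas" and "parse with lxml by hand".

```python
import time


def best_time(func, repeat=5):
    """Return the fastest wall-clock time of `repeat` calls to func()."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return best


def slowdown(candidate, baseline, repeat=5):
    """Return how many times slower `candidate` is than `baseline`."""
    return best_time(candidate, repeat) / best_time(baseline, repeat)


# Stand-in workloads; in practice these would be the two parsing code paths.
ratio = slowdown(lambda: sum(range(1_000_000)), lambda: sum(range(10_000)))
print(f"candidate is {ratio:.1f}x slower than baseline")
```

Taking the minimum over several runs reduces noise from other processes, which matters when the two code paths differ by only a small factor.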
