@@ -2043,8 +2043,8 @@ Reading HTML Content

 .. warning::

-   We **highly encourage** you to read the :ref:`HTML Table Parsing gotchas<gotchas.html>`
-   regarding the issues surrounding the BeautifulSoup4/html5lib/lxml parsers.
+   We **highly encourage** you to read the :ref:`HTML Table Parsing gotchas<io.html.gotchas>`
+   below regarding the issues surrounding the BeautifulSoup4/html5lib/lxml parsers.

 .. versionadded:: 0.12.0

@@ -2346,6 +2346,83 @@ Not escaped:
 Some browsers may not show a difference in the rendering of the previous two
 HTML tables.

+
+.. _io.html.gotchas:
+
+HTML Table Parsing Gotchas
+''''''''''''''''''''''''''
+
+There are some versioning issues surrounding the libraries that are used to
+parse HTML tables in the top-level pandas io function ``read_html``.
+
+**Issues with** |lxml|_
+
+* Benefits
+
+  * |lxml|_ is very fast.
+
+  * |lxml|_ requires Cython to install correctly.
+
+* Drawbacks
+
+  * |lxml|_ does *not* make any guarantees about the results of its parse
+    *unless* it is given |svm|_.
+
+  * In light of the above, we have chosen to allow you, the user, to use the
+    |lxml|_ backend, but **this backend will use** |html5lib|_ if |lxml|_
+    fails to parse.
+
+  * It is therefore *highly recommended* that you install both
+    |BeautifulSoup4|_ and |html5lib|_, so that you will still get a valid
+    result (provided everything else is valid) even if |lxml|_ fails.
+
+**Issues with** |BeautifulSoup4|_ **using** |lxml|_ **as a backend**
+
+* The above issues hold here as well since |BeautifulSoup4|_ is essentially
+  just a wrapper around a parser backend.
+
+**Issues with** |BeautifulSoup4|_ **using** |html5lib|_ **as a backend**
+
+* Benefits
+
+  * |html5lib|_ is far more lenient than |lxml|_ and consequently deals
+    with *real-life markup* in a much saner way rather than just, e.g.,
+    dropping an element without notifying you.
+
+  * |html5lib|_ *generates valid HTML5 markup from invalid markup
+    automatically*. This is extremely important for parsing HTML tables,
+    since it guarantees a valid document. However, that does NOT mean that
+    it is "correct", since the process of fixing markup does not have a
+    single definition.
+
+  * |html5lib|_ is pure Python and requires no additional build steps beyond
+    its own installation.
+
+* Drawbacks
+
+  * The biggest drawback to using |html5lib|_ is that it is slow as
+    molasses. However, consider the fact that many tables on the web are not
+    big enough for the parsing algorithm runtime to matter. It is more
+    likely that the bottleneck will be in the process of reading the raw
+    text from the URL over the web, i.e., IO (input-output). For very large
+    tables, this might not be true.
+
+
+.. |svm| replace:: **strictly valid markup**
+.. _svm: http://validator.w3.org/docs/help.html#validation_basics
+
+.. |html5lib| replace:: **html5lib**
+.. _html5lib: https://github.com/html5lib/html5lib-python
+
+.. |BeautifulSoup4| replace:: **BeautifulSoup4**
+.. _BeautifulSoup4: http://www.crummy.com/software/BeautifulSoup
+
+.. |lxml| replace:: **lxml**
+.. _lxml: http://lxml.de
+
+
+
+
 .. _io.excel:

 Excel files
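
The fallback described in the added "HTML Table Parsing Gotchas" section (try a fast, strict parser first, then hand the markup to a lenient one if it fails) can be illustrated with a minimal standard-library sketch. This is not how pandas implements ``read_html``. As assumptions for illustration only, ``xml.etree.ElementTree`` stands in for a strict parser that needs valid markup, ``html.parser.HTMLParser`` stands in for a lenient parser, and the names ``parse_strict``, ``CellCollector`` and ``parse_table`` are hypothetical.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Lenient stand-in: collect the text of <td>/<th> cells row by row.

    Hypothetical helper for illustration; it tolerates unclosed cell tags.
    """

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            if not self.rows:
                # A cell that appears before any <tr> still starts a row.
                self.rows.append([])
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th", "tr"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


def parse_strict(markup):
    """Strict stand-in: raises ET.ParseError unless the markup is well formed."""
    root = ET.fromstring(markup)
    return [
        [(cell.text or "").strip() for cell in row if cell.tag in ("td", "th")]
        for row in root.iter("tr")
    ]


def parse_table(markup):
    """Try the strict parser first and fall back to the lenient one.

    Returns the extracted rows and the name of the path that produced them.
    """
    try:
        return parse_strict(markup), "strict"
    except ET.ParseError:
        collector = CellCollector()
        collector.feed(markup)
        collector.close()
        return collector.rows, "lenient"


if __name__ == "__main__":
    print(parse_table("<table><tr><td>1</td><td>2</td></tr></table>"))
    print(parse_table("<table><tr><td>1<td>2</tr></table>"))
```

With these stand-ins, well-formed markup is handled by the strict path, while a table with unclosed ``<td>`` tags falls through to the lenient path and still yields its cells. In pandas itself the parser is chosen through the ``flavor`` argument of ``read_html``, so this sketch only mirrors the idea, not the library's code.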