@@ -344,3 +344,112 @@ where the data copying occurs.
See `this link <http://stackoverflow.com/questions/13592618/python-pandas-dataframe-thread-safe>`__
for more information.
+
+ .. _html-gotchas:
+
+ HTML Table Parsing
+ ------------------
+ There are some versioning issues surrounding the libraries that are used to
+ parse HTML tables in the top-level pandas io function ``read_html``.
+
+ **Issues with** |lxml|_
+
+ * Benefits
+
+   * |lxml|_ is very fast.
+
+   * |lxml|_ requires Cython to install correctly.
+
+ * Drawbacks
+
+   * |lxml|_ does *not* make any guarantees about the results of its parse
+     *unless* it is given |svm|_.
+
+   * In light of the above, we have chosen to allow you, the user, to use the
+     |lxml|_ backend, but **this backend will use** |html5lib|_ if |lxml|_
+     fails to parse.
+
+   * It is therefore *highly recommended* that you install both
+     |BeautifulSoup4|_ and |html5lib|_, so that you will still get a valid
+     result (provided everything else is valid) even if |lxml|_ fails.
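The fallback described above can also be made explicit in user code. The following is a minimal sketch, not the internal pandas implementation: ``read_tables_with_fallback`` is a hypothetical helper, and it assumes only that the reader callable accepts a ``flavor`` keyword, as ``pandas.read_html`` does.

```python
def read_tables_with_fallback(source, reader, flavors=("lxml", "bs4")):
    """Try each parser flavor in order and return the first that succeeds.

    `reader` is any callable that takes a `flavor` keyword argument
    (for example pandas.read_html). Returns a (flavor, tables) pair.
    """
    errors = []
    for flavor in flavors:
        try:
            return flavor, reader(source, flavor=flavor)
        except Exception as exc:
            # A missing parser library, a parse failure, or "no tables
            # found" all land here; remember the error and try the next one.
            errors.append((flavor, exc))
    # Every flavor failed, so report all collected errors at once.
    raise ValueError("all parser flavors failed: %r" % (errors,))
```

With pandas installed, a call might look like ``read_tables_with_fallback(url, pandas.read_html)``, where ``url`` is whatever input ``read_html`` accepts. Passing the reader as an argument keeps the sketch independent of which parser libraries are actually installed.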
+
+ **Issues with** |BeautifulSoup4|_ **using** |lxml|_ **as a backend**
+
+ * The above issues hold here as well since |BeautifulSoup4|_ is essentially
+   just a wrapper around a parser backend.
+
+ **Issues with** |BeautifulSoup4|_ **using** |html5lib|_ **as a backend**
+
+ * Benefits
+
+   * |html5lib|_ is far more lenient than |lxml|_ and consequently deals
+     with *real-life markup* in a much saner way rather than just, e.g.,
+     dropping an element without notifying you.
+
+   * |html5lib|_ *generates valid HTML5 markup from invalid markup
+     automatically*. This is extremely important for parsing HTML tables,
+     since it guarantees a valid document. However, that does NOT mean that
+     it is "correct", since the process of fixing markup does not have a
+     single definition.
+
+   * |html5lib|_ is pure Python and requires no additional build steps beyond
+     its own installation.
+
+ * Drawbacks
+
+   * The biggest drawback to using |html5lib|_ is that it is slow as
+     molasses. However, consider the fact that many tables on the web are not
+     big enough for the parsing algorithm runtime to matter. It is more
+     likely that the bottleneck will be in the process of reading the raw
+     text from the URL over the web, i.e., IO (input-output). For very large
+     tables, this might not be true.
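Whether IO or parsing dominates for a given page can be checked by timing the two steps separately. This is a rough sketch under stated assumptions: ``time_fetch_and_parse`` is a hypothetical helper, and ``fetch`` and ``parse`` stand in for whatever download and parsing callables you actually use.

```python
import time


def time_fetch_and_parse(fetch, parse):
    """Time the download step and the parse step separately.

    `fetch` takes no arguments and returns the raw HTML text;
    `parse` takes that text and returns the parsed result.
    Returns (fetch_seconds, parse_seconds, result).
    """
    start = time.perf_counter()
    raw = fetch()
    fetched = time.perf_counter()
    result = parse(raw)
    parsed = time.perf_counter()
    return fetched - start, parsed - fetched, result
```

If the first number is consistently much larger than the second, switching to a faster parser is unlikely to help much for that page.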
+
+ **Issues with using** |Anaconda|_
+
+ * `Anaconda`_ ships with `lxml`_ version 3.2.0; the following workaround for
+   `Anaconda`_ was successfully used to deal with the versioning issues
+   surrounding `lxml`_ and `BeautifulSoup4`_.
+
+ .. note::
+
+    Unless you have *both*:
+
+    * A strong restriction on the upper bound of the runtime of some code
+      that incorporates :func:`~pandas.io.html.read_html`
+    * Complete knowledge that the HTML you will be parsing will be 100%
+      valid at all times
+
+    then you should install `html5lib`_ and things will work swimmingly
+    without you having to muck around with `conda`. If you want the best of
+    both worlds then install both `html5lib`_ and `lxml`_. If you do install
+    `lxml`_ then you need to perform the following commands to ensure that
+    lxml will work correctly:
+
+    .. code-block:: sh
+
+       # remove the included version
+       conda remove lxml
+
+       # install the latest version of lxml
+       pip install 'git+git://github.com/lxml/lxml.git'
+
+       # install the latest version of beautifulsoup4
+       pip install 'bzr+lp:beautifulsoup'
+
+    Note that you need `bzr <http://bazaar.canonical.com/en>`_ and `git
+    <http://git-scm.com>`_ installed to perform the last two operations.
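Before and after applying a workaround like this, it can help to confirm which parser versions the environment actually provides. The sketch below is one possible check using only the standard library (Python 3.8 or later); ``installed_parser_versions`` is a hypothetical helper, and the distribution names listed are assumed to be the ones used on PyPI.

```python
from importlib import metadata

# Distribution names as published on PyPI (an assumption of this sketch).
PARSER_DISTRIBUTIONS = ("lxml", "html5lib", "beautifulsoup4")


def installed_parser_versions(names=PARSER_DISTRIBUTIONS):
    """Map each distribution name to its installed version, or None."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            versions[name] = None
    return versions
```

Calling ``installed_parser_versions()`` returns a dictionary in which a ``None`` value marks a parser library that is missing.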
+
+ .. |svm| replace:: **strictly valid markup**
+ .. _svm: http://validator.w3.org/docs/help.html#validation_basics
+
+ .. |html5lib| replace:: **html5lib**
+ .. _html5lib: https://github.com/html5lib/html5lib-python
+
+ .. |BeautifulSoup4| replace:: **BeautifulSoup4**
+ .. _BeautifulSoup4: http://www.crummy.com/software/BeautifulSoup
+
+ .. |lxml| replace:: **lxml**
+ .. _lxml: http://lxml.de
+
+ .. |Anaconda| replace:: **Anaconda**
+ .. _Anaconda: https://store.continuum.io/cshop/anaconda