Commit d38ecc3

DOC: document the various pitfalls of reading html
DOC: change formatting
DOC: more formatting
DOC: add bold substitutions
DOC: fill out bold links and rephrase
DOC: fill out link to gotchas
DOC: add gigantic install.rst warning
DOC: move version note about html5lib to first mention of it
DOC: add same to readme and add boto to install.rst
DOC: add anaconda note
DOC: add note about debian based system installation
DOC: add correct lexer for pygments formatting of code snippets
DOC: move boto up
1 parent 52af030 commit d38ecc3

4 files changed: +193 -15 lines changed

README.rst

+39-8
@@ -93,18 +93,49 @@ Optional dependencies
9393
- openpyxl version 1.6.1 or higher, for writing .xlsx files
9494
- xlrd >= 0.9.0
9595
- Needed for Excel I/O
96-
- Both `html5lib <https://github.com/html5lib/html5lib-python>`__ **and**
97-
`Beautiful Soup 4 <http://www.crummy.com/software/BeautifulSoup>`__: for
98-
reading HTML tables
96+
- `boto <https://pypi.python.org/pypi/boto>`__: necessary for Amazon S3
97+
access.
98+
- One of the following combinations of libraries is needed to use the
99+
top-level :func:`~pandas.io.html.read_html` function:
100+
101+
- `BeautifulSoup4`_ and `html5lib`_ (Any recent version of `html5lib`_ is
102+
okay.)
103+
- `BeautifulSoup4`_ and `lxml`_
104+
- `BeautifulSoup4`_ and `html5lib`_ and `lxml`_
105+
- Only `lxml`_, although see :ref:`HTML reading gotchas <html-gotchas>`
106+
for reasons as to why you should probably **not** take this approach.
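The combinations listed above can be checked before calling ``read_html``. The following is a minimal sketch and not part of pandas: the helper names ``has_module`` and ``usable_parser_setups`` are hypothetical, and it assumes the libraries are importable as ``bs4``, ``html5lib`` and ``lxml``.

```python
from importlib.util import find_spec


def has_module(name):
    """Return True if a top-level module called `name` can be found."""
    try:
        return find_spec(name) is not None
    except (ImportError, ValueError):
        return False


def usable_parser_setups(is_installed=has_module):
    """Hypothetical helper: report which of the listed setups are available.

    `is_installed` is a callable taking a module name; it is a parameter
    only so that the logic can be exercised without the real libraries.
    """
    bs4 = is_installed("bs4")
    html5lib = is_installed("html5lib")
    lxml = is_installed("lxml")

    setups = []
    if bs4 and html5lib:
        setups.append("bs4 + html5lib")
    if bs4 and lxml:
        setups.append("bs4 + lxml")
    if lxml:
        # Works, but see the HTML reading gotchas before relying on it.
        setups.append("lxml only")
    return setups
```

An empty list means no usable parser is present, which is what happens when only BeautifulSoup4 is installed. Having all three libraries shows up as every entry being listed.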
99107

100108
.. warning::
101109

102-
You need to install an older version of Beautiful Soup:
103-
- Version 4.1.3 and 4.0.2 have been confirmed for 64-bit Ubuntu/Debian
104-
- Version 4.0.2 have been confirmed for 32-bit Ubuntu
110+
- If you install `BeautifulSoup4`_ you must install either
111+
`lxml`_ or `html5lib`_ or both.
112+
:func:`~pandas.io.html.read_html` will **not** work with *only*
113+
`BeautifulSoup4`_ installed.
114+
- You are highly encouraged to read :ref:`HTML reading gotchas
115+
<html-gotchas>`. It explains issues surrounding the installation and
116+
usage of the above three libraries.
117+
- You may need to install an older version of `BeautifulSoup4`_:
118+
- Versions 4.2.1, 4.1.3 and 4.0.2 have been confirmed for 64- and
119+
32-bit Ubuntu/Debian
120+
- Additionally, if you're using `Anaconda`_ you should definitely
121+
read :ref:`the gotchas about HTML parsing libraries <html-gotchas>`
105122

106-
- Any recent version of ``html5lib`` is okay.
107-
- `boto <https://pypi.python.org/pypi/boto>`__: necessary for Amazon S3 access.
123+
.. note::
124+
125+
- If you're on a system with ``apt-get`` you can run
126+
127+
.. code-block:: sh
128+
129+
sudo apt-get build-dep python-lxml
130+
131+
to get the necessary dependencies for installation of `lxml`_. This
132+
will prevent further headaches down the line.
133+
134+
135+
.. _html5lib: https://github.com/html5lib/html5lib-python
136+
.. _BeautifulSoup4: http://www.crummy.com/software/BeautifulSoup
137+
.. _lxml: http://lxml.de
138+
.. _Anaconda: https://store.continuum.io/cshop/anaconda
108139

109140

110141
Installation from sources

doc/source/gotchas.rst

+109
@@ -344,3 +344,112 @@ where the data copying occurs.
344344

345345
See `this link <http://stackoverflow.com/questions/13592618/python-pandas-dataframe-thread-safe>`__
346346
for more information.
347+
348+
.. _html-gotchas:
349+
350+
HTML Table Parsing
351+
------------------
352+
There are some versioning issues surrounding the libraries that are used to
353+
parse HTML tables in the top-level pandas io function ``read_html``.
354+
355+
**Issues with** |lxml|_
356+
357+
* Benefits
358+
359+
* |lxml|_ is very fast
360+
361+
* |lxml|_ requires Cython to install correctly.
362+
363+
* Drawbacks
364+
365+
* |lxml|_ does *not* make any guarantees about the results of its parse
366+
*unless* it is given |svm|_.
367+
368+
* In light of the above, we have chosen to allow you, the user, to use the
369+
|lxml|_ backend, but **this backend will use** |html5lib|_ if |lxml|_
370+
fails to parse.
371+
372+
* It is therefore *highly recommended* that you install both
373+
|BeautifulSoup4|_ and |html5lib|_, so that you will still get a valid
374+
result (provided everything else is valid) even if |lxml|_ fails.
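The fallback described above can also be spelled out explicitly. This is a sketch under stated assumptions, not the code pandas uses internally: ``read_tables_with_fallback`` is a hypothetical helper, it assumes ``read_html`` accepts a ``flavor`` keyword with the values ``"lxml"`` and ``"bs4"``, and it assumes that a missing library surfaces as ``ImportError`` and a failed parse as ``ValueError``. The ``reader`` parameter exists only so the sketch can be exercised without pandas installed.

```python
def read_tables_with_fallback(source, flavors=("lxml", "bs4"), reader=None):
    """Try each parser flavor in order and return the first successful result.

    `source` is whatever the reader accepts (for example a URL or a
    file-like object). `reader` defaults to pandas.read_html.
    """
    if reader is None:
        # Imported lazily so this module loads even without pandas.
        import pandas as pd

        reader = pd.read_html

    errors = []
    for flavor in flavors:
        try:
            return reader(source, flavor=flavor)
        except (ImportError, ValueError) as exc:
            # Assumed failure modes: library not installed, or no tables parsed.
            errors.append(f"{flavor}: {exc}")

    raise ValueError("no parser succeeded (" + "; ".join(errors) + ")")
```

Trying the fast parser first and the lenient one second mirrors the recommendation to install both BeautifulSoup4 and html5lib alongside lxml.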
375+
376+
**Issues with** |BeautifulSoup4|_ **using** |lxml|_ **as a backend**
377+
378+
* The above issues hold here as well since |BeautifulSoup4|_ is essentially
379+
just a wrapper around a parser backend.
380+
381+
**Issues with** |BeautifulSoup4|_ **using** |html5lib|_ **as a backend**
382+
383+
* Benefits
384+
385+
* |html5lib|_ is far more lenient than |lxml|_ and consequently deals
386+
with *real-life markup* in a much saner way rather than just, e.g.,
387+
dropping an element without notifying you.
388+
389+
* |html5lib|_ *generates valid HTML5 markup from invalid markup
390+
automatically*. This is extremely important for parsing HTML tables,
391+
since it guarantees a valid document. However, that does NOT mean that
392+
it is "correct", since the process of fixing markup does not have a
393+
single definition.
394+
395+
* |html5lib|_ is pure Python and requires no additional build steps beyond
396+
its own installation.
397+
398+
* Drawbacks
399+
400+
* The biggest drawback to using |html5lib|_ is that it is slow as
401+
molasses. However, consider the fact that many tables on the web are not
402+
big enough for the parsing algorithm runtime to matter. It is more
403+
likely that the bottleneck will be in the process of reading the raw
404+
text from the url over the web, i.e., IO (input-output). For very large
405+
tables, this might not be true.
406+
407+
**Issues with using** |Anaconda|_
408+
409+
* `Anaconda`_ ships with `lxml`_ version 3.2.0; the following workaround for
410+
`Anaconda`_ was successfully used to deal with the versioning issues
411+
surrounding `lxml`_ and `BeautifulSoup4`_.
412+
413+
.. note::
414+
415+
Unless you have *both*:
416+
417+
* A strong restriction on the upper bound of the runtime of some code
418+
that incorporates :func:`~pandas.io.html.read_html`
419+
* Complete knowledge that the HTML you will be parsing will be 100%
420+
valid at all times
421+
422+
then you should install `html5lib`_ and things will work swimmingly
423+
without you having to muck around with `conda`. If you want the best of
424+
both worlds then install both `html5lib`_ and `lxml`_. If you do install
425+
`lxml`_ then you need to perform the following commands to ensure that
426+
lxml will work correctly:
427+
428+
.. code-block:: sh
429+
430+
# remove the included version
431+
conda remove lxml
432+
433+
# install the latest version of lxml
434+
pip install 'git+git://github.com/lxml/lxml.git'
435+
436+
# install the latest version of beautifulsoup4
437+
pip install 'bzr+lp:beautifulsoup'
438+
439+
Note that you need `bzr <http://bazaar.canonical.com/en>`_ and `git
440+
<http://git-scm.com>`_ installed to perform the last two operations.
441+
442+
.. |svm| replace:: **strictly valid markup**
443+
.. _svm: http://validator.w3.org/docs/help.html#validation_basics
444+
445+
.. |html5lib| replace:: **html5lib**
446+
.. _html5lib: https://github.com/html5lib/html5lib-python
447+
448+
.. |BeautifulSoup4| replace:: **BeautifulSoup4**
449+
.. _BeautifulSoup4: http://www.crummy.com/software/BeautifulSoup
450+
451+
.. |lxml| replace:: **lxml**
452+
.. _lxml: http://lxml.de
453+
454+
.. |Anaconda| replace:: **Anaconda**
455+
.. _Anaconda: https://store.continuum.io/cshop/anaconda

doc/source/install.rst

+39-7
@@ -102,17 +102,49 @@ Optional Dependencies
102102
* `openpyxl <http://packages.python.org/openpyxl/>`__, `xlrd/xlwt <http://www.python-excel.org/>`__
103103
* openpyxl version 1.6.1 or higher
104104
* Needed for Excel I/O
105-
* Both `html5lib <https://github.com/html5lib/html5lib-python>`__ **and**
106-
`Beautiful Soup 4 <http://www.crummy.com/software/BeautifulSoup>`__: for
107-
reading HTML tables
105+
* `boto <https://pypi.python.org/pypi/boto>`__: necessary for Amazon S3
106+
access.
107+
* One of the following combinations of libraries is needed to use the
108+
top-level :func:`~pandas.io.html.read_html` function:
109+
110+
* `BeautifulSoup4`_ and `html5lib`_ (Any recent version of `html5lib`_ is
111+
okay.)
112+
* `BeautifulSoup4`_ and `lxml`_
113+
* `BeautifulSoup4`_ and `html5lib`_ and `lxml`_
114+
* Only `lxml`_, although see :ref:`HTML reading gotchas <html-gotchas>`
115+
for reasons as to why you should probably **not** take this approach.
108116

109117
.. warning::
110118

111-
You need to install an older version of Beautiful Soup:
112-
- Version 4.1.3 and 4.0.2 have been confirmed for 64-bit Ubuntu/Debian
113-
- Version 4.0.2 have been confirmed for 32-bit Ubuntu
119+
* If you install `BeautifulSoup4`_ you must install either
120+
`lxml`_ or `html5lib`_ or both.
121+
:func:`~pandas.io.html.read_html` will **not** work with *only*
122+
`BeautifulSoup4`_ installed.
123+
* You are highly encouraged to read :ref:`HTML reading gotchas
124+
<html-gotchas>`. It explains issues surrounding the installation and
125+
usage of the above three libraries.
126+
* You may need to install an older version of `BeautifulSoup4`_:
127+
- Versions 4.2.1, 4.1.3 and 4.0.2 have been confirmed for 64- and
128+
32-bit Ubuntu/Debian
129+
* Additionally, if you're using `Anaconda`_ you should definitely
130+
read :ref:`the gotchas about HTML parsing libraries <html-gotchas>`
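To see which versions are actually installed before deciding whether to downgrade, something like the following can be used. It is only a sketch: ``installed_versions`` is a hypothetical helper, and it assumes Python 3.8 or later (for ``importlib.metadata``) and the PyPI distribution names ``beautifulsoup4``, ``html5lib`` and ``lxml``.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(names=("beautifulsoup4", "html5lib", "lxml")):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # Print one line per library so the result can be compared against
    # the versions mentioned in the warning above.
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```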
114131

115-
* Any recent version of ``html5lib`` is okay.
132+
.. note::
133+
134+
* If you're on a system with ``apt-get`` you can run
135+
136+
.. code-block:: sh
137+
138+
sudo apt-get build-dep python-lxml
139+
140+
to get the necessary dependencies for installation of `lxml`_. This
141+
will prevent further headaches down the line.
142+
143+
144+
.. _html5lib: https://github.com/html5lib/html5lib-python
145+
.. _BeautifulSoup4: http://www.crummy.com/software/BeautifulSoup
146+
.. _lxml: http://lxml.de
147+
.. _Anaconda: https://store.continuum.io/cshop/anaconda
116148

117149
.. note::
118150

doc/source/io.rst

+6
@@ -943,6 +943,12 @@ HTML
943943
Reading HTML Content
944944
~~~~~~~~~~~~~~~~~~~~~~
945945

946+
.. warning::
947+
948+
We **highly encourage** you to read the :ref:`HTML parsing gotchas
949+
<html-gotchas>` regarding the issues surrounding the
950+
BeautifulSoup4/html5lib/lxml parsers.
951+
946952
.. _io.read_html:
947953

948954
.. versionadded:: 0.11.1
