DOC: document bs4/lxml/html5lib issues #3751

Merged
merged 1 commit into from
Jun 4, 2013
47 changes: 39 additions & 8 deletions README.rst
@@ -93,18 +93,49 @@ Optional dependencies
- openpyxl version 1.6.1 or higher, for writing .xlsx files
- xlrd >= 0.9.0
- Needed for Excel I/O
- Both `html5lib <https://github.com/html5lib/html5lib-python>`__ **and**
`Beautiful Soup 4 <http://www.crummy.com/software/BeautifulSoup>`__: for
reading HTML tables
- `boto <https://pypi.python.org/pypi/boto>`__: necessary for Amazon S3
access.
- One of the following combinations of libraries is needed to use the
top-level :func:`~pandas.io.html.read_html` function:

- `BeautifulSoup4`_ and `html5lib`_ (Any recent version of `html5lib`_ is
okay.)
- `BeautifulSoup4`_ and `lxml`_
- `BeautifulSoup4`_ and `html5lib`_ and `lxml`_
- Only `lxml`_, although see :ref:`HTML reading gotchas <html-gotchas>`
for reasons as to why you should probably **not** take this approach.
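The combinations listed above can be checked on a given machine with a short script. This is only a sketch, not part of pandas: the helper names ``installed_parsers`` and ``usable_combinations`` are hypothetical, and it assumes the usual import names (``bs4`` for BeautifulSoup4, ``html5lib``, ``lxml``).

```python
from importlib.util import find_spec

# The combinations from the list above, written as sets of import names.
# Assumption: BeautifulSoup4 is imported as "bs4".
SUPPORTED_COMBINATIONS = [
    {"bs4", "html5lib"},
    {"bs4", "lxml"},
    {"bs4", "html5lib", "lxml"},
    {"lxml"},
]


def installed_parsers(candidates=("bs4", "html5lib", "lxml")):
    """Return the subset of candidate modules that can be imported."""
    return {name for name in candidates if find_spec(name) is not None}


def usable_combinations(installed):
    """Return every supported combination fully covered by `installed`."""
    return [combo for combo in SUPPORTED_COMBINATIONS if combo <= installed]


if __name__ == "__main__":
    found = installed_parsers()
    print("installed:", sorted(found))
    print("usable combinations:", [sorted(c) for c in usable_combinations(found)])
```

An empty result for ``usable_combinations`` corresponds to the case called out in the warning: BeautifulSoup4 alone is not enough.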

.. warning::

You need to install an older version of Beautiful Soup:
- Versions 4.1.3 and 4.0.2 have been confirmed for 64-bit Ubuntu/Debian
- Version 4.0.2 has been confirmed for 32-bit Ubuntu
- If you install `BeautifulSoup4`_ you must install either
`lxml`_ or `html5lib`_ or both.
:func:`~pandas.io.html.read_html` will **not** work with *only*
`BeautifulSoup4`_ installed.
- You are highly encouraged to read :ref:`HTML reading gotchas
<html-gotchas>`. It explains issues surrounding the installation and
usage of the above three libraries.
- You may need to install an older version of `BeautifulSoup4`_:
- Versions 4.2.1, 4.1.3 and 4.0.2 have been confirmed for 64-bit and
32-bit Ubuntu/Debian
- Additionally, if you're using `Anaconda`_ you should definitely
read :ref:`the gotchas about HTML parsing libraries <html-gotchas>`.

- Any recent version of ``html5lib`` is okay.
- `boto <https://pypi.python.org/pypi/boto>`__: necessary for Amazon S3 access.
.. note::

- If you're on a system with ``apt-get`` you can do

.. code-block:: sh

sudo apt-get build-dep python-lxml

to get the necessary dependencies for installation of `lxml`_. This
will prevent further headaches down the line.


.. _html5lib: https://github.com/html5lib/html5lib-python
.. _BeautifulSoup4: http://www.crummy.com/software/BeautifulSoup
.. _lxml: http://lxml.de
.. _Anaconda: https://store.continuum.io/cshop/anaconda


Installation from sources
109 changes: 109 additions & 0 deletions doc/source/gotchas.rst
@@ -344,3 +344,112 @@ where the data copying occurs.

See `this link <http://stackoverflow.com/questions/13592618/python-pandas-dataframe-thread-safe>`__
for more information.

.. _html-gotchas:

HTML Table Parsing
------------------
There are some versioning issues surrounding the libraries that are used to
parse HTML tables in the top-level pandas io function ``read_html``.

**Issues with** |lxml|_

* Benefits

* |lxml|_ is very fast.

* Drawbacks

* |lxml|_ requires Cython to install correctly.

* |lxml|_ does *not* make any guarantees about the results of its parse
*unless* it is given |svm|_.

* In light of the above, we have chosen to allow you, the user, to use the
|lxml|_ backend, but **this backend will use** |html5lib|_ if |lxml|_
fails to parse.

* It is therefore *highly recommended* that you install both
|BeautifulSoup4|_ and |html5lib|_, so that you will still get a valid
result (provided everything else is valid) even if |lxml|_ fails.
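The fallback described above can be pictured as "try the strict parser first, then the lenient one". The sketch below illustrates that pattern only; it is not the pandas implementation, and ``parse_with_fallback``, ``strict_parser`` and ``lenient_parser`` are hypothetical stand-ins rather than real lxml or html5lib calls.

```python
def parse_with_fallback(html, parsers):
    """Try each (name, parse_fn) pair in order.

    Returns (name, result) for the first parser that does not raise.
    Raises ValueError if every parser fails (or none is given).
    """
    errors = []
    for name, parse in parsers:
        try:
            return name, parse(html)
        except Exception as exc:  # a failed parse moves on to the next backend
            errors.append(f"{name}: {exc}")
    raise ValueError("all parsers failed: " + "; ".join(errors))


def strict_parser(html):
    """Hypothetical strict backend: rejects unbalanced <td> tags."""
    if html.count("<td>") != html.count("</td>"):
        raise ValueError("unbalanced <td> tags")
    return "parsed strictly"


def lenient_parser(html):
    """Hypothetical lenient backend: accepts any input."""
    return "parsed leniently"


if __name__ == "__main__":
    backends = [("strict", strict_parser), ("lenient", lenient_parser)]
    print(parse_with_fallback("<td>1</td>", backends))
    print(parse_with_fallback("<td>1", backends))
```

With only the strict backend available, malformed input produces an error instead of a result, which is why installing the lenient backend as well is recommended.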

**Issues with** |BeautifulSoup4|_ **using** |lxml|_ **as a backend**

* The above issues hold here as well since |BeautifulSoup4|_ is essentially
just a wrapper around a parser backend.

**Issues with** |BeautifulSoup4|_ **using** |html5lib|_ **as a backend**

* Benefits

* |html5lib|_ is far more lenient than |lxml|_ and consequently deals
with *real-life markup* in a much saner way rather than just, e.g.,
dropping an element without notifying you.

* |html5lib|_ *generates valid HTML5 markup from invalid markup
automatically*. This is extremely important for parsing HTML tables,
since it guarantees a valid document. However, that does NOT mean that
it is "correct", since the process of fixing markup does not have a
single definition.

* |html5lib|_ is pure Python and requires no additional build steps beyond
its own installation.

* Drawbacks

* The biggest drawback to using |html5lib|_ is that it is slow as
molasses. However, consider the fact that many tables on the web are not
big enough for the parsing algorithm runtime to matter. It is more
likely that the bottleneck will be in the process of reading the raw
text from the URL over the web, i.e., I/O (input-output). For very large
tables, this might not be true.

**Issues with using** |Anaconda|_

* `Anaconda`_ ships with `lxml`_ version 3.2.0; the following workaround for
`Anaconda`_ was successfully used to deal with the versioning issues
surrounding `lxml`_ and `BeautifulSoup4`_.

.. note::

Unless you have *both*:

* A strong restriction on the upper bound of the runtime of some code
that incorporates :func:`~pandas.io.html.read_html`
* Complete knowledge that the HTML you will be parsing will be 100%
valid at all times

then you should install `html5lib`_ and things will work swimmingly
without you having to muck around with ``conda``. If you want the best of
both worlds then install both `html5lib`_ and `lxml`_. If you do install
`lxml`_ then you need to perform the following commands to ensure that
`lxml`_ will work correctly:

.. code-block:: sh

# remove the included version
conda remove lxml

# install the latest version of lxml
pip install 'git+https://github.com/lxml/lxml.git'

# install the latest version of beautifulsoup4
pip install 'bzr+lp:beautifulsoup'

Note that you need `bzr <http://bazaar.canonical.com/en>`_ and `git
<http://git-scm.com>`_ installed to perform the last two operations.

.. |svm| replace:: **strictly valid markup**
.. _svm: http://validator.w3.org/docs/help.html#validation_basics

.. |html5lib| replace:: **html5lib**
.. _html5lib: https://github.com/html5lib/html5lib-python

.. |BeautifulSoup4| replace:: **BeautifulSoup4**
.. _BeautifulSoup4: http://www.crummy.com/software/BeautifulSoup

.. |lxml| replace:: **lxml**
.. _lxml: http://lxml.de

.. |Anaconda| replace:: **Anaconda**
.. _Anaconda: https://store.continuum.io/cshop/anaconda
46 changes: 39 additions & 7 deletions doc/source/install.rst
@@ -102,17 +102,49 @@ Optional Dependencies
* `openpyxl <http://packages.python.org/openpyxl/>`__, `xlrd/xlwt <http://www.python-excel.org/>`__
* openpyxl version 1.6.1 or higher
* Needed for Excel I/O
* Both `html5lib <https://github.com/html5lib/html5lib-python>`__ **and**
`Beautiful Soup 4 <http://www.crummy.com/software/BeautifulSoup>`__: for
reading HTML tables
* `boto <https://pypi.python.org/pypi/boto>`__: necessary for Amazon S3
access.
* One of the following combinations of libraries is needed to use the
top-level :func:`~pandas.io.html.read_html` function:

* `BeautifulSoup4`_ and `html5lib`_ (Any recent version of `html5lib`_ is
okay.)
* `BeautifulSoup4`_ and `lxml`_
* `BeautifulSoup4`_ and `html5lib`_ and `lxml`_
* Only `lxml`_, although see :ref:`HTML reading gotchas <html-gotchas>`
for reasons as to why you should probably **not** take this approach.

.. warning::

You need to install an older version of Beautiful Soup:
- Versions 4.1.3 and 4.0.2 have been confirmed for 64-bit Ubuntu/Debian
- Version 4.0.2 has been confirmed for 32-bit Ubuntu
* If you install `BeautifulSoup4`_ you must install either
`lxml`_ or `html5lib`_ or both.
:func:`~pandas.io.html.read_html` will **not** work with *only*
`BeautifulSoup4`_ installed.
* You are highly encouraged to read :ref:`HTML reading gotchas
<html-gotchas>`. It explains issues surrounding the installation and
usage of the above three libraries.
* You may need to install an older version of `BeautifulSoup4`_:
- Versions 4.2.1, 4.1.3 and 4.0.2 have been confirmed for 64-bit and
32-bit Ubuntu/Debian
* Additionally, if you're using `Anaconda`_ you should definitely
read :ref:`the gotchas about HTML parsing libraries <html-gotchas>`.

* Any recent version of ``html5lib`` is okay.
.. note::

* If you're on a system with ``apt-get`` you can do

.. code-block:: sh

sudo apt-get build-dep python-lxml

to get the necessary dependencies for installation of `lxml`_. This
will prevent further headaches down the line.


.. _html5lib: https://github.com/html5lib/html5lib-python
.. _BeautifulSoup4: http://www.crummy.com/software/BeautifulSoup
.. _lxml: http://lxml.de
.. _Anaconda: https://store.continuum.io/cshop/anaconda

.. note::

6 changes: 6 additions & 0 deletions doc/source/io.rst
@@ -943,6 +943,12 @@ HTML
Reading HTML Content
~~~~~~~~~~~~~~~~~~~~~~

.. warning::

We **highly encourage** you to read the :ref:`HTML parsing gotchas
<html-gotchas>` regarding the issues surrounding the
BeautifulSoup4/html5lib/lxml parsers.

.. _io.read_html:

.. versionadded:: 0.11.1