DOC: document the various pitfalls of reading html

cpcloud · cpcloud · commit d38ecc39d5da · 2013-06-04T16:00:20.000-04:00
DOC: change formatting

DOC: more formatting

DOC: add bold substitutions

DOC: fill out bold links and rephrase

DOC: fill out link to gotchas

DOC: add gigantic install.rst warning

DOC: move version note about html5lib to first mention of it

DOC: add same to readme and add boto to install.rst

DOC: add anaconda note

DOC: add note about debian based system installation

DOC: add correct lexer for pygments formatting of code snippets

DOC: move boto up
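
For reviewers, the library combinations documented in this change can be summarized as a small check. This is only a sketch: `is_installed` and `read_html_parser_status` are hypothetical helpers, not part of pandas, and the returned labels are made up for illustration. It assumes the usual import names `bs4`, `html5lib` and `lxml`, and it only looks at what is installed; it does not call `read_html`.

```python
from importlib.util import find_spec


def is_installed(module_name):
    """Return True if a top-level module can be found (without importing it)."""
    return find_spec(module_name) is not None


def read_html_parser_status(installed=is_installed):
    """Classify the installed parsers per the combinations listed in the docs.

    Hypothetical helper for illustration; the labels are not pandas API.
    """
    has_bs4 = installed("bs4")  # import name of BeautifulSoup4
    has_html5lib = installed("html5lib")
    has_lxml = installed("lxml")

    if has_bs4 and has_html5lib:
        # BeautifulSoup4 + html5lib, with or without lxml: the recommended setup.
        return "recommended"
    if has_lxml:
        # lxml alone, or BeautifulSoup4 + lxml: listed as valid, but there is
        # no html5lib to fall back on if lxml fails to parse.
        return "lxml-without-fallback"
    # BeautifulSoup4 alone (or nothing at all) is not a listed combination.
    return "unusable"


if __name__ == "__main__":
    print(read_html_parser_status())
```

The `installed` parameter exists only so the classification can be exercised without changing the local environment.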
diff --git a/README.rst b/README.rst
@@ -93,18 +93,49 @@ Optional dependencies
      - openpyxl version 1.6.1 or higher, for writing .xlsx files
      - xlrd >= 0.9.0
      - Needed for Excel I/O
-  - Both `html5lib <https://github.com/html5lib/html5lib-python>`__ **and**
-    `Beautiful Soup 4 <http://www.crummy.com/software/BeautifulSoup>`__: for
-    reading HTML tables
+  - `boto <https://pypi.python.org/pypi/boto>`__: necessary for Amazon S3
+    access.
+  - One of the following combinations of libraries is needed to use the
+    top-level :func:`~pandas.io.html.read_html` function:
+
+    - `BeautifulSoup4`_ and `html5lib`_ (Any recent version of `html5lib`_ is
+      okay.)
+    - `BeautifulSoup4`_ and `lxml`_ 
+    - `BeautifulSoup4`_ and `html5lib`_ and `lxml`_ 
+    - Only `lxml`_, although see :ref:`HTML reading gotchas <html-gotchas>`
+      for reasons as to why you should probably **not** take this approach.
 
     .. warning::
 
-       You need to install an older version of Beautiful Soup:
-           - Version 4.1.3 and 4.0.2 have been confirmed for 64-bit Ubuntu/Debian
-           - Version 4.0.2 have been confirmed for 32-bit Ubuntu
+       - If you install `BeautifulSoup4`_ you must install either
+         `lxml`_ or `html5lib`_ or both.
+         :func:`~pandas.io.html.read_html` will **not** work with *only*
+         `BeautifulSoup4`_ installed.
+       - You are highly encouraged to read :ref:`HTML reading gotchas
+         <html-gotchas>`. It explains issues surrounding the installation and
+         usage of the above three libraries.
+       - You may need to install an older version of `BeautifulSoup4`_:
+           - Versions 4.2.1, 4.1.3 and 4.0.2 have been confirmed for 64-bit
+             and 32-bit Ubuntu/Debian
+       - Additionally, if you're using `Anaconda`_ you should definitely
+         read :ref:`the gotchas about HTML parsing libraries <html-gotchas>`.
 
-    - Any recent version of ``html5lib`` is okay.
-  - `boto <https://pypi.python.org/pypi/boto>`__: necessary for Amazon S3 access.
+    .. note::
+
+       - If you're on a system with ``apt-get``, you can run
+
+         .. code-block:: sh
+
+            sudo apt-get build-dep python-lxml
+
+         to get the necessary dependencies for installation of `lxml`_. This
+         will prevent further headaches down the line.
+
+
+.. _html5lib: https://github.com/html5lib/html5lib-python
+.. _BeautifulSoup4: http://www.crummy.com/software/BeautifulSoup
+.. _lxml: http://lxml.de
+.. _Anaconda: https://store.continuum.io/cshop/anaconda
 
 
 Installation from sources
diff --git a/doc/source/gotchas.rst b/doc/source/gotchas.rst
@@ -344,3 +344,112 @@ where the data copying occurs.
 
 See `this link <http://stackoverflow.com/questions/13592618/python-pandas-dataframe-thread-safe>`__
 for more information.
+
+.. _html-gotchas:
+
+HTML Table Parsing
+------------------
+There are some versioning issues surrounding the libraries that are used to
+parse HTML tables in the top-level pandas io function ``read_html``.
+
+**Issues with** |lxml|_
+
+   * Benefits
+
+     * |lxml|_ is very fast
+
+   * Drawbacks
+
+     * |lxml|_ requires Cython to install correctly.
+
+     * |lxml|_ does *not* make any guarantees about the results of its parse
+       *unless* it is given |svm|_.
+
+     * In light of the above, we have chosen to allow you, the user, to use the
+       |lxml|_ backend, but **this backend will use** |html5lib|_ if |lxml|_
+       fails to parse.
+
+     * It is therefore *highly recommended* that you install both
+       |BeautifulSoup4|_ and |html5lib|_, so that you will still get a valid
+       result (provided everything else is valid) even if |lxml|_ fails.
+
+**Issues with** |BeautifulSoup4|_ **using** |lxml|_ **as a backend**
+
+   * The above issues hold here as well since |BeautifulSoup4|_ is essentially
+     just a wrapper around a parser backend.
+
+**Issues with** |BeautifulSoup4|_ **using** |html5lib|_ **as a backend**
+
+   * Benefits
+
+     * |html5lib|_ is far more lenient than |lxml|_ and consequently deals
+       with *real-life markup* in a much saner way rather than just, e.g.,
+       dropping an element without notifying you.
+
+     * |html5lib|_ *generates valid HTML5 markup from invalid markup
+       automatically*. This is extremely important for parsing HTML tables,
+       since it guarantees a valid document. However, that does NOT mean that
+       it is "correct", since the process of fixing markup does not have a
+       single definition.
+
+     * |html5lib|_ is pure Python and requires no additional build steps beyond
+       its own installation.
+
+   * Drawbacks
+
+     * The biggest drawback to using |html5lib|_ is that it is slow as
+       molasses. However, consider the fact that many tables on the web are
+       not big enough for the parsing algorithm runtime to matter. It is more
+       likely that the bottleneck will be in the process of reading the raw
+       text from the URL over the web, i.e., IO (input-output). For very
+       large tables, this might not be true.
+
+**Issues with using** |Anaconda|_
+
+   * `Anaconda`_ ships with `lxml`_ version 3.2.0; the following workaround for
+     `Anaconda`_ was successfully used to deal with the versioning issues
+     surrounding `lxml`_ and `BeautifulSoup4`_.
+
+   .. note::
+
+      Unless you have *both*:
+
+         * A strong restriction on the upper bound of the runtime of some code
+           that incorporates :func:`~pandas.io.html.read_html`
+         * Complete knowledge that the HTML you will be parsing will be 100%
+           valid at all times
+
+      then you should install `html5lib`_ and things will work swimmingly
+      without you having to muck around with ``conda``. If you want the best of
+      both worlds then install both `html5lib`_ and `lxml`_. If you do install
+      `lxml`_ then you need to perform the following commands to ensure that
+      lxml will work correctly:
+
+      .. code-block:: sh
+         
+         # remove the included version
+         conda remove lxml
+
+         # install the latest version of lxml
+         pip install 'git+git://github.com/lxml/lxml.git'
+
+         # install the latest version of beautifulsoup4
+         pip install 'bzr+lp:beautifulsoup'
+
+      Note that you need `bzr <http://bazaar.canonical.com/en>`_ and `git
+      <http://git-scm.com>`_ installed to perform the last two operations.
+
+.. |svm| replace:: **strictly valid markup**
+.. _svm: http://validator.w3.org/docs/help.html#validation_basics
+
+.. |html5lib| replace:: **html5lib**
+.. _html5lib: https://github.com/html5lib/html5lib-python
+
+.. |BeautifulSoup4| replace:: **BeautifulSoup4**
+.. _BeautifulSoup4: http://www.crummy.com/software/BeautifulSoup
+
+.. |lxml| replace:: **lxml**
+.. _lxml: http://lxml.de
+
+.. |Anaconda| replace:: **Anaconda**
+.. _Anaconda: https://store.continuum.io/cshop/anaconda
diff --git a/doc/source/install.rst b/doc/source/install.rst
@@ -102,17 +102,49 @@ Optional Dependencies
   * `openpyxl <http://packages.python.org/openpyxl/>`__, `xlrd/xlwt <http://www.python-excel.org/>`__
      * openpyxl version 1.6.1 or higher
      * Needed for Excel I/O
-  * Both `html5lib <https://github.com/html5lib/html5lib-python>`__ **and**
-    `Beautiful Soup 4 <http://www.crummy.com/software/BeautifulSoup>`__: for
-    reading HTML tables
+  * `boto <https://pypi.python.org/pypi/boto>`__: necessary for Amazon S3
+    access.
+  * One of the following combinations of libraries is needed to use the
+    top-level :func:`~pandas.io.html.read_html` function:
+
+    * `BeautifulSoup4`_ and `html5lib`_ (Any recent version of `html5lib`_ is
+      okay.)
+    * `BeautifulSoup4`_ and `lxml`_ 
+    * `BeautifulSoup4`_ and `html5lib`_ and `lxml`_ 
+    * Only `lxml`_, although see :ref:`HTML reading gotchas <html-gotchas>`
+      for reasons as to why you should probably **not** take this approach.
 
     .. warning::
 
-       You need to install an older version of Beautiful Soup:
-           - Version 4.1.3 and 4.0.2 have been confirmed for 64-bit Ubuntu/Debian
-           - Version 4.0.2 have been confirmed for 32-bit Ubuntu
+       * If you install `BeautifulSoup4`_ you must install either
+         `lxml`_ or `html5lib`_ or both.
+         :func:`~pandas.io.html.read_html` will **not** work with *only*
+         `BeautifulSoup4`_ installed.
+       * You are highly encouraged to read :ref:`HTML reading gotchas
+         <html-gotchas>`. It explains issues surrounding the installation and
+         usage of the above three libraries.
+       * You may need to install an older version of `BeautifulSoup4`_:
+           - Versions 4.2.1, 4.1.3 and 4.0.2 have been confirmed for 64-bit
+             and 32-bit Ubuntu/Debian
+       * Additionally, if you're using `Anaconda`_ you should definitely
+         read :ref:`the gotchas about HTML parsing libraries <html-gotchas>`.
 
-    * Any recent version of ``html5lib`` is okay.
+    .. note::
+
+       * If you're on a system with ``apt-get``, you can run
+
+         .. code-block:: sh
+
+            sudo apt-get build-dep python-lxml
+
+         to get the necessary dependencies for installation of `lxml`_. This
+         will prevent further headaches down the line.
+
+
+.. _html5lib: https://github.com/html5lib/html5lib-python
+.. _BeautifulSoup4: http://www.crummy.com/software/BeautifulSoup
+.. _lxml: http://lxml.de
+.. _Anaconda: https://store.continuum.io/cshop/anaconda
 
 .. note::
 
diff --git a/doc/source/io.rst b/doc/source/io.rst
@@ -943,6 +943,12 @@ HTML
 Reading HTML Content
 ~~~~~~~~~~~~~~~~~~~~~~
 
+.. warning::
+
+   We **highly encourage** you to read the :ref:`HTML parsing gotchas
+   <html-gotchas>` regarding the issues surrounding the
+   BeautifulSoup4/html5lib/lxml parsers.
+
 .. _io.read_html:
 
 .. versionadded:: 0.11.1