From d38ecc39d5da6a9197a080586d5096108ddb008d Mon Sep 17 00:00:00 2001 From: Phillip Cloud Date: Mon, 3 Jun 2013 17:53:53 -0400 Subject: [PATCH] DOC: document the various pitfalls of reading html DOC: change formatting DOC: more formatting DOC: add bold substitutions DOC: fill out bold links and rephrase DOC: fill out link to gotchas DOC: add gigantic install.rst warning DOC: move version note about html5lib to first mention of it DOC: add same to readme and add boto to install.rst DOC: add anaconda note DOC: add note about debian based system installation DOC: add correct lexer for pygments formatting of code snippets DOC: move boto up --- README.rst | 47 +++++++++++++++--- doc/source/gotchas.rst | 109 +++++++++++++++++++++++++++++++++++++++++ doc/source/install.rst | 46 ++++++++++++++--- doc/source/io.rst | 6 +++ 4 files changed, 193 insertions(+), 15 deletions(-) diff --git a/README.rst b/README.rst index a74a155cf8a27..daea702476ebc 100644 --- a/README.rst +++ b/README.rst @@ -93,18 +93,49 @@ Optional dependencies - openpyxl version 1.6.1 or higher, for writing .xlsx files - xlrd >= 0.9.0 - Needed for Excel I/O - - Both `html5lib `__ **and** - `Beautiful Soup 4 `__: for - reading HTML tables + - `boto `__: necessary for Amazon S3 + access. + - One of the following combinations of libraries is needed to use the + top-level :func:`~pandas.io.html.read_html` function: + + - `BeautifulSoup4`_ and `html5lib`_ (Any recent version of `html5lib`_ is + okay.) + - `BeautifulSoup4`_ and `lxml`_ + - `BeautifulSoup4`_ and `html5lib`_ and `lxml`_ + - Only `lxml`_, although see :ref:`HTML reading gotchas ` + for reasons as to why you should probably **not** take this approach. .. warning:: - You need to install an older version of Beautiful Soup: - - Version 4.1.3 and 4.0.2 have been confirmed for 64-bit Ubuntu/Debian - - Version 4.0.2 have been confirmed for 32-bit Ubuntu + - if you install `BeautifulSoup4`_ you must install either + `lxml`_ or `html5lib`_ or both. 
+ :func:`~pandas.io.html.read_html` will **not** work with *only* + `BeautifulSoup4`_ installed. + - You are highly encouraged to read :ref:`HTML reading gotchas + `. It explains issues surrounding the installation and + usage of the above three libraries. + - You may need to install an older version of `BeautifulSoup4`_: + - Versions 4.2.1, 4.1.3 and 4.0.2 have been confirmed for 64-bit and + 32-bit Ubuntu/Debian + - Additionally, if you're using `Anaconda`_ you should definitely + read :ref:`the gotchas about HTML parsing libraries ` - - Any recent version of ``html5lib`` is okay. - - `boto `__: necessary for Amazon S3 access. + .. note:: + + - If you're on a system with ``apt-get`` you can run + + .. code-block:: sh + + sudo apt-get build-dep python-lxml + + to get the necessary dependencies for installation of `lxml`_. This + will prevent further headaches down the line. + + +.. _html5lib: https://github.com/html5lib/html5lib-python +.. _BeautifulSoup4: http://www.crummy.com/software/BeautifulSoup +.. _lxml: http://lxml.de +.. _Anaconda: https://store.continuum.io/cshop/anaconda Installation from sources diff --git a/doc/source/gotchas.rst b/doc/source/gotchas.rst index 7b184f6d5043f..422e3cec59386 100644 --- a/doc/source/gotchas.rst +++ b/doc/source/gotchas.rst @@ -344,3 +344,112 @@ where the data copying occurs. See `this link `__ for more information. + +.. _html-gotchas: + +HTML Table Parsing +------------------ +There are some versioning issues surrounding the libraries that are used to +parse HTML tables in the top-level pandas io function ``read_html``. + +**Issues with** |lxml|_ + + * Benefits + + * |lxml|_ is very fast. + + * |lxml|_ requires Cython to install correctly. + + * Drawbacks + + * |lxml|_ does *not* make any guarantees about the results of its parse + *unless* it is given |svm|_. 
+ + * In light of the above, we have chosen to allow you, the user, to use the + |lxml|_ backend, but **this backend will fall back to** |html5lib|_ if |lxml|_ + fails to parse. + + * It is therefore *highly recommended* that you install both + |BeautifulSoup4|_ and |html5lib|_, so that you will still get a valid + result (provided everything else is valid) even if |lxml|_ fails. + +**Issues with** |BeautifulSoup4|_ **using** |lxml|_ **as a backend** + + * The above issues hold here as well since |BeautifulSoup4|_ is essentially + just a wrapper around a parser backend. + +**Issues with** |BeautifulSoup4|_ **using** |html5lib|_ **as a backend** + + * Benefits + + * |html5lib|_ is far more lenient than |lxml|_ and consequently deals + with *real-life markup* in a much saner way rather than just, e.g., + dropping an element without notifying you. + + * |html5lib|_ *generates valid HTML5 markup from invalid markup + automatically*. This is extremely important for parsing HTML tables, + since it guarantees a valid document. However, that does NOT mean that + it is "correct", since the process of fixing markup does not have a + single definition. + + * |html5lib|_ is pure Python and requires no additional build steps beyond + its own installation. + + * Drawbacks + + * The biggest drawback to using |html5lib|_ is that it is slow as + molasses. However, consider the fact that many tables on the web are not + big enough for the parsing algorithm runtime to matter. It is more + likely that the bottleneck will be in the process of reading the raw + text from the URL over the web, i.e., IO (input-output). For very large + tables, this might not be true. + +**Issues with using** |Anaconda|_ + + * `Anaconda`_ ships with `lxml`_ version 3.2.0; the following workaround for + `Anaconda`_ was successfully used to deal with the versioning issues + surrounding `lxml`_ and `BeautifulSoup4`_. + + .. 
note:: + + Unless you have *both*: + + * A strong restriction on the upper bound of the runtime of some code + that incorporates :func:`~pandas.io.html.read_html` + * Complete knowledge that the HTML you will be parsing will be 100% + valid at all times + + then you should install `html5lib`_ and things will work swimmingly + without you having to muck around with `conda`. If you want the best of + both worlds then install both `html5lib`_ and `lxml`_. If you do install + `lxml`_ then you need to perform the following commands to ensure that + lxml will work correctly: + + .. code-block:: sh + + # remove the included version + conda remove lxml + + # install the latest version of lxml + pip install 'git+git://github.com/lxml/lxml.git' + + # install the latest version of beautifulsoup4 + pip install 'bzr+lp:beautifulsoup' + + Note that you need `bzr `_ and `git + `_ installed to perform the last two operations. + +.. |svm| replace:: **strictly valid markup** +.. _svm: http://validator.w3.org/docs/help.html#validation_basics + +.. |html5lib| replace:: **html5lib** +.. _html5lib: https://github.com/html5lib/html5lib-python + +.. |BeautifulSoup4| replace:: **BeautifulSoup4** +.. _BeautifulSoup4: http://www.crummy.com/software/BeautifulSoup + +.. |lxml| replace:: **lxml** +.. _lxml: http://lxml.de + +.. |Anaconda| replace:: **Anaconda** +.. _Anaconda: https://store.continuum.io/cshop/anaconda diff --git a/doc/source/install.rst b/doc/source/install.rst index 005e213fe24de..6868969c1b968 100644 --- a/doc/source/install.rst +++ b/doc/source/install.rst @@ -102,17 +102,49 @@ Optional Dependencies * `openpyxl `__, `xlrd/xlwt `__ * openpyxl version 1.6.1 or higher * Needed for Excel I/O - * Both `html5lib `__ **and** - `Beautiful Soup 4 `__: for - reading HTML tables + * `boto `__: necessary for Amazon S3 + access. 
+ * One of the following combinations of libraries is needed to use the + top-level :func:`~pandas.io.html.read_html` function: + + * `BeautifulSoup4`_ and `html5lib`_ (Any recent version of `html5lib`_ is + okay.) + * `BeautifulSoup4`_ and `lxml`_ + * `BeautifulSoup4`_ and `html5lib`_ and `lxml`_ + * Only `lxml`_, although see :ref:`HTML reading gotchas ` + for reasons why you should probably **not** take this approach. .. warning:: - You need to install an older version of Beautiful Soup: - - Version 4.1.3 and 4.0.2 have been confirmed for 64-bit Ubuntu/Debian - - Version 4.0.2 have been confirmed for 32-bit Ubuntu + * If you install `BeautifulSoup4`_ you must install either + `lxml`_ or `html5lib`_ or both. + :func:`~pandas.io.html.read_html` will **not** work with *only* + `BeautifulSoup4`_ installed. + * You are highly encouraged to read :ref:`HTML reading gotchas + `. It explains issues surrounding the installation and + usage of the above three libraries. + * You may need to install an older version of `BeautifulSoup4`_: + - Versions 4.2.1, 4.1.3 and 4.0.2 have been confirmed for 64-bit and + 32-bit Ubuntu/Debian + * Additionally, if you're using `Anaconda`_ you should definitely + read :ref:`the gotchas about HTML parsing libraries ` - * Any recent version of ``html5lib`` is okay. + .. note:: + + * If you're on a system with ``apt-get`` you can run + + .. code-block:: sh + + sudo apt-get build-dep python-lxml + + to get the necessary dependencies for installation of `lxml`_. This + will prevent further headaches down the line. + + +.. _html5lib: https://github.com/html5lib/html5lib-python +.. _BeautifulSoup4: http://www.crummy.com/software/BeautifulSoup +.. _lxml: http://lxml.de +.. _Anaconda: https://store.continuum.io/cshop/anaconda .. note:: diff --git a/doc/source/io.rst b/doc/source/io.rst index 27d3e21fea2c4..802ab08e85932 100644 --- a/doc/source/io.rst +++ b/doc/source/io.rst @@ -943,6 +943,12 @@ HTML Reading HTML Content ~~~~~~~~~~~~~~~~~~~~~~ +.. 
warning:: + + We **highly encourage** you to read the :ref:`HTML parsing gotchas + ` regarding the issues surrounding the + BeautifulSoup4/html5lib/lxml parsers. + .. _io.read_html: .. versionadded:: 0.11.1