Commit 53a6970

Move HTML libraries gotchas to html io docs
1 parent 7185dd4 commit 53a6970

3 files changed: +81 -76 lines changed
doc/source/gotchas.rst

-72
@@ -308,78 +308,6 @@ where the data copying occurs.
 See `this link <http://stackoverflow.com/questions/13592618/python-pandas-dataframe-thread-safe>`__
 for more information.
 
-.. _gotchas.html:
-
-HTML Table Parsing
-------------------
-There are some versioning issues surrounding the libraries that are used to
-parse HTML tables in the top-level pandas io function ``read_html``.
-
-**Issues with** |lxml|_
-
-   * Benefits
-
-     * |lxml|_ is very fast
-
-     * |lxml|_ requires Cython to install correctly.
-
-   * Drawbacks
-
-     * |lxml|_ does *not* make any guarantees about the results of its parse
-       *unless* it is given |svm|_.
-
-     * In light of the above, we have chosen to allow you, the user, to use the
-       |lxml|_ backend, but **this backend will use** |html5lib|_ if |lxml|_
-       fails to parse
-
-     * It is therefore *highly recommended* that you install both
-       |BeautifulSoup4|_ and |html5lib|_, so that you will still get a valid
-       result (provided everything else is valid) even if |lxml|_ fails.
-
-**Issues with** |BeautifulSoup4|_ **using** |lxml|_ **as a backend**
-
-   * The above issues hold here as well since |BeautifulSoup4|_ is essentially
-     just a wrapper around a parser backend.
-
-**Issues with** |BeautifulSoup4|_ **using** |html5lib|_ **as a backend**
-
-   * Benefits
-
-     * |html5lib|_ is far more lenient than |lxml|_ and consequently deals
-       with *real-life markup* in a much saner way rather than just, e.g.,
-       dropping an element without notifying you.
-
-     * |html5lib|_ *generates valid HTML5 markup from invalid markup
-       automatically*. This is extremely important for parsing HTML tables,
-       since it guarantees a valid document. However, that does NOT mean that
-       it is "correct", since the process of fixing markup does not have a
-       single definition.
-
-     * |html5lib|_ is pure Python and requires no additional build steps beyond
-       its own installation.
-
-   * Drawbacks
-
-     * The biggest drawback to using |html5lib|_ is that it is slow as
-       molasses. However consider the fact that many tables on the web are not
-       big enough for the parsing algorithm runtime to matter. It is more
-       likely that the bottleneck will be in the process of reading the raw
-       text from the URL over the web, i.e., IO (input-output). For very large
-       tables, this might not be true.
-
-
-.. |svm| replace:: **strictly valid markup**
-.. _svm: http://validator.w3.org/docs/help.html#validation_basics
-
-.. |html5lib| replace:: **html5lib**
-.. _html5lib: https://github.com/html5lib/html5lib-python
-
-.. |BeautifulSoup4| replace:: **BeautifulSoup4**
-.. _BeautifulSoup4: http://www.crummy.com/software/BeautifulSoup
-
-.. |lxml| replace:: **lxml**
-.. _lxml: http://lxml.de
-
 
 Byte-Ordering Issues
 --------------------

doc/source/install.rst

+2 -2
@@ -285,7 +285,7 @@ Optional Dependencies
     okay.)
   * `BeautifulSoup4`_ and `lxml`_
   * `BeautifulSoup4`_ and `html5lib`_ and `lxml`_
-  * Only `lxml`_, although see :ref:`HTML Table Parsing <faq.html>`
+  * Only `lxml`_, although see :ref:`HTML Table Parsing <gotchas.html>`
     for reasons as to why you should probably **not** take this approach.
 
   .. warning::
@@ -294,7 +294,7 @@ Optional Dependencies
        `lxml`_ or `html5lib`_ or both.
        :func:`~pandas.read_html` will **not** work with *only*
        `BeautifulSoup4`_ installed.
-     * You are highly encouraged to read :ref:`HTML Table Parsing gotchas <gotchas.html>`.
+     * You are highly encouraged to read :ref:`HTML Table Parsing gotchas <io.html.gotchas>`.
        It explains issues surrounding the installation and
        usage of the above three libraries.
      * You may need to install an older version of `BeautifulSoup4`_:

doc/source/io.rst

+79 -2
@@ -2043,8 +2043,8 @@ Reading HTML Content
 
 .. warning::
 
-   We **highly encourage** you to read the :ref:`HTML Table Parsing gotchas<gotchas.html>`
-   regarding the issues surrounding the BeautifulSoup4/html5lib/lxml parsers.
+   We **highly encourage** you to read the :ref:`HTML Table Parsing gotchas<io.html.gotchas>`
+   below regarding the issues surrounding the BeautifulSoup4/html5lib/lxml parsers.
 
 .. versionadded:: 0.12.0
 
@@ -2346,6 +2346,83 @@ Not escaped:
 Some browsers may not show a difference in the rendering of the previous two
 HTML tables.
 
+
+.. _io.html.gotchas:
+
+HTML Table Parsing Gotchas
+''''''''''''''''''''''''''
+
+There are some versioning issues surrounding the libraries that are used to
+parse HTML tables in the top-level pandas io function ``read_html``.
+
+**Issues with** |lxml|_
+
+   * Benefits
+
+     * |lxml|_ is very fast
+
+     * |lxml|_ requires Cython to install correctly.
+
+   * Drawbacks
+
+     * |lxml|_ does *not* make any guarantees about the results of its parse
+       *unless* it is given |svm|_.
+
+     * In light of the above, we have chosen to allow you, the user, to use the
+       |lxml|_ backend, but **this backend will use** |html5lib|_ if |lxml|_
+       fails to parse
+
+     * It is therefore *highly recommended* that you install both
+       |BeautifulSoup4|_ and |html5lib|_, so that you will still get a valid
+       result (provided everything else is valid) even if |lxml|_ fails.
+
+**Issues with** |BeautifulSoup4|_ **using** |lxml|_ **as a backend**
+
+   * The above issues hold here as well since |BeautifulSoup4|_ is essentially
+     just a wrapper around a parser backend.
+
+**Issues with** |BeautifulSoup4|_ **using** |html5lib|_ **as a backend**
+
+   * Benefits
+
+     * |html5lib|_ is far more lenient than |lxml|_ and consequently deals
+       with *real-life markup* in a much saner way rather than just, e.g.,
+       dropping an element without notifying you.
+
+     * |html5lib|_ *generates valid HTML5 markup from invalid markup
+       automatically*. This is extremely important for parsing HTML tables,
+       since it guarantees a valid document. However, that does NOT mean that
+       it is "correct", since the process of fixing markup does not have a
+       single definition.
+
+     * |html5lib|_ is pure Python and requires no additional build steps beyond
+       its own installation.
+
+   * Drawbacks
+
+     * The biggest drawback to using |html5lib|_ is that it is slow as
+       molasses. However consider the fact that many tables on the web are not
+       big enough for the parsing algorithm runtime to matter. It is more
+       likely that the bottleneck will be in the process of reading the raw
+       text from the URL over the web, i.e., IO (input-output). For very large
+       tables, this might not be true.
+
+
+.. |svm| replace:: **strictly valid markup**
+.. _svm: http://validator.w3.org/docs/help.html#validation_basics
+
+.. |html5lib| replace:: **html5lib**
+.. _html5lib: https://github.com/html5lib/html5lib-python
+
+.. |BeautifulSoup4| replace:: **BeautifulSoup4**
+.. _BeautifulSoup4: http://www.crummy.com/software/BeautifulSoup
+
+.. |lxml| replace:: **lxml**
+.. _lxml: http://lxml.de
+
+
+
+
 .. _io.excel:
 
 Excel files
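
The parser combinations described in the moved section and in the ``install.rst`` hunk can be checked locally before calling ``read_html``. The following is only a minimal sketch and is not part of this commit: the helper names ``installed_html_parsers``, ``can_parse_html`` and ``has_lenient_fallback`` are hypothetical, and the rules they encode are taken solely from the text in the diff (lxml alone is enough to parse, BeautifulSoup4 alone is not, and BeautifulSoup4 together with html5lib is the recommended lenient pair).

```python
from importlib.util import find_spec


def installed_html_parsers():
    """Report which of lxml, BeautifulSoup4 (module ``bs4``) and html5lib are importable."""
    return {name: find_spec(name) is not None for name in ("lxml", "bs4", "html5lib")}


def can_parse_html(installed):
    """Return True for the combinations the docs list as workable.

    Assumption drawn from the diff text: lxml works on its own, while
    BeautifulSoup4 needs a parser backend (html5lib here, or lxml, which
    is already covered by the first condition).
    """
    has_bs4_html5lib = installed["bs4"] and installed["html5lib"]
    return installed["lxml"] or has_bs4_html5lib


def has_lenient_fallback(installed):
    """Return True if the recommended BeautifulSoup4 + html5lib pair is present."""
    return installed["bs4"] and installed["html5lib"]


if __name__ == "__main__":
    found = installed_html_parsers()
    print("installed:", found)
    print("read_html expected to work:", can_parse_html(found))
    print("lenient fallback available:", has_lenient_fallback(found))
```

Running the script on a given environment prints which libraries were found and whether the "highly recommended" fallback pair is available. It checks importability only, not library versions.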
