
Commit a8723a4

ENH: read-html fixes
1 parent 381a690 commit a8723a4

10 files changed: +1095, -210 lines

README.rst

+5 -6

@@ -92,12 +92,11 @@ Optional dependencies
 - openpyxl version 1.6.1 or higher, for writing .xlsx files
 - xlrd >= 0.9.0
   - Needed for Excel I/O
-- `lxml <http://lxml.de>`__, or `Beautiful Soup 4 <http://www.crummy.com/software/BeautifulSoup>`__: for reading HTML tables
-  - The differences between lxml and Beautiful Soup 4 are mostly speed (lxml
-    is faster), however sometimes Beautiful Soup returns what you might
-    intuitively expect. Both backends are implemented, so try them both to
-    see which one you like. They should return very similar results.
-  - Note that lxml requires Cython to build successfully
+- Both `html5lib <https://github.com/html5lib/html5lib-python>`__ **and**
+  `Beautiful Soup 4 <http://www.crummy.com/software/BeautifulSoup>`__: for
+  reading HTML tables
+  - These can both easily be installed by ``pip install html5lib`` and ``pip
+    install beautifulsoup4``.
 - `boto <https://pypi.python.org/pypi/boto>`__: necessary for Amazon S3 access.
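(Not part of the diff, only an illustration of the new dependency set: Beautiful Soup 4 installs under the import name ``bs4`` and html5lib under ``html5lib``, so a quick pre-flight check before calling ``read_html`` might look like this sketch.)

    # Sketch only: verify the optional HTML-parsing backends are present.
    # Both are installed with ``pip install beautifulsoup4 html5lib``.
    try:
        import bs4       # Beautiful Soup 4's import name is "bs4"
        import html5lib
    except ImportError as err:
        raise ImportError("pandas.read_html needs both beautifulsoup4 "
                          "and html5lib: %s" % err)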

RELEASE.rst

+10 -3

@@ -30,8 +30,9 @@ pandas 0.11.1
 
 **New features**
 
-- pd.read_html() can now parse HTML string, files or urls and return dataframes
-  courtesy of @cpcloud. (GH3477_)
+- ``pandas.read_html()`` can now parse HTML strings, files or urls and
+  returns a list of ``DataFrame`` s courtesy of @cpcloud. (GH3477_, GH3605_,
+  GH3606_)
 - Support for reading Amazon S3 files. (GH3504_)
 - Added module for reading and writing Stata files: pandas.io.stata (GH1512_)
 - Added support for writing in ``to_csv`` and reading in ``read_csv``,

@@ -48,7 +49,7 @@ pandas 0.11.1
 **Improvements to existing features**
 
 - Fixed various issues with internal pprinting code, the repr() for various objects
-  including TimeStamp and *Index now produces valid python code strings and
+  including TimeStamp and Index now produces valid python code strings and
   can be used to recreate the object, (GH3038_, GH3379_, GH3251_, GH3460_)
 - ``convert_objects`` now accepts a ``copy`` parameter (defaults to ``True``)
 - ``HDFStore``

@@ -146,6 +147,9 @@ pandas 0.11.1
 - ``sql.write_frame`` failing when writing a single column to sqlite (GH3628_),
   thanks to @stonebig
 - Fix pivoting with ``nan`` in the index (GH3558_)
+- Fix running of bs4 tests when it is not installed (GH3605_)
+- Fix parsing of html table (GH3606_)
+- ``read_html()`` now only allows a single backend: ``html5lib`` (GH3616_)
 
 .. _GH3164: https://github.com/pydata/pandas/issues/3164
 .. _GH2786: https://github.com/pydata/pandas/issues/2786

@@ -209,6 +213,9 @@ pandas 0.11.1
 .. _GH3141: https://github.com/pydata/pandas/issues/3141
 .. _GH3628: https://github.com/pydata/pandas/issues/3628
 .. _GH3638: https://github.com/pydata/pandas/issues/3638
+.. _GH3605: https://github.com/pydata/pandas/issues/3605
+.. _GH3606: https://github.com/pydata/pandas/issues/3606
+.. _Gh3616: https://github.com/pydata/pandas/issues/3616
 
 pandas 0.11.0
 =============
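(To make the return-type change above concrete, here is a minimal sketch, not taken from the commit; it assumes pandas 0.11.1 with beautifulsoup4 and html5lib installed, and the table contents are made up. ``read_html`` returns a *list* of ``DataFrame`` objects, one per matched ``<table>``.)

    import pandas as pd

    # Sketch: read_html parses every matching <table> and returns a list.
    html = """
    <table>
      <thead><tr><th>Bank Name</th><th>City</th></tr></thead>
      <tbody><tr><td>First Example Bank</td><td>Springfield</td></tr></tbody>
    </table>
    """
    tables = pd.read_html(html)   # a list of DataFrame objects
    df = tables[0]                # the first (and here, only) table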

doc/source/install.rst

+5 -6

@@ -99,12 +99,11 @@ Optional Dependencies
 * `openpyxl <http://packages.python.org/openpyxl/>`__, `xlrd/xlwt <http://www.python-excel.org/>`__
   * openpyxl version 1.6.1 or higher
   * Needed for Excel I/O
-* `lxml <http://lxml.de>`__, or `Beautiful Soup 4 <http://www.crummy.com/software/BeautifulSoup>`__: for reading HTML tables
-  * The differences between lxml and Beautiful Soup 4 are mostly speed (lxml
-    is faster), however sometimes Beautiful Soup returns what you might
-    intuitively expect. Both backends are implemented, so try them both to
-    see which one you like. They should return very similar results.
-  * Note that lxml requires Cython to build successfully
+* Both `html5lib <https://github.com/html5lib/html5lib-python>`__ **and**
+  `Beautiful Soup 4 <http://www.crummy.com/software/BeautifulSoup>`__: for
+  reading HTML tables
+  * These can both easily be installed by ``pip install html5lib`` and ``pip
+    install beautifulsoup4``.
 
 .. note::

doc/source/io.rst

+4 -4

@@ -918,18 +918,18 @@ which, if set to ``True``, will additionally output the length of the Series.
 HTML
 ----
 
-Reading HTML format
+Reading HTML Content
 ~~~~~~~~~~~~~~~~~~~~~~
 
 .. _io.read_html:
 
 .. versionadded:: 0.11.1
 
-The toplevel :func:`~pandas.io.parsers.read_html` function can accept an HTML string/file/url
-and will parse HTML tables into pandas DataFrames.
+The toplevel :func:`~pandas.io.parsers.read_html` function can accept an HTML
+string/file/url and will parse HTML tables into list of pandas DataFrames.
 
 
-Writing to HTML format
+Writing to HTML files
 ~~~~~~~~~~~~~~~~~~~~~~
 
 .. _io.html:
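(A brief round-trip sketch tying the two renamed subsections together; illustrative only and not part of the docs being edited. It assumes the optional beautifulsoup4 and html5lib dependencies are installed.)

    import pandas as pd

    # "Writing to HTML files": DataFrame.to_html renders an HTML <table>.
    df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})
    html = df.to_html()

    # "Reading HTML Content": read_html parses it back into a list of DataFrames.
    tables = pd.read_html(html)
    roundtripped = tables[0]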

doc/source/v0.11.1.txt

+24 -3

@@ -64,9 +64,27 @@ API changes
 
 Enhancements
 ~~~~~~~~~~~~
-
-- ``pd.read_html()`` can now parse HTML string, files or urls and return dataframes
-  courtesy of @cpcloud. (GH3477_)
+- ``pd.read_html()`` can now parse HTML strings, files or urls and return
+  DataFrames
+  courtesy of @cpcloud. (GH3477_, GH3605_, GH3606_)
+- ``read_html()`` (GH3616_)
+  - now works with only a *single* parser backend, that is:
+    - BeautifulSoup4 + html5lib
+  - does *not* and will never support using the html parsing library
+    included with Python as a parser backend
+  - is a bit smarter about the parent table elements of matched text: if
+    multiple matches are found then only the *unique* parents of the result
+    are returned (uniqueness is determined using ``set``).
+  - no longer tries to guess about what you want to do with empty table cells
+  - argument ``infer_types`` now defaults to ``False``.
+  - now returns DataFrames whose default column index is the elements of
+    ``<thead>`` elements in the HTML soup, if any exist.
+  - considers all ``<th>`` and ``<td>`` elements inside of ``<thead>``
+    elements.
+  - tests are now correctly skipped if the proper libraries are not
+    installed.
+  - tests now include a ground-truth csv file from the FDIC failed bank list
+    data set.
 - ``HDFStore``
 
   - will retain index attributes (freq,tz,name) on recreation (GH3499_)

@@ -203,3 +221,6 @@ on GitHub for a complete list.
 .. _GH1651: https://github.com/pydata/pandas/issues/1651
 .. _GH3141: https://github.com/pydata/pandas/issues/3141
 .. _GH3638: https://github.com/pydata/pandas/issues/3638
+.. _GH3616: https://github.com/pydata/pandas/issues/3616
+.. _GH3605: https://github.com/pydata/pandas/issues/3605
+.. _GH3606: https://github.com/pydata/pandas/issues/3606
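(A rough illustration of the behavioral changes listed in the enhancement notes above; a sketch, not part of the commit. It assumes the 0.11.1 behavior described there, including the ``infer_types`` keyword, and the table contents are made up.)

    import pandas as pd

    html = """
    <table>
      <thead><tr><th>id</th><th>value</th></tr></thead>
      <tbody>
        <tr><td>1</td><td>3.14</td></tr>
        <tr><td>2</td><td></td></tr>
      </tbody>
    </table>
    """

    # Per the notes: the column index comes from the <thead> row, and
    # infer_types now defaults to False, so cells are left as strings.
    raw = pd.read_html(html)[0]

    # Opting back in to type inference is explicit.
    typed = pd.read_html(html, infer_types=True)[0]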
