Commit 6518c79

y-p committed
Merge branch 'cpcloud_read_html'
* cpcloud_read_html:
  DOC: update RELEASE.rst
  ENH: add ability to read html tables directly into DataFrames
2 parents a6fed22 + 702dbf8 commit 6518c79

12 files changed, +7179 -9 lines changed

RELEASE.rst

+3 -1
@@ -30,7 +30,8 @@ pandas 0.11.1
 
 **New features**
 
-  -
+  - pd.read_html() can now parse HTML strings, files, or URLs and return DataFrames,
+    courtesy of @cpcloud. (GH3477_)
 
 **Improvements to existing features**
 
@@ -88,6 +89,7 @@ pandas 0.11.1
 .. _GH3437: https://github.com/pydata/pandas/issues/3437
 .. _GH3455: https://github.com/pydata/pandas/issues/3455
 .. _GH3457: https://github.com/pydata/pandas/issues/3457
+.. _GH3477: https://github.com/pydata/pandas/issues/3477
 .. _GH3461: https://github.com/pydata/pandas/issues/3461
 .. _GH3468: https://github.com/pydata/pandas/issues/3468
 .. _GH3448: https://github.com/pydata/pandas/issues/3448
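
For readers of the release note above, here is a minimal usage sketch of the new function. It assumes a current pandas install with at least one HTML parser backend available; the sample table, its column names, and the `StringIO` wrapper are illustrative assumptions, and the signature at this commit may differ from what is shown.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample document; any HTML containing a <table> would do.
html = """
<table>
  <thead><tr><th>name</th><th>qty</th></tr></thead>
  <tbody>
    <tr><td>apple</td><td>2</td></tr>
    <tr><td>pear</td><td>3</td></tr>
  </tbody>
</table>
"""

# read_html returns a list with one DataFrame per table it finds.
tables = pd.read_html(StringIO(html))
df = tables[0]
print(df)
```

The same call should also accept a file path or a URL in place of the in-memory buffer, per the release note.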

ci/install.sh

+2
@@ -75,6 +75,8 @@ if ( ! $VENV_FILE_AVAILABLE ); then
 pip install $PIP_ARGS xlrd>=0.9.0
 pip install $PIP_ARGS 'http://downloads.sourceforge.net/project/pytseries/scikits.timeseries/0.91.3/scikits.timeseries-0.91.3.tar.gz?r='
 pip install $PIP_ARGS patsy
+pip install $PIP_ARGS lxml
+pip install $PIP_ARGS beautifulsoup4
 
 # fool statsmodels into thinking pandas was already installed
 # so it won't refuse to install itself. We want it in the zipped venv

doc/source/api.rst

+7
@@ -50,6 +50,13 @@ File IO
    read_csv
    ExcelFile.parse
 
+.. currentmodule:: pandas.io.html
+
+.. autosummary::
+   :toctree: generated/
+
+   read_html
+
 HDFStore: PyTables (HDF5)
 ~~~~~~~~~~~~~~~~~~~~~~~~~
 .. currentmodule:: pandas.io.pytables

doc/source/install.rst

+6
@@ -99,6 +99,12 @@ Optional Dependencies
   * `openpyxl <http://packages.python.org/openpyxl/>`__, `xlrd/xlwt <http://www.python-excel.org/>`__
      * openpyxl version 1.6.1 or higher
      * Needed for Excel I/O
+  * `lxml <http://lxml.de>`__ or `Beautiful Soup 4 <http://www.crummy.com/software/BeautifulSoup>`__: for reading HTML tables
+     * The main difference between lxml and Beautiful Soup 4 is speed (lxml
+       is faster); however, Beautiful Soup sometimes returns results closer to
+       what you might intuitively expect. Both backends are implemented, so try
+       both to see which you prefer. They should return very similar results.
+     * Note that building lxml from source may require Cython
 
 .. note::
 
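
Since either backend can serve as the HTML parser, it can help to check which of them is importable before choosing one. The sketch below uses only the standard library. `available_html_backends` is a hypothetical helper, not part of pandas, and the names it checks (`lxml`, `bs4`) are simply the import names of the two packages listed above.

```python
import importlib.util


def available_html_backends():
    """Return the import names of the HTML parser packages that can be found."""
    # "bs4" is the import name of the Beautiful Soup 4 distribution.
    candidates = ("lxml", "bs4")
    return [name for name in candidates
            if importlib.util.find_spec(name) is not None]


print(available_html_backends())
```

An empty list would suggest that neither optional dependency is installed in the current environment.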
doc/source/v0.11.1.txt

+3
@@ -12,9 +12,12 @@ API changes
 
 Enhancements
 ~~~~~~~~~~~~
+  - pd.read_html() can now parse HTML strings, files, or URLs and return DataFrames,
+    courtesy of @cpcloud. (GH3477_)
 
 See the `full release notes
 <https://github.com/pydata/pandas/blob/master/RELEASE.rst>`__ or issue tracker
 on GitHub for a complete list.
 
 .. _GH2437: https://github.com/pydata/pandas/issues/2437
+.. _GH3477: https://github.com/pydata/pandas/issues/3477

pandas/__init__.py

+1
@@ -33,6 +33,7 @@
 read_fwf, to_clipboard, ExcelFile,
 ExcelWriter)
 from pandas.io.pytables import HDFStore, Term, get_store, read_hdf
+from pandas.io.html import read_html
 from pandas.util.testing import debug
 
 from pandas.tools.describe import value_range
