Commit 7413da1

Merge pull request #3704 from cpcloud/read-write-html-docs
DOC: document read_html and to_html
2 parents 6c7e32b + e326b6e commit 7413da1

4 files changed, 253 additions and 14 deletions

doc/source/io.rst

+227 -4
@@ -938,18 +938,241 @@ Reading HTML Content
938938

939939
.. versionadded:: 0.11.1
940940

941-
The toplevel :func:`~pandas.io.parsers.read_html` function can accept an HTML
941+
The toplevel :func:`~pandas.io.html.read_html` function can accept an HTML
942942
string/file/url and will parse HTML tables into a list of pandas DataFrames.
943+
Let's look at a few examples.
944+
945+
Read a URL with no options
946+
947+
.. ipython:: python
948+
949+
url = 'http://www.fdic.gov/bank/individual/failed/banklist.html'
950+
dfs = read_html(url)
951+
dfs
952+
953+
.. note::
954+
955+
``read_html`` returns a ``list`` of ``DataFrame`` objects, even if there is
956+
only a single table contained in the HTML content.
957+
958+
Read a URL and match a table that contains specific text
959+
960+
.. ipython:: python
961+
962+
match = 'Metcalf Bank'
963+
df_list = read_html(url, match=match)
964+
len(df_list)
965+
df_list[0]
966+
967+
Specify a header row (by default ``<th>`` elements are used to form the column
968+
index); if specified, the header row is taken from the data minus the parsed
969+
header elements (``<th>`` elements).
970+
971+
.. ipython:: python
972+
973+
dfs = read_html(url, header=0)
974+
len(dfs)
975+
dfs[0]
976+
977+
Specify an index column
978+
979+
.. ipython:: python
980+
981+
dfs = read_html(url, index_col=0)
982+
len(dfs)
983+
dfs[0]
984+
dfs[0].index.name
985+
986+
Specify a number of rows to skip
987+
988+
.. ipython:: python
989+
990+
dfs = read_html(url, skiprows=0)
991+
len(dfs)
992+
dfs[0]
993+
994+
Specify a number of rows to skip using a list (``xrange`` (Python 2 only) works
995+
as well)
996+
997+
.. ipython:: python
998+
999+
dfs = read_html(url, skiprows=range(2))
1000+
len(dfs)
1001+
dfs[0]
1002+
1003+
Don't infer numeric and date types
1004+
1005+
.. ipython:: python
1006+
1007+
dfs = read_html(url, infer_types=False)
1008+
len(dfs)
1009+
dfs[0]
1010+
1011+
Specify an HTML attribute
1012+
1013+
.. ipython:: python
1014+
1015+
dfs = read_html(url)
1016+
len(dfs)
1017+
dfs[0]
1018+
1019+
Use some combination of the above
1020+
1021+
.. ipython:: python
1022+
1023+
dfs = read_html(url, match='Metcalf Bank', index_col=0)
1024+
len(dfs)
1025+
dfs[0]
1026+
1027+
Read in pandas ``to_html`` output (with some loss of floating point precision)
1028+
1029+
.. ipython:: python
1030+
1031+
df = DataFrame(randn(2, 2))
1032+
s = df.to_html(float_format='{0:.40g}'.format)
1033+
dfin = read_html(s, index_col=0)
1034+
df
1035+
dfin[0]
1036+
df.index
1037+
df.columns
1038+
dfin[0].index
1039+
dfin[0].columns
1040+
np.allclose(df, dfin[0])
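
The `read_html` options documented in this hunk can be tried without network access. The following is a minimal sketch, assuming a recent pandas with an HTML parser such as lxml installed. The HTML snippet, table ids, and cell values are made up for illustration, and the string is wrapped in `StringIO` so that pandas treats it as a file-like object. It also passes `attrs`, which the "Specify an HTML attribute" example above names but does not actually use.

```python
from io import StringIO

import pandas as pd

# Made-up HTML with two tables; ids and contents are illustrative only.
html = """
<table id="banks">
  <thead><tr><th>Bank Name</th><th>City</th></tr></thead>
  <tbody>
    <tr><td>Example Bank</td><td>Springfield</td></tr>
    <tr><td>Sample Trust</td><td>Shelbyville</td></tr>
  </tbody>
</table>
<table id="other">
  <thead><tr><th>Name</th><th>Value</th></tr></thead>
  <tbody><tr><td>x</td><td>1</td></tr></tbody>
</table>
"""

# No options: every table is returned, always as a list.
all_tables = pd.read_html(StringIO(html))

# match: keep only tables that contain the given text.
matched = pd.read_html(StringIO(html), match="Example Bank")

# attrs: keep only tables whose HTML attributes match.
by_id = pd.read_html(StringIO(html), attrs={"id": "other"})

# index_col: use the first column as the row index.
indexed = pd.read_html(StringIO(html), attrs={"id": "banks"}, index_col=0)

print(len(all_tables), len(matched), len(by_id))
print(indexed[0])
```

Even when a single table matches, the result is a one-element list, so the frame is reached with `[0]`.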
9431041
9441042
9451043
Writing to HTML files
9461044
~~~~~~~~~~~~~~~~~~~~~~
9471045

9481046
.. _io.html:
9491047

950-
DataFrame object has an instance method ``to_html`` which renders the contents
951-
of the DataFrame as an html table. The function arguments are as in the method
952-
``to_string`` described above.
1048+
``DataFrame`` objects have an instance method ``to_html`` which renders the
1049+
contents of the ``DataFrame`` as an HTML table. The function arguments are as
1050+
in the method ``to_string`` described above.
1051+
1052+
.. note::
1053+
1054+
Not all of the possible options for ``DataFrame.to_html`` are shown here for
1055+
brevity's sake. See :func:`~pandas.DataFrame.to_html` for the full set of
1056+
options.
1057+
1058+
.. ipython:: python
1059+
:suppress:
1060+
1061+
def write_html(df, filename, *args, **kwargs):
1062+
static = os.path.abspath(os.path.join('source', '_static'))
1063+
with open(os.path.join(static, filename + '.html'), 'w') as f:
1064+
df.to_html(f, *args, **kwargs)
1065+
1066+
.. ipython:: python
1067+
1068+
df = DataFrame(randn(2, 2))
1069+
df
1070+
print df.to_html() # raw html
1071+
1072+
.. ipython:: python
1073+
:suppress:
1074+
1075+
write_html(df, 'basic')
1076+
1077+
HTML:
1078+
1079+
.. raw:: html
1080+
:file: _static/basic.html
1081+
1082+
The ``columns`` argument will limit the columns shown
1083+
1084+
.. ipython:: python
1085+
1086+
print df.to_html(columns=[0])
1087+
1088+
.. ipython:: python
1089+
:suppress:
1090+
1091+
write_html(df, 'columns', columns=[0])
1092+
1093+
HTML:
1094+
1095+
.. raw:: html
1096+
:file: _static/columns.html
1097+
1098+
``float_format`` takes a Python callable to control the precision of floating
1099+
point values
1100+
1101+
.. ipython:: python
1102+
1103+
print df.to_html(float_format='{0:.10f}'.format)
1104+
1105+
.. ipython:: python
1106+
:suppress:
1107+
1108+
write_html(df, 'float_format', float_format='{0:.10f}'.format)
1109+
1110+
HTML:
1111+
1112+
.. raw:: html
1113+
:file: _static/float_format.html
1114+
1115+
``bold_rows`` will make the row labels bold by default, but you can turn that
1116+
off
1117+
1118+
.. ipython:: python
1119+
1120+
print df.to_html(bold_rows=False)
1121+
1122+
.. ipython:: python
1123+
:suppress:
1124+
1125+
write_html(df, 'nobold', bold_rows=False)
1126+
1127+
.. raw:: html
1128+
:file: _static/nobold.html
1129+
1130+
The ``classes`` argument provides the ability to give the resulting HTML
1131+
table CSS classes. Note that these classes are *appended* to the existing
1132+
``'dataframe'`` class.
1133+
1134+
.. ipython:: python
1135+
1136+
print df.to_html(classes=['awesome_table_class', 'even_more_awesome_class'])
1137+
1138+
Finally, the ``escape`` argument allows you to control whether the
1139+
"<", ">" and "&" characters are escaped in the resulting HTML (by default it is
1140+
``True``). To get the HTML without escaped characters, pass ``escape=False``.
1141+
1142+
.. ipython:: python
1143+
1144+
df = DataFrame({'a': list('&<>'), 'b': randn(3)})
1145+
1146+
1147+
.. ipython:: python
1148+
:suppress:
1149+
1150+
write_html(df, 'escape')
1151+
write_html(df, 'noescape', escape=False)
1152+
1153+
Escaped:
1154+
1155+
.. ipython:: python
1156+
1157+
print df.to_html()
1158+
1159+
.. raw:: html
1160+
:file: _static/escape.html
1161+
1162+
Not escaped:
1163+
1164+
.. ipython:: python
1165+
1166+
print df.to_html(escape=False)
1167+
1168+
.. raw:: html
1169+
:file: _static/noescape.html
1170+
1171+
.. note::
1172+
1173+
Some browsers may not show a difference in the rendering of the previous two
1174+
HTML tables.
1175+
9531176

9541177
Clipboard
9551178
---------
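
The `to_html` options described in this file's diff can be checked directly on the returned string. This is a minimal sketch, assuming a recent pandas on Python 3 (so `print` is a function, unlike the Python 2 `print` statements in the diff). The frame contents and variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"a": list("&<>"), "b": [1.0, 2.5, 3.25]})

# classes are appended after the default 'dataframe' class.
with_classes = df.to_html(classes=["awesome_table_class"])

# float_format takes a callable that formats each floating point value.
fixed = df.to_html(float_format="{0:.10f}".format)

# escape=True (the default) converts <, > and & to HTML entities.
escaped = df.to_html()
raw = df.to_html(escape=False)

print(with_classes.splitlines()[0])  # the opening <table> tag
print(raw)
```

With `escape=False` the cell text is written as-is, which is only appropriate when the data is trusted or already contains valid HTML.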

doc/source/missing_data.rst

+1 -1
@@ -357,7 +357,7 @@ Replace the '.' with ``nan`` (str -> str)
357357
:suppress:
358358
359359
from numpy.random import rand, randn
360-
nan = np.nan
360+
from numpy import nan
361361
from pandas import DataFrame
362362
363363
.. ipython:: python

pandas/core/frame.py

+2 -1
@@ -1598,14 +1598,15 @@ def to_html(self, buf=None, columns=None, col_space=None, colSpace=None,
15981598
classes=None, escape=True):
15991599
"""
16001600
to_html-specific options
1601+
16011602
bold_rows : boolean, default True
16021603
Make the row labels bold in the output
16031604
classes : str or list or tuple, default None
16041605
CSS class(es) to apply to the resulting html table
16051606
escape : boolean, default True
16061607
Convert the characters <, >, and & to HTML-safe sequences.
16071608
1608-
Render a DataFrame to an html table.
1609+
Render a DataFrame as an HTML table.
16091610
"""
16101611

16111612
import warnings

pandas/io/html.py

+23 -8
@@ -18,7 +18,7 @@
1818

1919
import numpy as np
2020

21-
from pandas import DataFrame, MultiIndex
21+
from pandas import DataFrame, MultiIndex, Index, Series, isnull
2222
from pandas.io.parsers import _is_url
2323

2424

@@ -398,7 +398,6 @@ def _parse_tables(self, doc, match, attrs):
398398
if not tables:
399399
raise AssertionError("No tables found matching "
400400
"'{0}'".format(match.pattern))
401-
#import ipdb; ipdb.set_trace()
402401
return tables
403402

404403
def _setup_build_doc(self):
@@ -560,6 +559,17 @@ def _parse_raw_tfoot(self, table):
560559
table.xpath(expr)]
561560

562561

562+
def _maybe_convert_index_type(index):
563+
try:
564+
index = index.astype(int)
565+
except (TypeError, ValueError):
566+
if not isinstance(index, MultiIndex):
567+
s = Series(index, name=index.name)
568+
index = Index(s.convert_objects(convert_numeric=True),
569+
name=index.name)
570+
return index
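
The helper added here follows a try-then-fall-back pattern: attempt a strict integer conversion first, and only on failure try a looser numeric conversion. The sketch below shows that pattern on plain lists using only the standard library. It is a simplified stand-in, not the pandas implementation: `maybe_convert_labels` is a hypothetical name, and the fallback here is all-or-nothing, which may differ from how `convert_objects(convert_numeric=True)` treats mixed values.

```python
def maybe_convert_labels(labels):
    """Return labels as ints if all parse, else as floats if all parse,
    else return them unchanged (hypothetical helper for illustration)."""
    for caster in (int, float):
        try:
            return [caster(label) for label in labels]
        except (TypeError, ValueError):
            # This caster rejected at least one label; try the next one.
            continue
    return list(labels)


print(maybe_convert_labels(["0", "1", "2"]))
print(maybe_convert_labels(["0.5", "1.5"]))
print(maybe_convert_labels(["Bank Name", "1"]))
```

Catching both `TypeError` and `ValueError`, as the committed helper does, covers labels that are of the wrong type as well as strings that do not parse.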
571+
572+
563573
def _data_to_frame(data, header, index_col, infer_types, skiprows):
564574
"""Parse a BeautifulSoup table into a DataFrame.
565575
@@ -620,6 +630,12 @@ def _data_to_frame(data, header, index_col, infer_types, skiprows):
620630
raise ValueError('Labels {0} not found when trying to skip'
621631
' rows'.format(it))
622632

633+
# convert to numbers/dates where possible
634+
# must be sequential since dates trump numbers if both args are given
635+
if infer_types:
636+
df = df.convert_objects(convert_numeric=True)
637+
df = df.convert_objects(convert_dates='coerce')
638+
623639
if header is not None:
624640
header_rows = df.iloc[header]
625641

@@ -632,11 +648,6 @@ def _data_to_frame(data, header, index_col, infer_types, skiprows):
632648

633649
df = df.drop(df.index[header])
634650

635-
# convert to numbers/dates where possible
636-
# must be sequential since dates trump numbers if both args are given
637-
if infer_types:
638-
df = df.convert_objects(convert_numeric=True)
639-
640651
if index_col is not None:
641652
cols = df.columns[index_col]
642653

@@ -648,12 +659,16 @@ def _data_to_frame(data, header, index_col, infer_types, skiprows):
648659
# drop by default
649660
df.set_index(cols, inplace=True)
650661
if df.index.nlevels == 1:
651-
if not (df.index.name or df.index.name is None):
662+
if isnull(df.index.name) or not df.index.name:
652663
df.index.name = None
653664
else:
654665
names = [name or None for name in df.index.names]
655666
df.index = MultiIndex.from_tuples(df.index.values, names=names)
656667

668+
if infer_types:
669+
df.index = _maybe_convert_index_type(df.index)
670+
df.columns = _maybe_convert_index_type(df.columns)
671+
657672
return df
658673

659674
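
The index-name check changed in this file from `not (df.index.name or df.index.name is None)` to `isnull(df.index.name) or not df.index.name`. The sketch below compares the two conditions on a few candidate names using only the standard library. `is_missing` is a hypothetical stand-in for `isnull` on scalars, and the reading that the change targets NaN names is an inference from the diff, not something the commit message states.

```python
import math


def is_missing(value):
    """Stand-in for pandas' isnull on a scalar: True for None and float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def old_should_reset(name):
    # Condition removed by the commit.
    return not (name or name is None)


def new_should_reset(name):
    # Condition added by the commit, with is_missing standing in for isnull.
    return is_missing(name) or not name


for name in ("", None, float("nan"), "Bank Name"):
    print(repr(name), old_should_reset(name), new_should_reset(name))
```

NaN is truthy in Python, so the old condition left a NaN index name in place, while the new one resets it to `None`. Resetting a name that is already `None` is harmless.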
