Commit 3ab2fe4

brianhuey authored and mattip committed
ENH: read_html() handles tables with multiple header rows pandas-dev#13434
closes pandas-dev#13434

Author: Brian <[email protected]>
Author: S. Brian Huey <[email protected]>

Closes pandas-dev#15242 from brianhuey/thead-improvement and squashes the following commits:

fc1c80e [S. Brian Huey] Merge branch 'master' into thead-improvement
b54aa0c [Brian] removed duplicate test case
6ae2860 [Brian] updated docstring and io.rst
41fe8cd [Brian] review changes
873ea58 [Brian] switched from range to lrange
cd70225 [Brian] ENH:read_html() handles tables with multiple header rows pandas-dev#13434
1 parent df95cb3 commit 3ab2fe4

4 files changed: +43 -20 lines changed

doc/source/io.rst

+4 -3
@@ -2222,9 +2222,10 @@ Read a URL and match a table that contains specific text
    match = 'Metcalf Bank'
    df_list = pd.read_html(url, match=match)

-Specify a header row (by default ``<th>`` elements are used to form the column
-index); if specified, the header row is taken from the data minus the parsed
-header elements (``<th>`` elements).
+Specify a header row (by default ``<th>`` or ``<td>`` elements located within a
+``<thead>`` are used to form the column index, if multiple rows are contained within
+``<thead>`` then a multiindex is created); if specified, the header row is taken
+from the data minus the parsed header elements (``<th>`` elements).

 .. code-block:: python
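The header-row rule described in the updated paragraph can be sketched with the standard library alone. This is an illustrative approximation, not the pandas implementation: the `TheadRows` class, the `thead_rows` helper, and the `SAMPLE` table are all hypothetical names introduced here. The sketch collects the text of every `<th>` or `<td>` cell inside `<thead>`, one list per row, and drops rows whose cells are all empty, which is the same filter the commit applies.

```python
from html.parser import HTMLParser


class TheadRows(HTMLParser):
    """Collect <th>/<td> cell text, row by row, for rows inside <thead>."""

    def __init__(self):
        super().__init__()
        self.rows = []          # one list of cell strings per non-empty <tr>
        self._in_thead = False
        self._row = None        # cells of the <tr> currently being read
        self._cell = None       # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "thead":
            self._in_thead = True
        elif self._in_thead and tag == "tr":
            self._row = []
        elif self._row is not None and tag in ("th", "td"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            # Collapse runs of whitespace inside the cell text.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            # Same filter as the commit: skip rows whose cells are all empty.
            if any(col != "" for col in self._row):
                self.rows.append(self._row)
            self._row = None
        elif tag == "thead":
            self._in_thead = False


def thead_rows(html):
    """Return the non-empty header rows found in the <thead> of `html`."""
    parser = TheadRows()
    parser.feed(html)
    return parser.rows


# Hypothetical sample table with two real header rows and one empty row.
SAMPLE = """
<table>
  <thead>
    <tr><th></th><th>Age</th><th>Party</th></tr>
    <tr><th>Name</th><th></th><th></th></tr>
    <tr><td></td><td></td><td></td></tr>
  </thead>
  <tbody>
    <tr><td>Hillary</td><td>68</td><td>D</td></tr>
  </tbody>
</table>
"""

header_rows = thead_rows(SAMPLE)
print(header_rows)
```

With two surviving header rows, `read_html()` after this commit would build a two-level column index; with exactly one, the columns stay a flat index.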

doc/source/whatsnew/v0.20.0.txt

+7 -6
@@ -283,7 +283,7 @@ Other Enhancements
 - ``DataFrame`` has gained a ``nunique()`` method to count the distinct values over an axis (:issue:`14336`).
 - ``DataFrame.groupby()`` has gained a ``.nunique()`` method to count the distinct values for all columns within each group (:issue:`14336`, :issue:`15197`).

-- ``pd.read_excel`` now preserves sheet order when using ``sheetname=None`` (:issue:`9930`)
+- ``pd.read_excel()`` now preserves sheet order when using ``sheetname=None`` (:issue:`9930`)
 - Multiple offset aliases with decimal points are now supported (e.g. '0.5min' is parsed as '30s') (:issue:`8419`)
 - ``.isnull()`` and ``.notnull()`` have been added to ``Index`` object to make them more consistent with the ``Series`` API (:issue:`15300`)

@@ -294,8 +294,8 @@ Other Enhancements
 - ``pd.cut`` and ``pd.qcut`` now support datetime64 and timedelta64 dtypes (:issue:`14714`, :issue:`14798`)
 - ``pd.qcut`` has gained the ``duplicates='raise'|'drop'`` option to control whether to raise on duplicated edges (:issue:`7751`)
 - ``Series`` provides a ``to_excel`` method to output Excel files (:issue:`8825`)
-- The ``usecols`` argument in ``pd.read_csv`` now accepts a callable function as a value (:issue:`14154`)
-- The ``skiprows`` argument in ``pd.read_csv`` now accepts a callable function as a value (:issue:`10882`)
+- The ``usecols`` argument in ``pd.read_csv()`` now accepts a callable function as a value (:issue:`14154`)
+- The ``skiprows`` argument in ``pd.read_csv()`` now accepts a callable function as a value (:issue:`10882`)
 - The ``nrows`` and ``chunksize`` arguments in ``pd.read_csv()`` are supported if both are passed (:issue:`6774`, :issue:`15755`)
 - ``pd.DataFrame.plot`` now prints a title above each subplot if ``suplots=True`` and ``title`` is a list of strings (:issue:`14753`)
 - ``pd.Series.interpolate`` now supports timedelta as an index type with ``method='time'`` (:issue:`6424`)
@@ -309,6 +309,7 @@ Other Enhancements
 - ``pandas.tools.hashing`` has gained a ``hash_tuples`` routine, and ``hash_pandas_object`` has gained the ability to hash a ``MultiIndex`` (:issue:`15224`)
 - ``Series/DataFrame.squeeze()`` have gained the ``axis`` parameter. (:issue:`15339`)
 - ``DataFrame.to_excel()`` has a new ``freeze_panes`` parameter to turn on Freeze Panes when exporting to Excel (:issue:`15160`)
+- ``pd.read_html()`` will parse multiple header rows, creating a multiindex header. (:issue:`13434`).
 - HTML table output skips ``colspan`` or ``rowspan`` attribute if equal to 1. (:issue:`15403`)

 - ``pd.TimedeltaIndex`` now has a custom datetick formatter specifically designed for nanosecond level precision (:issue:`8711`)
@@ -813,7 +814,7 @@ Other API Changes
 ^^^^^^^^^^^^^^^^^

 - ``numexpr`` version is now required to be >= 2.4.6 and it will not be used at all if this requisite is not fulfilled (:issue:`15213`).
-- ``CParserError`` has been renamed to ``ParserError`` in ``pd.read_csv`` and will be removed in the future (:issue:`12665`)
+- ``CParserError`` has been renamed to ``ParserError`` in ``pd.read_csv()`` and will be removed in the future (:issue:`12665`)
 - ``SparseArray.cumsum()`` and ``SparseSeries.cumsum()`` will now always return ``SparseArray`` and ``SparseSeries`` respectively (:issue:`12855`)
 - ``DataFrame.applymap()`` with an empty ``DataFrame`` will return a copy of the empty ``DataFrame`` instead of a ``Series`` (:issue:`8222`)
 - ``.loc`` has compat with ``.ix`` for accepting iterators, and NamedTuples (:issue:`15120`)
@@ -926,7 +927,7 @@ Bug Fixes
 - Bug in ``pd.to_numeric()`` in which float and unsigned integer elements were being improperly casted (:issue:`14941`, :issue:`15005`)
 - Cleaned up ``PeriodIndex`` constructor, including raising on floats more consistently (:issue:`13277`)
 - Bug in ``pd.read_csv()`` in which the ``dialect`` parameter was not being verified before processing (:issue:`14898`)
-- Bug in ``pd.read_fwf`` where the skiprows parameter was not being respected during column width inference (:issue:`11256`)
+- Bug in ``pd.read_fwf()`` where the skiprows parameter was not being respected during column width inference (:issue:`11256`)
 - Bug in ``pd.read_csv()`` in which missing data was being improperly handled with ``usecols`` (:issue:`6710`)
 - Bug in ``pd.read_csv()`` in which a file containing a row with many columns followed by rows with fewer columns would cause a crash (:issue:`14125`)
 - Added checks in ``pd.read_csv()`` ensuring that values for ``nrows`` and ``chunksize`` are valid (:issue:`15767`)
@@ -1054,4 +1055,4 @@ Bug Fixes
 - Bug in ``DataFrame.boxplot`` where ``fontsize`` was not applied to the tick labels on both axes (:issue:`15108`)
 - Bug in ``pd.melt()`` where passing a tuple value for ``value_vars`` caused a ``TypeError`` (:issue:`15348`)
 - Bug in ``.eval()`` which caused multiline evals to fail with local variables not on the first line (:issue:`15342`)
-- Bug in ``pd.read_msgpack`` which did not allow to load dataframe with an index of type ``CategoricalIndex`` (:issue:`15487`)
+- Bug in ``pd.read_msgpack()`` which did not allow to load dataframe with an index of type ``CategoricalIndex`` (:issue:`15487`)

pandas/io/html.py

+20 -11
@@ -355,9 +355,12 @@ def _parse_raw_thead(self, table):
         thead = self._parse_thead(table)
         res = []
         if thead:
-            res = lmap(self._text_getter, self._parse_th(thead[0]))
-        return np.atleast_1d(
-            np.array(res).squeeze()) if res and len(res) == 1 else res
+            trs = self._parse_tr(thead[0])
+            for tr in trs:
+                cols = lmap(self._text_getter, self._parse_td(tr))
+                if any([col != '' for col in cols]):
+                    res.append(cols)
+        return res

     def _parse_raw_tfoot(self, table):
         tfoot = self._parse_tfoot(table)
@@ -591,9 +594,17 @@ def _parse_tfoot(self, table):
         return table.xpath('.//tfoot')

     def _parse_raw_thead(self, table):
-        expr = './/thead//th'
-        return [_remove_whitespace(x.text_content()) for x in
-                table.xpath(expr)]
+        expr = './/thead'
+        thead = table.xpath(expr)
+        res = []
+        if thead:
+            trs = self._parse_tr(thead[0])
+            for tr in trs:
+                cols = [_remove_whitespace(x.text_content()) for x in
+                        self._parse_td(tr)]
+                if any([col != '' for col in cols]):
+                    res.append(cols)
+        return res

     def _parse_raw_tfoot(self, table):
         expr = './/tfoot//th|//tfoot//td'
@@ -615,19 +626,17 @@ def _data_to_frame(**kwargs):
     head, body, foot = kwargs.pop('data')
     header = kwargs.pop('header')
     kwargs['skiprows'] = _get_skiprows(kwargs['skiprows'])
-
     if head:
-        body = [head] + body
-
+        rows = lrange(len(head))
+        body = head + body
         if header is None:  # special case when a table has <th> elements
-            header = 0
+            header = 0 if rows == [0] else rows

     if foot:
         body += [foot]

     # fill out elements of body that are "ragged"
     _expand_elements(body)
-
     tp = TextParser(body, header=header, **kwargs)
     df = tp.read()
     return df
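
The header selection in the `_data_to_frame` hunk can be read in isolation. The sketch below is a hypothetical standalone helper, `combine_head_and_body`, that is not part of pandas. It assumes `lrange(n)` behaves like `list(range(n))`, and it leaves out the footer handling and the `TextParser` call. It shows that one parsed header row keeps the old `header=0` behaviour, while several rows are passed as a list of row positions, which is what produces the multiindex.

```python
def combine_head_and_body(head, body, header=None):
    """Hypothetical helper mirroring the header choice in _data_to_frame.

    head: list of header rows (each a list of cell strings), possibly empty
    body: list of data rows
    header: the caller's explicit header argument, or None
    """
    if head:
        rows = list(range(len(head)))   # stands in for lrange(len(head))
        body = head + body              # header rows come first in the data
        if header is None:
            # One header row: plain header=0. Several: a list of positions.
            header = 0 if rows == [0] else rows
    return body, header


single = combine_head_and_body([["Name", "Age"]], [["Bernie", 74]])
double = combine_head_and_body(
    [["", "Age"], ["Name", ""]], [["Bernie", 74]]
)
explicit = combine_head_and_body(
    [["", "Age"], ["Name", ""]], [["Bernie", 74]], header=1
)
no_head = combine_head_and_body([], [["Bernie", 74]])

print(single[1], double[1], explicit[1], no_head[1])
```

An explicit `header` argument from the caller is left untouched, and a table with no parsed header rows keeps `header=None`.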

pandas/tests/io/test_html.py

+12
@@ -760,6 +760,18 @@ def test_keep_default_na(self):
         html_df = read_html(html_data, keep_default_na=True)[0]
         tm.assert_frame_equal(expected_df, html_df)

+    def test_multiple_header_rows(self):
+        # Issue #13434
+        expected_df = DataFrame(data=[("Hillary", 68, "D"),
+                                      ("Bernie", 74, "D"),
+                                      ("Donald", 69, "R")])
+        expected_df.columns = [["Unnamed: 0_level_0", "Age", "Party"],
+                               ["Name", "Unnamed: 1_level_1",
+                                "Unnamed: 2_level_1"]]
+        html = expected_df.to_html(index=False)
+        html_df = read_html(html, )[0]
+        tm.assert_frame_equal(expected_df, html_df)
+

 def _lang_enc(filename):
     return os.path.splitext(os.path.basename(filename))[0].split('_')
