
ENH: pd.read_html argument to extract hrefs along with text from cells #45973


Merged
merged 27 commits on Aug 16, 2022
Changes from 5 commits
Commits (27)
d69ce74
ENH: pd.read_html argument to extract hrefs along with text from cells
abmyii Feb 13, 2022
ac86888
Fix typing error
abmyii Feb 14, 2022
b33dc9e
Simplify tests
abmyii Feb 15, 2022
a13c5f0
Fix still incorrect typing
abmyii Feb 15, 2022
76ebe35
Summarise whatsnew entry and move detailed explanation into user guide
abmyii Feb 17, 2022
cd352e7
More flexible link extraction
abmyii Feb 23, 2022
1de1324
Suggested changes
abmyii Feb 26, 2022
1190ea7
extract_hrefs -> extract_links
abmyii Feb 28, 2022
db8b6db
Move versionadded to correct place and improve docstring for extract_…
abmyii Mar 20, 2022
1c8c891
Test for invalid extract_links value
abmyii Mar 20, 2022
1555fbd
Test all extract_link options
abmyii Apr 2, 2022
0935696
Fix for MultiIndex headers (also fixes tests)
abmyii Apr 25, 2022
afaad1a
Test that text surrounding <a> tag is still captured
abmyii Apr 25, 2022
20e24e9
Test for multiple <a> tags in cell
abmyii Apr 25, 2022
ffdcf8a
Fix all tests, with both MultiIndex -> Index and np.nan -> None conve…
abmyii May 15, 2022
dbd4580
Merge branch 'main' into read_html-extract-hrefs
abmyii Jun 18, 2022
490005a
Add back EOF newline to test_html.py
abmyii Jun 18, 2022
a5ff5c1
Correct user guide example
abmyii Jun 18, 2022
85a183d
Merge branch 'main' into read_html-extract-hrefs
attack68 Jul 29, 2022
58fdb0c
Update pandas/io/html.py
attack68 Jul 29, 2022
c34d8ff
Update pandas/io/html.py
attack68 Jul 29, 2022
7389b84
Update pandas/io/html.py
attack68 Jul 29, 2022
ba7caab
Simplify MultiIndex -> Index conversion
abmyii Jul 30, 2022
4c7f532
Move unnecessary fixtures into test body
abmyii Jul 30, 2022
98a46e2
Simplify statement
abmyii Aug 16, 2022
fd41935
Merge branch 'main' into read_html-extract-hrefs
abmyii Aug 16, 2022
614c636
Fix code checks
abmyii Aug 16, 2022
22 changes: 22 additions & 0 deletions doc/source/user_guide/io.rst
@@ -2729,6 +2729,28 @@ succeeds, the function will return*.

dfs = pd.read_html(url, "Metcalf Bank", index_col=0, flavor=["lxml", "bs4"])

Links can be extracted from cells along with the text using ``extract_hrefs=True``.

.. ipython:: python

html_table = """
<table>
<tr>
<th>GitHub</th>
</tr>
<tr>
        <td><a href="https://github.com/pandas-dev/pandas">pandas</a></td>
</tr>
</table>
"""

    df = pd.read_html(
        html_table,
        extract_hrefs=True
    )[0]
df
df["GitHub"]
df["GitHub"].str[1]

.. _io.html:

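An aside on the behaviour the user-guide example above relies on: with ``extract_hrefs=True``, linked body cells come back as ``(text, href)`` tuples and linkless cells as 1-tuples, so positional ``.str`` indexing splits text from links. A small sketch under that assumption (not part of the diff):

    import pandas as pd

    html_table = """
    <table>
      <tr><th>GitHub</th></tr>
      <tr><td><a href="https://github.com/pandas-dev/pandas">pandas</a></td></tr>
      <tr><td>linkless cell</td></tr>
    </table>
    """

    # extract_hrefs is the keyword name at this point in the branch
    # (a later commit in this PR renames it to extract_links)
    df = pd.read_html(html_table, extract_hrefs=True)[0]
    df["GitHub"].str[0]  # cell text: "pandas", "linkless cell"
    df["GitHub"].str[1]  # href for linked cells, NaN for the linkless cell
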
2 changes: 2 additions & 0 deletions doc/source/whatsnew/v1.5.0.rst
@@ -39,6 +39,8 @@ Other enhancements
- :meth:`.GroupBy.min` and :meth:`.GroupBy.max` now supports `Numba <https://numba.pydata.org/>`_ execution with the ``engine`` keyword (:issue:`45428`)
- Implemented a ``bool``-dtype :class:`Index`, passing a bool-dtype array-like to ``pd.Index`` will now retain ``bool`` dtype instead of casting to ``object`` (:issue:`45061`)
- Implemented a complex-dtype :class:`Index`, passing a complex-dtype array-like to ``pd.Index`` will now retain complex dtype instead of casting to ``object`` (:issue:`45845`)
- :func:`pandas.read_html` now supports extracting hrefs from table cells (:issue:`13141`).


-

58 changes: 53 additions & 5 deletions pandas/io/html.py
@@ -180,6 +180,9 @@ class _HtmlFrameParser:
displayed_only : bool
Whether or not items with "display:none" should be ignored

extract_hrefs : bool, default False
Whether table elements with <a> tags should have the href extracted.

Attributes
----------
io : str or file-like
@@ -198,11 +201,15 @@ class _HtmlFrameParser:
displayed_only : bool
Whether or not items with "display:none" should be ignored

extract_hrefs : bool, default False
Whether table elements with <a> tags should have the href extracted.

Notes
-----
To subclass this class effectively you must override the following methods:
* :func:`_build_doc`
* :func:`_attr_getter`
* :func:`_href_getter`
* :func:`_text_getter`
* :func:`_parse_td`
* :func:`_parse_thead_tr`
@@ -221,12 +228,14 @@ def __init__(
attrs: dict[str, str] | None,
encoding: str,
displayed_only: bool,
extract_hrefs: bool,
):
self.io = io
self.match = match
self.attrs = attrs
self.encoding = encoding
self.displayed_only = displayed_only
self.extract_hrefs = extract_hrefs

def parse_tables(self):
"""
@@ -259,6 +268,22 @@ def _attr_getter(self, obj, attr):
# Both lxml and BeautifulSoup have the same implementation:
return obj.get(attr)

def _href_getter(self, obj):
"""
Return the href from a child <a> of the DOM node, or None if there is no <a>.

Parameters
----------
obj : node-like
A DOM node.

Returns
-------
href : str or None
The href from the <a> child of the DOM node, or None if there is no <a>.
"""
raise AbstractMethodError(self)

def _text_getter(self, obj):
"""
Return the text of an individual DOM node.
@@ -435,20 +460,22 @@ def row_is_all_th(row):
while body_rows and row_is_all_th(body_rows[0]):
header_rows.append(body_rows.pop(0))

header = self._expand_colspan_rowspan(header_rows)
header = self._expand_colspan_rowspan(header_rows, header=True)
body = self._expand_colspan_rowspan(body_rows)
footer = self._expand_colspan_rowspan(footer_rows)

return header, body, footer

def _expand_colspan_rowspan(self, rows):
def _expand_colspan_rowspan(self, rows, header=False):
"""
Given a list of <tr>s, return a list of text rows.

Parameters
----------
rows : list of node-like
List of <tr>s
header : bool, default False
Whether the rows are header rows. Links are not captured for header rows,
since that would result in an undesirable MultiIndex.

Returns
-------
@@ -461,7 +488,10 @@ def _expand_colspan_rowspan(self, rows):
to subsequent cells.
"""
all_texts = [] # list of rows, each a list of str
remainder: list[tuple[int, str, int]] = [] # list of (index, text, nrows)
text: str | tuple
remainder: list[
tuple[int, str | tuple, int]
] = [] # list of (index, text, nrows)

for tr in rows:
texts = [] # the output for this row
@@ -481,6 +511,11 @@

# Append the text from this <td>, colspan times
text = _remove_whitespace(self._text_getter(td))
if not header and self.extract_hrefs:
# All cells will be tuples except for the headers for
# consistency in selection (e.g. using .str indexing)
href = self._href_getter(td)
text = (text, href) if href else (text,)
rowspan = int(self._attr_getter(td, "rowspan") or 1)
colspan = int(self._attr_getter(td, "colspan") or 1)

@@ -585,6 +620,10 @@ def _parse_tables(self, doc, match, attrs):
raise ValueError(f"No tables found matching pattern {repr(match.pattern)}")
return result

def _href_getter(self, obj):
Review comments on this line:

Contributor
can you type the args and returns of all of the added code

Contributor Author (@abmyii, Feb 26, 2022)
I've typed the returns, but won't lxml/bs4 be required to type the args?

Contributor Author (@abmyii)
@attack68 What shall I do about this?

Contributor Author (@abmyii)
@jreback Sorry to bother you, but I haven't been able to come up with a solution for this. Could you please suggest how I should do it?

To elaborate a bit on my first comment: the requirements may not be installed, and in that case typing with the custom types defined in those libraries would fail (as far as I understand), so that doesn't seem like a viable solution.

Contributor Author (@abmyii)
@mroeschke Would you be able to enlighten me regarding this request? I'm still at a loss as to how to approach it.

Member
At the top of the file you can do:

    from typing import TYPE_CHECKING

    if TYPE_CHECKING:
        from bs4/lxml import ...

Then type obj. The CI checks have all the optional dependencies installed, so these checks should be available.

    a = obj.find("a", href=True)
    return None if not a else a["href"]
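
A minimal sketch of the TYPE_CHECKING approach suggested in the review thread above, applied to the BeautifulSoup _href_getter here and the lxml one further down. The standalone-function form, the Tag/HtmlElement annotations, and the helper names are illustrative assumptions, not necessarily the typing the PR ultimately settled on:

    from __future__ import annotations

    from typing import TYPE_CHECKING

    if TYPE_CHECKING:
        # annotation-only imports; bs4/lxml need not be importable at runtime
        from bs4 import Tag
        from lxml.html import HtmlElement

    def _bs4_href_getter(obj: Tag) -> str | None:
        # hypothetical standalone version of the BeautifulSoup parser's method
        a = obj.find("a", href=True)
        return None if not a else a["href"]

    def _lxml_href_getter(obj: HtmlElement) -> str | None:
        # hypothetical standalone version of the lxml parser's method
        href = obj.xpath(".//a/@href")
        return None if not href else href[0]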

def _text_getter(self, obj):
    return obj.text

@@ -670,6 +709,10 @@ class _LxmlFrameParser(_HtmlFrameParser):
:class:`_HtmlFrameParser`.
"""

def _href_getter(self, obj):
    href = obj.xpath(".//a/@href")
    return None if not href else href[0]

def _text_getter(self, obj):
    return obj.text_content()

@@ -906,14 +949,14 @@ def _validate_flavor(flavor):
return flavor


def _parse(flavor, io, match, attrs, encoding, displayed_only, **kwargs):
def _parse(flavor, io, match, attrs, encoding, displayed_only, extract_hrefs, **kwargs):
flavor = _validate_flavor(flavor)
compiled_match = re.compile(match) # you can pass a compiled regex here

retained = None
for flav in flavor:
parser = _parser_dispatch(flav)
p = parser(io, compiled_match, attrs, encoding, displayed_only)
p = parser(io, compiled_match, attrs, encoding, displayed_only, extract_hrefs)

try:
tables = p.parse_tables()
@@ -964,6 +1007,7 @@ def read_html(
na_values=None,
keep_default_na: bool = True,
displayed_only: bool = True,
extract_hrefs: bool = False,
) -> list[DataFrame]:
r"""
Read HTML tables into a ``list`` of ``DataFrame`` objects.
@@ -1058,6 +1102,9 @@
displayed_only : bool, default True
Whether elements with "display: none" should be parsed.

extract_hrefs : bool, default False
Whether table elements with <a> tags should have the href extracted.

Returns
-------
dfs
@@ -1126,4 +1173,5 @@ def read_html(
na_values=na_values,
keep_default_na=keep_default_na,
displayed_only=displayed_only,
extract_hrefs=extract_hrefs,
)
38 changes: 38 additions & 0 deletions pandas/tests/io/test_html.py
@@ -1286,3 +1286,41 @@ def test_parse_path_object(self, datapath):
    df1 = self.read_html(file_path_string)[0]
    df2 = self.read_html(file_path)[0]
    tm.assert_frame_equal(df1, df2)

def test_extract_hrefs(self):
    # GH 13141:
    # read_html argument to interpret hyperlinks as links (not merely text)
    result = self.read_html(
        """
        <table>
          <tr>
            <th>HTTP</th>
            <th>FTP</th>
            <th><a href="https://en.wiktionary.org/wiki/linkless">None</a></th>
          </tr>
          <tr>
            <td><a href="https://en.wikipedia.org/">Wikipedia</a></td>
            <td><a href="ftp://ftp.us.debian.org/">Debian</a></td>
            <td>Linkless</td>
          </tr>
        </table>
        """,
        extract_hrefs=True,
    )[0]

    expected = DataFrame(
        [
            [
                ("Wikipedia", "https://en.wikipedia.org/"),
                ("Debian", "ftp://ftp.us.debian.org/"),
                ("Linkless",),
            ]
        ],
        columns=(
            "HTTP",
            "FTP",
            "None",
        ),
    )

    tm.assert_frame_equal(result, expected)