[ENH] pandas.read_html argument to interpret hyperlinks as links (not merely text) #13141

zeluspudding · 2016-05-11T15:56:37Z

For starters I'd just like to share that pandas is AWESOME! Thank you to all who contribute and make this lean, mean number crunching machine available to others like me.

pandas.read_html is super convenient for making dataframes from the web! The power of the web is that it's just that... a web of many highly interconnected data chunks. Being able to get the hyperlinks in an HTML table instead of the plain text that is hyperlinked would be a great productivity boost for web scraping.

Take the example of getting links to SEC forms. Here, one is probably more interested in getting the link to an insider trading disclosure form rather than merely the type of form (typically Form 4).

Baking this functionality into read_html also averts the need to double parse HTML as in this StackOverflow post - http://stackoverflow.com/questions/31771619/html-table-to-pandas-table-info-inside-html-tags
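
To make the double-parse workaround concrete, here is a minimal sketch that uses only the Python standard library to keep each cell's link next to its text. The class name `LinkTableParser`, the sample HTML, and the `/filings/form4-123.html` path are all made up for illustration. This is not how `read_html` works internally, and it deliberately keeps only the first link found in a cell.

```python
from html.parser import HTMLParser


class LinkTableParser(HTMLParser):
    """Collect table cells as (text, href) pairs; href is None for cells without a link."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of (text, href) tuples
        self._row = None    # row currently being built, or None outside <tr>
        self._text = None   # text fragments of the current cell, or None outside a cell
        self._href = None   # first href seen inside the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._text, self._href = [], None
        elif tag == "a" and self._text is not None and self._href is None:
            # Keep only the first link in a cell (a simplification).
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._text is not None:
            self._row.append(("".join(self._text).strip(), self._href))
            self._text = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Made-up sample table, loosely modelled on a list of SEC filings.
SAMPLE_HTML = """
<table>
  <tr><th>Form</th><th>Filed</th></tr>
  <tr><td><a href="/filings/form4-123.html">4</a></td><td>2016-05-11</td></tr>
</table>
"""

parser = LinkTableParser()
parser.feed(SAMPLE_HTML)
parser.close()
rows = parser.rows
```

Each row in `rows` is a list of `(text, href)` tuples, so the first row can serve as the header and the rest can be passed to `pandas.DataFrame` if a frame is wanted. The point of the request is that `read_html` could do this in a single pass.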

nono-london · 2019-08-18T17:46:12Z

Hi,
I am working on exactly the same type of project with SEC data, and have the same issue.
I was going through the pandas docs and wondered whether this could be done using a converter and a bs4 function to get the href, or whether converters will only receive the "text that is hyperlinked", which I suppose they will...

TomAugspurger · 2019-08-19T16:32:19Z

Probably in scope. I don't think anyone is working on this.

It seems like #14608 has the start of a proposed API.

hindamosh · 2019-08-27T11:10:36Z

Pandas is really awesome. I have the same suggestion: I am trying to scrape a website and get the href values directly, with no need for other packages or functions.

adamrossnelson · 2019-10-17T18:48:03Z

Another reference that could be useful in developing this additional feature, option, argument...

https://stackoverflow.com/questions/42285417/how-to-preserve-links-when-scraping-a-table-with-beautiful-soup-and-pandas

KIC · 2021-02-13T13:11:50Z

The linked SO answer does not work with merged cells (e.g. cells using rowspan), as in this example: https://www.interactivebrokers.com/en/index.php?f=1563&p=fut
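
One way to cope with merged cells, sketched here under assumptions, is to expand spans into a rectangular grid before building a DataFrame. The function name `expand_spans` and the input shape (each cell already parsed into a `(value, rowspan, colspan)` triple) are hypothetical, and the sample values are made up. The sketch assumes well-formed tables where spans do not overlap.

```python
def expand_spans(rows):
    """Expand rowspan/colspan so every row has one entry per grid column.

    rows: list of rows; each cell is a (value, rowspan, colspan) triple.
    Returns a list of lists in which a spanned value is repeated in every
    grid position it covers.
    """
    grid = []
    pending = {}  # column index -> (value, number of further rows it still covers)
    for cells in rows:
        out = []
        col = 0
        i = 0
        while i < len(cells) or col in pending:
            if col in pending:
                # This column is covered by a cell that started in an earlier row.
                value, remaining = pending[col]
                out.append(value)
                if remaining > 1:
                    pending[col] = (value, remaining - 1)
                else:
                    del pending[col]
                col += 1
                continue
            value, rowspan, colspan = cells[i]
            i += 1
            for _ in range(colspan):
                out.append(value)
                if rowspan > 1:
                    pending[col] = (value, rowspan - 1)
                col += 1
        grid.append(out)
    return grid


# Made-up input: the first cell spans two rows and carries a link.
parsed = [
    [(("Futures", "/fut"), 2, 1), (("ES", None), 1, 1)],
    [(("NQ", None), 1, 1)],
]
grid = expand_spans(parsed)
```

Here `grid` has two rows of two entries each, with the `("Futures", "/fut")` pair repeated in the second row so the link stays attached to every row the merged cell covers.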

Amaelb mentioned this issue Nov 7, 2016

Feature Request: expose full DOM nodes to converters in html_read #14608

Open

jbrockmendel added the IO HTML read_html, to_html, Styler.apply, Styler.applymap label Jul 25, 2018

mroeschke added the Enhancement label May 7, 2020

abmyii mentioned this issue Feb 13, 2022

ENH: pd.read_html argument to extract hrefs along with text from cells #45973

Merged

4 tasks

mroeschke closed this as completed in #45973 Aug 16, 2022

pauluhlenbruck mentioned this issue Sep 6, 2023

BUG: read_html() extracts only a single link from cells even if there are multiple in the same cell. #54708

Open

3 tasks
