Scrape HTML tables into Dataframes #3369

Closed
ghost opened this issue Apr 15, 2013 · 49 comments

Labels
Ideas Long-Term Enhancement Discussions

Comments

@ghost

ghost commented Apr 15, 2013

from ML: https://groups.google.com/forum/?fromgroups=#!topic/pydata/q7VVD8YeSLk

The user provides an HTML string from whatever source they like, or a URL.
Optionally specify a table id, or a regex to match against contained cell
content, to quickly single out the desired tables when multiple exist on the page.

Pseudo:

DataFrame.from_html('http://foo.com/tickers?sym=GOOG',match="high")

Aside: Perhaps not widely known, but Excel and co. can import tables directly
from online webpages, a cheap "no code" way to get the data into a form
directly readable by pandas.
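
A minimal sketch of what such a call could do under the hood, assuming bs4 is installed; the function name, signature, and header handling are illustrative only, not a settled API:

import re
from bs4 import BeautifulSoup
from pandas import DataFrame

def from_html(html, match='.+'):
    # Keep every <table> whose cell text matches the regex and build one
    # DataFrame per matching table, taking the first row as the header.
    soup = BeautifulSoup(html)
    pattern = re.compile(match)
    frames = []
    for table in soup.find_all('table'):
        if not pattern.search(table.get_text()):
            continue
        rows = [[cell.get_text(strip=True)
                 for cell in tr.find_all(['th', 'td'])]
                for tr in table.find_all('tr')]
        if rows:
            frames.append(DataFrame(rows[1:], columns=rows[0]))
    return frames

# e.g. frames = from_html(urllib2.urlopen('http://foo.com/tickers?sym=GOOG').read(),
#                         match='high')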

@cpcloud
Member

cpcloud commented Apr 18, 2013

Is the goal to have a parser written just for pandas so that there are no dependencies? Beautiful Soup + lxml does this quite effectively with a little coaxing. This would of course add dependencies, which I'm guessing is frowned upon to avoid feature bloat.

@ghost
Author

ghost commented Apr 18, 2013

Yep on all three, except the first.

@jreback
Contributor

jreback commented Apr 18, 2013

Why is it a problem to add an optional dependency on bs4 or lxml for this parsing? We do this for Excel.

If the user wants it, they can install it.

@ghost
Author

ghost commented Apr 18, 2013

I think so too. lxml, I think: it's faster, and recent versions have as good
support for CSS selectors as bs4.

@cpcloud
Member

cpcloud commented Apr 18, 2013

Bs4 can use lxml under the hood and is much easier to use than lxml.

@ghost
Author

ghost commented Apr 18, 2013

I've used both, and as I said, recent lxml is very usable. But it really doesn't matter what you use.
If you'd like to claim this, I'd happily move it to 0.12.

@cpcloud
Member

cpcloud commented Apr 18, 2013

Although I guess that doesn't really matter if you're just exposing this API and want to minimize deps.

@changhiskhan
Contributor

@jreback
Contributor

jreback commented Apr 18, 2013

Just do something like from_html(data, method='lxml', ...) and support whatever methods you want; raise if you can't deal with it. Or maybe flavor is a better name.
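
A rough sketch of what that kind of flavor/method keyword could look like, with hypothetical backend functions standing in for the real parsers (names illustrative):

def _parse_with_lxml(io, **kwargs):
    raise NotImplementedError  # placeholder backend

def _parse_with_bs4(io, **kwargs):
    raise NotImplementedError  # placeholder backend

_parsers = {'lxml': _parse_with_lxml, 'bs4': _parse_with_bs4}

def read_html(io, flavor='lxml', **kwargs):
    # Dispatch on the requested flavor and raise on anything we can't handle.
    try:
        parser = _parsers[flavor]
    except KeyError:
        raise ValueError('unknown flavor %r; valid flavors: %s'
                         % (flavor, sorted(_parsers)))
    return parser(io, **kwargs)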

@ghost
Author

ghost commented Apr 18, 2013

lxml is compiled and has some lib/.so deps; bs4's ability to abstract the underlying
parser (including using a pure Python solution) would make things smoother for Windows
users. I think that clinches it.

@cpcloud
Member

cpcloud commented Apr 18, 2013

Ok, so bs4 it is.

@cpcloud
Member

cpcloud commented Apr 18, 2013

Are there any type inference utility functions hidden in the guts of pandas?

@jreback
Contributor

jreback commented Apr 18, 2013

df.convert_objects(convert_numeric=True, convert_dates='coerce') will convert anything date-like and anything number-like into the correct dtypes (and set the rest to NaN/NaT). This might be too aggressive, so you could start with convert_dates=True (I think you want convert_numeric=True always) for this type of thing.

@cpcloud
Member

cpcloud commented Apr 18, 2013

Hm, df.convert_objects doesn't seem to work on even the most human-readable of dates, such as "March 17, 2005". E.g.,

from pandas import DataFrame
df = DataFrame(['March 25, 2005', 'March 27, 2001'])
df.convert_objects(convert_dates=True) # still strings

This could be hacked around by using dateutil.parser.parse but that's slow for a frame with many columns.

@jreback
Contributor

jreback commented Apr 18, 2013

That's a 'soft' conversion; to force it, use 'coerce' (equivalent to to_pydatetime on the column):

In [3]: df.convert_objects(convert_dates='coerce')
Out[3]: 
                    0
0 2005-03-25 00:00:00
1 2001-03-27 00:00:00

In [4]: df.convert_objects(convert_dates='coerce').dtypes
Out[4]: 
0    datetime64[ns]
dtype: object

FYI, obviously 0.11 is needed (which I assume you are using).

@cpcloud
Member

cpcloud commented Apr 18, 2013

oh whoops, of course I should've read the docs before saying anything :)

@jreback
Contributor

jreback commented Apr 18, 2013

Also, since this is an internal method, you might want to call these directly on the internal objects (e.g. self._data).

@cpcloud
Member

cpcloud commented Apr 18, 2013

OK. Still in the experimental stage of parsing different examples of HTML tables to get a feel for what the idiosyncrasies are. Nothing has been put in pandas yet; I've just got a file with a couple of functions.

@jreback
Contributor

jreback commented Apr 18, 2013

@cpcloud I would add an option, say convert=True, that controls whether you do object conversion inside your routine (and instead just return the data as type object). Possibly you could also accept a dtype parameter to control the conversions as well (maybe for v2 though).

@cpcloud
Member

cpcloud commented Apr 18, 2013

@jreback Thanks. So far it seems that there will have to be a user-facing parameter to control the location of the table, since there could be multiple tables on a page. I'm thinking of two ways: something like a table_number param that lets one provide the 0- (or 1-)based index of the table's location on the page starting from the top left, and/or passing the attrs dict from BeautifulSoup's methods, which allows one to pass a dict of HTML attributes to use for selecting elements. One thing that would be nice is an exact way to determine which tables are used to lay out the page and which aren't (I'm guessing hardcore web devs frown upon using <table> elements for layout). Will search around to see what's out there.
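
A small sketch of those two selection knobs, assuming bs4; attrs and table_number mirror the parameters being discussed and are not a final API:

from bs4 import BeautifulSoup

def select_table(html, attrs=None, table_number=0):
    # attrs: dict of HTML attributes passed straight through to find_all,
    #        e.g. {'id': 'constituents'} or {'class': 'wikitable'}
    # table_number: 0-based index into the tables that matched attrs
    soup = BeautifulSoup(html)
    tables = soup.find_all('table', attrs=attrs or {})
    if not tables:
        raise ValueError('no <table> matching %r was found' % (attrs,))
    return tables[table_number]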

@cpcloud
Member

cpcloud commented Apr 21, 2013

It would be great to have some URLs of tables that people are interested in parsing (other than the few that people have provided). I have a few examples, but it would be great to have a few more.

@ghost
Author

ghost commented Apr 21, 2013

@ghost
Author

ghost commented Apr 22, 2013

@cpcloud , how are you handling table selection when there are multiple
tables on the page?

@cpcloud
Member

cpcloud commented Apr 22, 2013

Right now I have a parameter called table_number that is an index into the list generated after soup.find_all('table', attrs) is called.


@ghost
Author

ghost commented Apr 22, 2013

Ideally, the user should be able to somehow specify the table he's after
by visual inspection of the rendered page, no matter what sort of weirdness
is going on in the markup (nested tables, tables quietly used for layout, etc.).

@cpcloud
Member

cpcloud commented Apr 22, 2013

@y-p Agreed. However, one problem is the plethora of HTML that is either a)
invalid or b) doesn't follow conventions that make this doable in a
reasonable amount of time (I think). So I opted for practicality here (in
actuality I just adopted the convention of the ImportHtml function from
google docs). The problem with that is that the first table may appear as
formatting, but the user might think that the first table is something
else. In that case trial and error is the way to do it. I'm happy to take
suggestions here. In the long term it might be best to subclass ParserBase,
if the ParserBase API is stable.


@ghost
Author

ghost commented Apr 22, 2013

  • Robustness: whatever the underlying parser dies on, dies.
  • Table selection: exactly my point, the user should (optionally) specify a regexp unique to a datum
    in the table (th/td.text), not an index.

Doesn't need to be fancy, it's just a "nice-to-have"; leave the corner cases for users to deal with.
Would be good if you could test on something other than English, just to make sure
Unicode nominally works.

@jreback
Contributor

jreback commented Apr 22, 2013

My 2c: if you don't specify table selection criteria (as above), I would return a list of all tables (kind of like returning a list of all elements, e.g. ElementTree type of stuff).

@cpcloud
Member

cpcloud commented Apr 22, 2013

@jreback Does that preclude having a class method DataFrame.from_html? I
suppose that method could raise if a single table couldn't be identified
with the input criteria and read_html could return a list of DataFrames
instead, although that seems inconsistent with read_csv and friends.


@jreback
Contributor

jreback commented Apr 22, 2013

These class methods are a bit clunky. For example, read_csv can return a Series or a frame, so what's the point of a DataFrame.from_csv? Mainly just a legacy convention, I think.

I think your new from_html should go in a new module, io.html (to which we should also add to_html).

That said, you could raise if you can't find a valid table (or if you find more than one when the user specified criteria).

@cpcloud
Member

cpcloud commented Apr 22, 2013

Should this support BeautifulSoup 3?

@ghost
Author

ghost commented Apr 22, 2013

no need.

@cpcloud
Member

cpcloud commented Apr 23, 2013

Some of my tests are so slow it's almost epic. tear :( I really hope this isn't my function and is just the URL. Must line_profile this before pushing it.

@jreback
Contributor

jreback commented Apr 23, 2013

Maybe download some test tables and include them as examples, then mark the other tests @network.

@ghost
Author

ghost commented Apr 23, 2013

careful with copyright issues, please.

@cpcloud
Member

cpcloud commented Apr 23, 2013

Only gov stuff will be included in the downloaded data sets.

@cpcloud
Member

cpcloud commented Apr 24, 2013

What are the criteria for marking a test as @slow?

@jreback
Contributor

jreback commented Apr 24, 2013

I would think most of yours would be @network?

@ghost
Author

ghost commented Apr 24, 2013

~300ms.

@cpcloud
Member

cpcloud commented Apr 24, 2013

Alrighty peoples I'm getting close to submitting a PR for this. Slight dilemma: I have two implementations. I have one that uses bs4 (and optionally, lxml) and the other that uses only lxml. The user facing API is the same. The lxml version is faster (anecdotal, I haven't measured anything yet, just looking at the nose output) but as was mentioned earlier there's the downside of having to install the binaries of lxml, which the bs4 implementation sidesteps with the disadvantage of being slower even when using lxml under the hood. Thoughts?

@jreback
Contributor

jreback commented Apr 24, 2013

Easy: list both bs4 and lxml as optional dependencies (add a mention in install.rst).

Try to import (I guess lxml first, then bs4), then raise if you can't do anything
(import them in try/except blocks in the method itself).
Also need to add them to the Travis full_deps build.
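
Roughly what that optional-dependency dance could look like inside the reader (a sketch of the try/except approach described above, not the code that was merged):

def _choose_parser(flavor=None):
    # Prefer lxml, fall back to bs4, and raise only if neither is usable.
    if flavor in (None, 'lxml'):
        try:
            import lxml.html  # imported only to check availability
            return 'lxml'
        except ImportError:
            if flavor == 'lxml':
                raise ImportError('lxml was requested but is not installed')
    try:
        import bs4  # imported only to check availability
        return 'bs4'
    except ImportError:
        raise ImportError('reading HTML requires either lxml or BeautifulSoup4')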

@cpcloud
Member

cpcloud commented Apr 25, 2013

Added to ci/install.sh already; is that OK? Builds are passing as we speak.

@cpcloud
Member

cpcloud commented Apr 26, 2013

One last issue: for the Forbes billionaires list I occasionally get different monetary separators in two immediately sequential calls to read_html (causing my Travis build to fail), and I can't figure out where this is coming from. What's even more strange is that most of the time this doesn't happen. Forcing a locale doesn't seem to have any effect on this issue. Is it possible that this is on the Forbes end, e.g. they are routing the different calls to servers in different parts of the world and the locale is changed somewhere along the line?

@ghost
Author

ghost commented Apr 27, 2013

Sounds like a wonky test, I'd leave it out.

@cpcloud
Member

cpcloud commented Apr 27, 2013

Yeah, did that. Using the rich Koreans table from Forbes instead. I have one last weirdness to fix; will submit as soon as that is worked out. For some reason XPath isn't returning all rows of a table when some are children of thead elements. Probably going to special-case it if I can't figure it out by tonight.
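
For the thead quirk, a sketch of an XPath that picks up header rows as well as body rows, assuming lxml (the file name and expression are illustrative):

from lxml import html

doc = html.fromstring(open('page.html').read())
for table in doc.xpath('//table'):
    # rows may sit directly under <table> or under <thead>/<tbody>/<tfoot>
    rows = table.xpath('./tr | ./thead/tr | ./tbody/tr | ./tfoot/tr')
    cells = [[cell.text_content().strip() for cell in row.xpath('./th | ./td')]
             for row in rows]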

@ghost
Author

ghost commented Apr 27, 2013

I meant those links to sanity-check your code, not necessarily to include
in the suite; better to just make it .gov sites.

I went through your branch and have some notes, but that can wait for a PR
when you're ready.

@cpcloud
Member

cpcloud commented Apr 27, 2013

Ok will do.

@cpcloud
Member

cpcloud commented May 4, 2013

Should this be closed?

@ghost
Author

ghost commented May 4, 2013

closed by #3477.

@ghost ghost closed this as completed May 4, 2013