Scrape HTML tables into Dataframes #3369

Closed
ghost opened this issue Apr 15, 2013 · 49 comments

Labels
Ideas Long-Term Enhancement Discussions

Comments

@ghost

ghost commented Apr 15, 2013

from ML: https://groups.google.com/forum/?fromgroups=#!topic/pydata/q7VVD8YeSLk

The user provides an HTML string from whatever source they like, or a URL.
Optionally specify a table id, or a regex to match against contained cell
content, to quickly single out the desired tables when multiple exist on the page.

Pseudo:

DataFrame.from_html('http://foo.com/tickers?sym=GOOG',match="high")

Aside: Perhaps not widely known, but Excel and co. can import tables directly
from online webpages, a cheap "no code" way to get the data into a form
directly readable by pandas.
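
A minimal sketch of what such a call could do under the hood, assuming bs4 is installed; the function name, signature, and header handling are illustrative only, not a settled API:

import re
from bs4 import BeautifulSoup
from pandas import DataFrame

def from_html(html, match='.+'):
    # Keep every <table> whose cell text matches the regex and build one
    # DataFrame per matching table, taking the first row as the header.
    soup = BeautifulSoup(html)
    pattern = re.compile(match)
    frames = []
    for table in soup.find_all('table'):
        if not pattern.search(table.get_text()):
            continue
        rows = [[cell.get_text(strip=True)
                 for cell in tr.find_all(['th', 'td'])]
                for tr in table.find_all('tr')]
        if rows:
            frames.append(DataFrame(rows[1:], columns=rows[0]))
    return frames

# e.g. frames = from_html(urllib2.urlopen('http://foo.com/tickers?sym=GOOG').read(),
#                         match='high')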

@cpcloud
Member

cpcloud commented Apr 18, 2013

Is the goal to have a parser written just for pandas so that there are no dependencies? Beautiful Soup + lxml does this quite effectively with a little coaxing. This would of course add dependencies, which I'm guessing is frowned upon to avoid feature bloat.

@ghost
Author

ghost commented Apr 18, 2013

Yep on all three, except the first.

@jreback
Contributor

jreback commented Apr 18, 2013

Why is it a problem to add an optional dependency on bs4 or lxml for this parsing? We do this for Excel.

If the user wants it, they can install it.

@ghost
Author

ghost commented Apr 18, 2013

I think so too. lxml, I think: it's faster, and recent versions have as good
support for CSS selectors as bs4.

@cpcloud
Member

cpcloud commented Apr 18, 2013

Bs4 can use lxml under the hood and is much easier to use than lxml.

@ghost
Author

ghost commented Apr 18, 2013

I've used both, and as I said, recent lxml is very usable. But it really doesn't matter what you use.
If you'd like to claim this, I'd happily move it to 0.12.

@cpcloud
Member

cpcloud commented Apr 18, 2013

Although I guess that doesn't really matter if you're just exposing this API and want to minimize deps.

@changhiskhan
Contributor

@jreback
Contributor

jreback commented Apr 18, 2013

Just do something like from_html(data, method='lxml', ...) and support whatever methods you want; raise if you can't deal with it. Or maybe flavor is a better name.
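
A rough sketch of what that kind of flavor/method keyword could look like, with hypothetical backend functions standing in for the real parsers (names illustrative):

def _parse_with_lxml(io, **kwargs):
    raise NotImplementedError  # placeholder backend

def _parse_with_bs4(io, **kwargs):
    raise NotImplementedError  # placeholder backend

_parsers = {'lxml': _parse_with_lxml, 'bs4': _parse_with_bs4}

def read_html(io, flavor='lxml', **kwargs):
    # Dispatch on the requested flavor and raise on anything we can't handle.
    try:
        parser = _parsers[flavor]
    except KeyError:
        raise ValueError('unknown flavor %r; valid flavors: %s'
                         % (flavor, sorted(_parsers)))
    return parser(io, **kwargs)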

@ghost
Author

ghost commented Apr 18, 2013

lxml is compiled and has some lib/.so deps; bs4's ability to abstract the underlying
parser (including using a pure Python solution) would make things smoother for Windows
users. I think that clinches it.

@cpcloud
Member

cpcloud commented Apr 18, 2013

Ok, so bs4 it is.

@cpcloud
Member

cpcloud commented Apr 18, 2013

Are there any type inference utility functions hidden in the guts of pandas?

@jreback
Contributor

jreback commented Apr 18, 2013

df.convert_objects(convert_numeric=True, convert_dates='coerce') will convert anything date-like and anything number-like into the correct dtypes (and set the rest to NaN/NaT). This might be too aggressive, so you could start with convert_dates=True (I think you want convert_numeric=True always) for this type of thing.

@cpcloud
Member

cpcloud commented Apr 18, 2013

Hm, df.convert_objects doesn't seem to work on even the most human-readable of dates, such as "March 17, 2005". E.g.,

from pandas import DataFrame
df = DataFrame(['March 25, 2005', 'March 27, 2001'])
df.convert_objects(convert_dates=True) # still strings

This could be hacked around by using dateutil.parser.parse but that's slow for a frame with many columns.

@jreback
Contributor

jreback commented Apr 18, 2013

That's a 'soft' conversion; to force it, use 'coerce' (equivalent to to_pydatetime on the column):

In [3]: df.convert_objects(convert_dates='coerce')
Out[3]: 
                    0
0 2005-03-25 00:00:00
1 2001-03-27 00:00:00

In [4]: df.convert_objects(convert_dates='coerce').dtypes
Out[4]: 
0    datetime64[ns]
dtype: object

FYI, obviously 0.11 is needed (which I assume you are using).

@cpcloud
Member

cpcloud commented Apr 18, 2013

oh whoops, of course I should've read the docs before saying anything :)

@jreback
Contributor

jreback commented Apr 18, 2013

Also, since this is an internal method, you might want to call these directly on the internal objects (e.g. self._data).

@cpcloud
Member

cpcloud commented Apr 18, 2013

OK. Still in the experimental stage of parsing different examples of HTML tables to get a feel for what the idiosyncrasies are. Nothing has been put in pandas yet; I've just got a file with a couple of functions.

@jreback
Contributor

jreback commented Apr 18, 2013

@cpcloud I would add an option, say convert=True, that controls whether you do object conversion inside your routine (and instead just return the data as type object). Possibly you could also accept a dtype parameter to control the conversions as well (maybe for v2 though).

@cpcloud
Member

cpcloud commented Apr 18, 2013

@jreback Thanks. So far it seems that there will have to be a user-facing parameter to control the location of the table, since there could be multiple tables on a page. I'm thinking of two ways: something like a table_number param that lets one provide the 0- (or 1-)based index of the table's location on the page starting from the top left, and/or passing the attrs dict from BeautifulSoup's methods, which allows one to pass a dict of HTML attributes to use for selecting elements. One thing that would be nice is an exact way to determine which tables are used to lay out the page and which aren't (I'm guessing hardcore web devs frown upon using <table> elements for layout). Will search around to see what's out there.
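
A small sketch of those two selection knobs, assuming bs4; attrs and table_number mirror the parameters being discussed and are not a final API:

from bs4 import BeautifulSoup

def select_table(html, attrs=None, table_number=0):
    # attrs: dict of HTML attributes passed straight through to find_all,
    #        e.g. {'id': 'constituents'} or {'class': 'wikitable'}
    # table_number: 0-based index into the tables that matched attrs
    soup = BeautifulSoup(html)
    tables = soup.find_all('table', attrs=attrs or {})
    if not tables:
        raise ValueError('no <table> matching %r was found' % (attrs,))
    return tables[table_number]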

@cpcloud
Member

cpcloud commented Apr 21, 2013

It would be great to have some URLs of tables that people are interested in parsing (other than the few that people have provided). I have a few examples, but it would be great to have a few more.

@ghost
Author

ghost commented Apr 21, 2013

@ghost
Author

ghost commented Apr 22, 2013

@cpcloud , how are you handling table selection when there are multiple
tables on the page?

@cpcloud
Member

cpcloud commented Apr 22, 2013

Right now I have a parameter called table_number that is an index into the list generated after soup.find_all('table', attrs) is called.


@ghost
Author

ghost commented Apr 22, 2013

Ideally, the user should be able to somehow specify the table he's after
by visual inspection of the rendered page, no matter what sort of weirdness
is going on in the markup (nested tables, tables quietly used for layout, etc.).

@cpcloud
Member

cpcloud commented Apr 22, 2013

@y-p Agreed. However, one problem is the plethora of HTML that is either a)
invalid or b) doesn't follow conventions that make this doable in a
reasonable amount of time (I think). So I opted for practicality here (in
actuality I just adopted the convention of the ImportHtml function from
google docs). The problem with that is that the first table may appear as
formatting, but the user might think that the first table is something
else. In that case trial and error is the way to do it. I'm happy to take
suggestions here. In the long term it might be best to subclass ParserBase,
if the ParserBase API is stable.


@ghost
Author

ghost commented Apr 22, 2013

  • Robustness: whatever the underlying parser dies on, dies.
  • Table selection: exactly my point, the user should (optionally) specify a regexp unique to a datum
    in the table (th/td.text), not an index.

Doesn't need to be fancy, it's just a "nice-to-have"; leave the corner cases for users to deal with.
Would be good if you could test on something other than English, just to make sure
Unicode nominally works.

@jreback
Contributor

jreback commented Apr 22, 2013

My 2c: if you don't specify table selection criteria (as above), I would return a list of all tables (kind of like returning a list of all elements, e.g. ElementTree type of stuff).

@cpcloud
Member

cpcloud commented Apr 22, 2013

@jreback Does that preclude having a class method DataFrame.from_html? I
suppose that method could raise if a single table couldn't be identified
with the input criteria and read_html could return a list of DataFrames
instead, although that seems inconsistent with read_csv and friends.


@jreback
Contributor

jreback commented Apr 22, 2013

These class methods are a bit clunky. For example, read_csv can return a Series or a frame, so what's the point of a DataFrame.from_csv? Mainly just a legacy convention, I think.

I think your new from_html should go in a new module, io.html (to which we should also add to_html).

That said, you could raise if you can't find a valid table (or if you find more than one when the user specified criteria).

@cpcloud
Member

cpcloud commented Apr 22, 2013

Should this support BeautifulSoup 3?

@ghost
Author

ghost commented Apr 22, 2013

no need.

@cpcloud
Member

cpcloud commented Apr 23, 2013

Some of my tests are so slow it's almost epic. tear :( I really hope this isn't my function and is just the URL. Must line_profile this before pushing it.

@jreback
Contributor

jreback commented Apr 23, 2013

Maybe download some test tables and include them as examples, then mark the other tests @network.

@ghost
Author

ghost commented Apr 23, 2013

careful with copyright issues, please.

@cpcloud
Member

cpcloud commented Apr 23, 2013

Only gov stuff will be included in the downloaded data sets.

@cpcloud
Member

cpcloud commented Apr 24, 2013

What are the criteria for marking a test as @slow?

@jreback
Contributor

jreback commented Apr 24, 2013

I would think most of yours would be @network?

@ghost
Author

ghost commented Apr 24, 2013

~300ms.

@cpcloud
Member

cpcloud commented Apr 24, 2013

Alrighty peoples I'm getting close to submitting a PR for this. Slight dilemma: I have two implementations. I have one that uses bs4 (and optionally, lxml) and the other that uses only lxml. The user facing API is the same. The lxml version is faster (anecdotal, I haven't measured anything yet, just looking at the nose output) but as was mentioned earlier there's the downside of having to install the binaries of lxml, which the bs4 implementation sidesteps with the disadvantage of being slower even when using lxml under the hood. Thoughts?

@jreback
Contributor

jreback commented Apr 24, 2013

Easy: list both bs4 and lxml as optional dependencies (add a mention in install.rst).

Try to import (I guess lxml first, then bs4), then raise if you can't do anything
(import them in try/except blocks in the method itself).
Also need to add them to the Travis full_deps build.
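
Roughly what that optional-dependency dance could look like inside the reader (a sketch of the try/except approach described above, not the code that was merged):

def _choose_parser(flavor=None):
    # Prefer lxml, fall back to bs4, and raise only if neither is usable.
    if flavor in (None, 'lxml'):
        try:
            import lxml.html  # imported only to check availability
            return 'lxml'
        except ImportError:
            if flavor == 'lxml':
                raise ImportError('lxml was requested but is not installed')
    try:
        import bs4  # imported only to check availability
        return 'bs4'
    except ImportError:
        raise ImportError('reading HTML requires either lxml or BeautifulSoup4')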

@cpcloud
Member

cpcloud commented Apr 25, 2013

Added to ci/install.sh already; is that OK? Builds are passing as we speak.

@cpcloud
Member

cpcloud commented Apr 26, 2013

One last issue: for the Forbes billionaires list I occasionally get different monetary separators in two immediately sequential calls to read_html (causing my Travis build to fail), and I can't figure out where this is coming from. What's even more strange is that most of the time this doesn't happen. Forcing a locale doesn't seem to have any effect on this issue. Is it possible that this is on the Forbes end, e.g. they are routing the different calls to servers in different parts of the world and the locale is changed somewhere along the line?

@ghost
Author

ghost commented Apr 27, 2013

Sounds like a wonky test, I'd leave it out.

@cpcloud
Member

cpcloud commented Apr 27, 2013

Yeah, did that. Using the rich Koreans table from Forbes instead. I have one last weirdness to fix; will submit as soon as that is worked out. For some reason XPath isn't returning all rows of a table when some are children of thead elements. Probably going to special-case it if I can't figure it out by tonight.
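
For the thead quirk, a sketch of an XPath that picks up header rows as well as body rows, assuming lxml (the file name and expression are illustrative):

from lxml import html

doc = html.fromstring(open('page.html').read())
for table in doc.xpath('//table'):
    # rows may sit directly under <table> or under <thead>/<tbody>/<tfoot>
    rows = table.xpath('./tr | ./thead/tr | ./tbody/tr | ./tfoot/tr')
    cells = [[cell.text_content().strip() for cell in row.xpath('./th | ./td')]
             for row in rows]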

@ghost
Author

ghost commented Apr 27, 2013

I meant those links to sanity-check your code, not necessarily to include
in the suite; better to just make it .gov sites.

I went through your branch and have some notes, but that can wait for a PR
when you're ready.

@cpcloud
Member

cpcloud commented Apr 27, 2013

Ok will do.

@cpcloud
Member

cpcloud commented May 4, 2013

Should this be closed?

@ghost
Author

ghost commented May 4, 2013

closed by #3477.

@ghost ghost closed this as completed May 4, 2013