Scrape HTML tables into DataFrames #3369
Is the goal to have a parser written just for pandas so that there are no dependencies? Beautiful Soup + lxml does this quite effectively with a little coaxing. This would of course add dependencies, which I'm guessing is frowned upon to avoid feature bloat.
yep on all three. except the first.
why is it a problem to add an optional dependency on bs4 or lxml for this parsing? we do this for Excel; if the user wants it they can install it.
I think so too. lxml, I think; it's faster and recent versions have as good …
bs4 can use lxml under the hood and is much easier to use than lxml.
I've used both, and as I said recent lxml is very usable. But it really doesn't matter what you use.
Although I guess that doesn't really matter if you're just exposing this API and want to minimize deps.
just do something like:
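A minimal sketch of the optional-import pattern being suggested here, consistent with the later comment about trying lxml first and then bs4; the function name and error message are illustrative, not actual pandas code:

```python
def _pick_html_backend():
    """Return the name of an available HTML parsing backend, preferring lxml."""
    try:
        import lxml.html  # fast, but needs compiled libxml2/libxslt
        return 'lxml'
    except ImportError:
        pass
    try:
        import bs4  # pure-Python install; can still use lxml underneath
        return 'bs4'
    except ImportError:
        raise ImportError("HTML table parsing requires either lxml or "
                          "BeautifulSoup4 to be installed")
```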
lxml is compiled and has some lib/.so deps; bs4's ability to abstract the underlying parser is a point in its favor.
Ok, so bs4 it is.
Are there any type inference utility functions hidden in the guts of pandas?
Hm,

```python
from pandas import DataFrame

df = DataFrame(['March 25, 2005', 'March 27, 2001'])
df.convert_objects(convert_dates=True)  # still strings
```

This could be hacked around by using …
that's a 'soft' conversion (to force it, use 'force'; equivalent to to_pydatetime on the column).
fyi, obviously 0.11 is needed (which I assume you are using).
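A short illustration of the soft vs. forced distinction, assuming the 0.11-era convert_objects API; note that in the released API the forcing value is spelled 'coerce':

```python
from pandas import DataFrame

df = DataFrame(['March 25, 2005', 'March 27, 2001'])

# 'Soft' conversion: leaves values alone unless the column already
# looks datetime-like, so these strings come back unchanged.
soft = df.convert_objects(convert_dates=True)

# Forced conversion: parse everything, with unparseable values
# becoming NaT instead of raising.
forced = df.convert_objects(convert_dates='coerce')
```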
oh whoops, of course I should've read the docs before saying anything :)
also... since this is an internal method you might want to call these directly on the internal objects (e.g. …)
ok, still in the experimental stage of parsing different examples of HTML tables to get a feel for what the idiosyncrasies are. Nothing has been put in pandas yet; I've just got a file with a couple of functions.
@cpcloud I would add an option, say …
@jreback Thanks. So far it seems that there will have to be a user-facing parameter to control the location of the table, since there could be multiple tables on a page. I'm thinking two ways: something like a …
It would be great to have some URLs of tables that people are interested in parsing (other than the few already provided). I have a few examples, but a few more would help.
@cpcloud, how are you handling table selection when there are multiple tables on the page?
Right now I have a parameter called table_number, an integer index that selects which table to parse.
Ideally, the user should be able to somehow specify the table he's after.
@y-p Agreed. However, one problem is the plethora of HTML that is either a) …
Doesn't need to be fancy, it's just a "nice-to-have"; leave the corner cases for users to deal with.
my 2c: if you don't specify table selection criteria (as above), I would return a list of all tables (kind of like returning a list of all elements, e.g. ElementTree type of stuff).
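For reference, the API that eventually landed via #3477 behaves essentially this way; the URL here is a placeholder:

```python
import pandas as pd

# No selection criteria: every <table> on the page comes back
# as its own DataFrame, in a list.
dfs = pd.read_html('http://example.com/stats.html')

# A match regex narrows that down to tables containing matching text;
# read_html raises if no table matches.
dfs = pd.read_html('http://example.com/stats.html', match='Population')
```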
@jreback Does that preclude having a class method DataFrame.from_html? I …
these class methods are a bit clunky. For example, read_csv can return a Series or a frame, so what's the point of DataFrame.from_csv? Mainly just a legacy convention, I think. And I think your new from_html should go in a new module. That said, you could raise if you can't find a valid table (or more than one, if the user specified criteria).
Should this support BeautifulSoup 3?
no need.
Some of my tests are so slow it's almost epic. tear :( I really hope this is the URL and not my function. Must line_profile this before pushing it.
maybe d/l some test tables and include them as examples, then mark …
careful with copyright issues, please.
only gov stuff will be included in the d/l'd data sets.
What are the criteria for marking a test as …?
I would think most of yours would be @network?
~300ms.
Alrighty peoples, I'm getting close to submitting a PR for this. Slight dilemma: I have two implementations. One uses bs4 (and optionally lxml), and the other uses only lxml. The user-facing API is the same. The lxml version is faster (anecdotally; I haven't measured anything yet, just looking at the nose output), but as was mentioned earlier there's the downside of having to install the lxml binaries, which the bs4 implementation sidesteps, at the cost of being slower even when using lxml under the hood. Thoughts?
easy: list both bs4 and lxml as optional dependencies (add a mention in install.rst), try to import (I guess lxml first, then bs4), then raise if you can't do anything.
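One way the shared user-facing API could dispatch between the two implementations, reusing _pick_html_backend from the earlier sketch; the flavor keyword and the two parser helpers are hypothetical names:

```python
_VALID_FLAVORS = ('lxml', 'bs4')

def _parse_html_tables(io, match='.+', flavor=None):
    """Dispatch to whichever backend is requested or available."""
    if flavor is None:
        flavor = _pick_html_backend()  # lxml if importable, else bs4
    if flavor not in _VALID_FLAVORS:
        raise ValueError('invalid flavor %r, valid flavors are %s'
                         % (flavor, _VALID_FLAVORS))
    if flavor == 'lxml':
        return _parse_with_lxml(io, match)  # hypothetical helper
    return _parse_with_bs4(io, match)       # hypothetical helper
```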
added to ci/install.sh already, is that ok? builds are passing as we speak.
One last issue: for the Forbes billionaires list I occasionally get different monetary separators in two immediately sequential calls to …
Sounds like a wonky test, I'd leave it out.
Yeah, did that. Using the rich Koreans table from Forbes. I have one last weirdness to fix; will submit as soon as that is worked out. For some reason XPath isn't returning all rows of a table when some are children of thead elements. Probably going to special-case it if I can't figure it out by tonight.
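One plausible cause of that weirdness, sketched with lxml: a relative XPath of 'tr' only matches direct children of the table element, so rows nested inside thead or tbody need the descendant axis:

```python
import lxml.html

doc = lxml.html.fromstring(
    '<html><body><table>'
    '<thead><tr><th>name</th><th>worth</th></tr></thead>'
    '<tbody><tr><td>Lee</td><td>$1B</td></tr></tbody>'
    '</table></body></html>'
)
table = doc.xpath('//table')[0]

print(len(table.xpath('tr')))     # 0: 'tr' matches direct children only
print(len(table.xpath('.//tr')))  # 2: the descendant axis finds nested rows
```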
I meant those links to sanity-check your code, not necessarily to include. I went through your branch and have some notes, but that can wait for a PR.
Ok, will do.
Should this be closed?
closed by #3477.
from the ML: https://groups.google.com/forum/?fromgroups=#!topic/pydata/q7VVD8YeSLk

User provides an HTML string from whatever source he likes, or a URL. Optionally specify a table id, or a regex (such as the catch-all '.+') to match against contained cell content, to quickly single out tables when multiple exist on the page.

Pseudo:
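A sketch of the interface just described, with hypothetical parameter names and the catch-all '.+' as the default match:

```python
def read_html(io, match='.+', table_id=None):
    """Parse HTML tables into DataFrames.

    io       : an HTML string, file path, or URL
    match    : regex matched against cell contents, to single out
               tables when multiple exist on the page
    table_id : optionally select a table by its id attribute

    Returns a list of DataFrames, one per matching table.
    """
    ...
```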
Aside: Perhaps not widely known, but Excel and co. can import tables directly from online webpages, a cheap "no code" way to get the data into a form directly readable by pandas.