read_html doesn't work for wikipedia tables #7762

Closed
haraldschilly opened this issue Jul 15, 2014 · 4 comments · Fixed by #7851
Labels
Bug IO HTML read_html, to_html, Styler.apply, Styler.applymap

Comments

@haraldschilly

I assumed reading wikipedia html tables should work for read_html, but it returned a lot of garbage 😦

Example:

pd.io.html.read_html("https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_area", "Arizona")

I've seen several related issues, so maybe this is a useful test case.

Versions:

pandas: 0.14.0
Cython: 0.19.2
IPython: 2.1.0
bs4: 4.3.1
html5lib: 0.95-dev
lxml: 3.3.5
dateutil: 2.2
@jreback jreback added the HTML label Jul 16, 2014
@danielballan
Contributor

I see what you mean. Specifically, I see three different problems here:

  1. Pandas is confused by the nested headers. Pass skiprows=2 and set the column names yourself.
  2. Pandas incorrectly tries to interpret some columns as dates, producing columns filled mostly with NaT. Passing infer_types=False fixes this. That option is marked for removal in version 0.14, but it still works for me in 0.14.1. I am not sure what the plans are there, but this example seems to illustrate that there is still a need for this functionality.
  3. Wikipedia lets you interactively sort by the columns by incorporating a jquery sortkey in the table. This confuses the HTML parser in some instances (rank, land, water). It is mistakenly parsing the table-sorting code as part of the text. In the result, you can see the correct rank number in there, but it is prefixed by strings like !B9993068528194. Pandas string processing tools can recover this. Fixing the underlying parsing issue is probably up to the HTML parsers (html5lib, lxml, bs4) and not in scope for pandas.

So, to get your work done, add the kwargs skiprows=2, infer_types=False, and use string methods like Series.str.extract or Series.str.split to fix the rank, land, and water columns.
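A minimal sketch of the string cleanup step, assuming the affected cells look like a sort key followed by the real value. The sample values below are made up for illustration; the actual cell contents depend on the page markup and the parser in use.

```python
import pandas as pd

# Hypothetical sample of a "rank" column polluted by sort-key prefixes.
# The last entry shows a cell that was parsed cleanly.
rank = pd.Series(["!B9993068528194 1", "!B9989013877113 2", "3"])

# Option 1: capture only the trailing run of digits with Series.str.extract.
# expand=False returns a Series because the pattern has a single group.
rank_extract = rank.str.extract(r"(\d+)\s*$", expand=False).astype(int)

# Option 2: split on whitespace and keep the last token with Series.str.split.
rank_split = rank.str.split().str[-1].astype(int)

print(rank_extract.tolist())
print(rank_split.tolist())
```

The extract variant is a little more forgiving if the prefix and the value are not separated by whitespace in exactly the same way in every cell; the split variant is shorter when they are.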

Can any other pandas folks comment on the plans for infer_types?

@haraldschilly
Author

Thanks for looking into this. After reporting it, I dug in a bit and also noticed those odd sort columns and so on. So, well, it's not really up to pandas, but it is part of what's out there. On the other hand, if this "table pattern" is indeed very common on Wikipedia, it could be worthwhile to add a "wikipedia" mode to the parser (activated automatically for *.wikipedia.org URLs)? … Just an idea, it would be cool, that's all 1️⃣ 🆙 😀

@cpcloud cpcloud added this to the 0.15.0 milestone Jul 26, 2014
@cpcloud cpcloud self-assigned this Jul 26, 2014
@cpcloud cpcloud added the Bug label Jul 26, 2014
@cpcloud
Member

cpcloud commented Jul 26, 2014

Hm, infer_types is supposed to be obviated by using TextParser (this is where read_csv goes eventually), but something is amiss here. I'm not totally sure why the dates are so "greedy". Let me take a look.

@cpcloud
Member

cpcloud commented Jul 26, 2014

@danielballan thanks for the explanation!

@danielballan @haraldschilly

I put up a PR to fix this: #7851. Check it out at your leisure.
