read_html doesn't work for wikipedia tables #7762

haraldschilly · 2014-07-15T17:08:10Z

I assumed reading wikipedia html tables should work for read_html, but it returned a lot of garbage 😦

Example:

pd.io.html.read_html("https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_area", "Arizona")

I've seen several related issues, so maybe this is a useful test case.

Versions:

pandas: 0.14.0
Cython: 0.19.2
IPython: 2.1.0
bs4: 4.3.1
html5lib: 0.95-dev
lxml: 3.3.5
dateutil: 2.2


danielballan · 2014-07-26T14:03:58Z

I see what you mean. Specifically, I see three different problems here:

1. Pandas is confused by the nested headers. Pass skiprows=2 and set the column names yourself.
2. Pandas incorrectly tries to interpret some columns as dates, producing columns filled mostly with NaT. Passing infer_types=False fixes this. That option is marked for removal in version 0.14, but it still works for me in 0.14.1. I am not sure what the plans are there, but this example seems to illustrate that there is still a need for this functionality.
3. Wikipedia lets you interactively sort by column by incorporating a jQuery sortkey in the table. This confuses the HTML parser in some instances (rank, land, water): it mistakenly parses the table-sorting code as part of the text. In the result, you can see the correct rank number in there, but it is prefixed by strings like !B9993068528194. Pandas string processing tools can recover this. Fixing the underlying parsing issue is probably up to the HTML parsers (html5lib, lxml, bs4) and not in scope for pandas.

So, to get your work done, add the kwargs skiprows=2, infer_types=False, and use string methods like Series.str.extract or Series.str.split to fix the rank, land, and water columns.
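For the string-cleanup step, here is a minimal sketch of what that could look like. It skips the fetch entirely and starts from a small hand-made frame, so the sample values, the whitespace between the sort key and the real value, and the helper name strip_sortkey are all assumptions made for illustration, not what read_html actually returns for that page.

```python
import pandas as pd

# Hypothetical sample mimicking the garbled cells described above.
# The space between the "!B..." sort key and the value is an assumption.
raw = pd.DataFrame({
    "state": ["Alaska", "Texas", "California"],
    "rank": ["!B9993068528194 1", "!B9993068528194 2", "3"],
})


def strip_sortkey(col):
    """Drop an optional leading '!<sortkey>' token and keep the rest."""
    # The non-capturing group eats the sort key if present;
    # the capture group keeps whatever text follows it.
    return col.str.extract(r"^(?:!\S+\s+)?(.*)$", expand=False)


clean = raw.copy()
clean["rank"] = pd.to_numeric(strip_sortkey(raw["rank"]))
print(clean)
```

The same helper could be applied to the land and water columns. If the real cells turn out to have no separator after the sort key, the pattern would need adjusting.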

Can any other pandas folks comment on the plans for infer_types?

haraldschilly · 2014-07-26T14:13:40Z

Thanks for looking into this. After reporting it, I dug into this a bit and also noticed those odd sort columns and so on. So, well, it's not really up to pandas, but it is part of what's there. On the other hand, if this "table pattern" is indeed very common on Wikipedia, it could be worthwhile to add a "wikipedia" mode to the parser? (One that is activated automatically for *.wikipedia.org URLs) … Just an idea, it would be cool, that's all 1️⃣ 🆙 😀

cpcloud · 2014-07-26T14:21:09Z

Hm, infer_types is supposed to be obviated by using TextParser (this is where read_csv goes eventually), but something is amiss here. I'm not totally sure why the dates are so "greedy". Let me take a look.

cpcloud · 2014-07-26T14:49:39Z

@danielballan thanks for the explanation!

@danielballan @haraldschilly

I put up a PR to fix this: #7851. Check it out at your leisure.

jreback added the HTML label Jul 16, 2014

cpcloud added this to the 0.15.0 milestone Jul 26, 2014

cpcloud self-assigned this Jul 26, 2014

cpcloud added the Bug label Jul 26, 2014

cpcloud mentioned this issue Jul 26, 2014

BUG: fix greedy date parsing in read_html #7851

Merged

cpcloud closed this as completed in #7851 Jul 28, 2014

wesm unassigned cpcloud Oct 12, 2016
