Suggestions for html table parsing #7220

klonuo · 2014-05-23T16:04:19Z

Hi,

I have lxml, and when trying to parse html with tables through pandas I found some problems:

1. HTMLParser is set to recover=False, contrary to lxml's default value. This is unfortunate because the parser will fail on any problem in the HTML page, such as a missing closing tag, which IMHO happens too often for this strictness to be justifiable. If such strict rules are needed, I'd suggest parsing the HTML document with the default parser values and then applying this restriction only to table fragments.
2. HTMLParser will almost always deliver wrong content for tables encoded in anything other than ASCII. There is no encoding argument in pandas.io.html.read_html(), and lxml doesn't do magic unless the http-equiv attribute is correctly declared in the HTML document or an encoding argument is passed to HTMLParser.
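Both points can be reproduced with lxml alone. The following is a minimal sketch, not the pandas implementation: the helper name `parse_cells` and the sample markup are made up for illustration, and which malformed inputs the strict parser actually rejects depends on the installed libxml2 version.

```python
from lxml import etree


def parse_cells(html_bytes, encoding=None, recover=True):
    """Return the text of every <td> cell in html_bytes (hypothetical helper).

    recover=True is lxml's own default; recover=False is the strict mode
    discussed in this issue. encoding tells the parser how to decode the
    bytes when the document does not declare a charset itself.
    """
    parser = etree.HTMLParser(recover=recover, encoding=encoding)
    root = etree.fromstring(html_bytes, parser)
    return [td.text for td in root.iter("td")]


if __name__ == "__main__":
    page = "<table><tr><td>Bãcòn</td><td>ìpsum</td></tr></table>".encode("utf-8")

    # Passing the encoding explicitly decodes the cells as intended.
    print(parse_cells(page, encoding="utf-8"))

    # A stray closing tag: the strict parser may reject the whole page,
    # while the recovering parser carries on.
    broken = b"<table><tr><td>a</td></tr></table></p>"
    try:
        print(parse_cells(broken, recover=False))
    except etree.XMLSyntaxError as exc:
        print("strict parse failed:", exc)
    print(parse_cells(broken, recover=True))
```

Without the `encoding` argument the result depends on libxml2's own detection, so no output is shown for that case.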

cpcloud · 2014-05-23T16:33:31Z

@klonuo Thanks for the comments.

For 2, I think that's a fairly simple fix, and if you want to submit a PR that would be most welcome. Otherwise I can put something up for v0.14.1.

For 1, we chose recover=False because with incorrect HTML lxml will actually drop data and we decided that slower parsing was more acceptable than data loss. The problem with selectively applying restrictions is that if you lose data in the initial parse, there's no way to recover it.

klonuo · 2014-05-23T16:51:51Z

@cpcloud Thanks for your prompt reply

I think it's best if you could provide the fix, as I'm not familiar with pandas coding rules or internals. I just browsed the html module because I was getting errors, while parsing manually with lxml didn't cause problems.

As for 1, I reported what I find unfortunate. Again, I don't know libxml2 internals, but from what I read the recovery feature should be pretty advanced. IMHO disabling it shouldn't slow the parsing, nor should enabling it drop valuable data, but let's assume you ran tests before applying it. Still, a missing closing paragraph tag won't let me see the table ;)

cpcloud · 2014-05-23T19:38:44Z

Yep we test everything and that is in fact how I discovered data were being dropped. I compared it with the results of the html5lib parser and the latter handles invalid HTML in a saner way than lxml. It was a conscious data-driven decision to have it behave this way.

klonuo · 2014-05-23T20:10:42Z

OK, your call. Maybe making it optional, defaulting to False, won't introduce too many arguments, while giving users an option to parse non-strict HTML instead of failing.

cpcloud · 2014-06-03T02:26:02Z

@klonuo Fair enough, simple tables don't work. Thanks for the report.

cpcloud · 2014-06-03T02:26:18Z

In [48]: cat blah.html
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>0</th>
      <th>1</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>        Bãｃòｎ ìｐｓｕｍ Ꮷòｌｏｒ ｓíｔ ａｍëｔ</td>
      <td>        Bãｃòｎ ìｐｓｕｍ Ꮷòｌｏｒ ｓíｔ ａｍëｔ</td>
    </tr>
    <tr>
      <th>1</th>
      <td> Bɑɭｌ ｔíｐ ｃɑｐìｃｏｌã ｔｅｍρòｒ ρàｒìàｔｕｒ</td>
      <td> Bɑɭｌ ｔíｐ ｃɑｐìｃｏｌã ｔｅｍρòｒ ρàｒìàｔｕｒ</td>
    </tr>
  </tbody>
</table>
In [49]: pd.read_html('blah.html', index_col=0)
Out[49]:
[                                                   0  \
 0  BÃ£ï½�Ã²ï½� Ã¬ï½�ï½�ï½�ï½� á�§Ã²ï½�ï½�ï½� ï½�Ã...
 1  BÉ�Éï½� ï½�Ãï½� ï½�É�ï½�Ã¬ï½�ï½�ï½�Ã£ ï½�ï½�...

                                                    1
 0  BÃ£ï½�Ã²ï½� Ã¬ï½�ï½�ï½�ï½� á�§Ã²ï½�ï½�ï½� ï½�Ã...
 1  BÉ�Éï½� ï½�Ãï½� ï½�É�ï½�Ã¬ï½�ï½�ï½�Ã£ ï½�ï½�...  ]
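The garbled output above has the typical shape of UTF-8 bytes decoded with a single-byte codec. A small standard-library sketch of that effect, assuming Latin-1 as the wrong codec (the thread does not say which fallback lxml actually used):

```python
# UTF-8 text as it would be stored in blah.html.
text = "Bãcòn"
raw = text.encode("utf-8")

# Decoding the same bytes as Latin-1 turns each multi-byte character
# into two unrelated characters (mojibake).
garbled = raw.decode("latin-1")
print(garbled)

# The damage is reversible as long as no bytes were dropped.
restored = garbled.encode("latin-1").decode("utf-8")
print(restored == text)

# With the encoding argument added in #7323, the call would look like
# this (not executed here; requires pandas and an HTML parser):
#   pd.read_html('blah.html', index_col=0, encoding='utf-8')
```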

jreback added Enhancement labels May 23, 2014

jreback added this to the 0.14.1 milestone May 23, 2014

cpcloud self-assigned this Jun 3, 2014

cpcloud mentioned this issue Jun 3, 2014

UNI/HTML/WIP: add encoding argument to read_html #7323

Merged

cpcloud closed this as completed in #7323 Jun 4, 2014

wesm unassigned cpcloud Oct 12, 2016

Amaelb mentioned this issue Nov 9, 2016

read_html() performance when HTML malformed #14312

Closed
