Suggestions for html table parsing #7220

Closed
klonuo opened this issue May 23, 2014 · 6 comments · Fixed by #7323
Labels
Enhancement IO HTML read_html, to_html, Styler.apply, Styler.applymap Unicode Unicode strings

@klonuo
Contributor

klonuo commented May 23, 2014

Hi,

I have lxml, and when trying to parse html with tables through pandas I found some problems:

  1. HTMLParser is set to recover=False, contrary to lxml's default value. This is unfortunate because the parser will fail on any problem in the HTML page, such as a missing closing tag, which IMHO happens too often for this to be justifiable. If such strict rules are wanted, I'd suggest parsing the HTML document with the default parser values and then applying this restriction only to table fragments.
  2. HTMLParser will almost always deliver wrong content for tables encoded in anything other than ASCII. There is no encoding argument in pandas.io.html.read_html(), and lxml doesn't do magic unless the http-equiv attribute is correctly declared in the HTML document or an encoding argument is passed to HTMLParser.
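
To make point 2 concrete, here is a minimal standard-library sketch of the two sources of encoding information mentioned above: an explicit encoding argument and a charset declared in the document. The helper `decode_html` and its regex are hypothetical illustrations, not part of pandas or lxml, and the latin-1 fallback is an assumption chosen only to show what a wrong guess looks like.

```python
import re

# Hypothetical helper, not part of pandas or lxml. It looks for a charset
# declared in a <meta> tag (for example via http-equiv Content-Type).
_META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)


def decode_html(raw, encoding=None):
    """Decode raw HTML bytes to str before handing them to a parser."""
    if encoding is None:
        match = _META_CHARSET.search(raw)
        # Assumption: fall back to latin-1 when nothing is declared.
        encoding = match.group(1).decode("ascii") if match else "latin-1"
    return raw.decode(encoding)


table = "<table><tr><td>Bãcòn ìpsum</td></tr></table>".encode("utf-8")
meta = b'<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'

guessed = decode_html(table)                     # nothing declared: wrong guess
explicit = decode_html(table, encoding="utf-8")  # caller states the encoding
declared = decode_html(meta + table)             # charset read from the meta tag
```

Only `explicit` and `declared` contain the original cell text; `guessed` decodes without error but yields garbled characters, which is why a missing declaration goes unnoticed until the data is inspected.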
@cpcloud
Member

cpcloud commented May 23, 2014

@klonuo Thanks for the comments.

For 2, I think that's a fairly simple fix, and if you want to submit a PR that would be most welcome. Otherwise I can put something up for v0.14.1.

For 1, we chose recover=False because with incorrect HTML lxml will actually drop data and we decided that slower parsing was more acceptable than data loss. The problem with selectively applying restrictions is that if you lose data in the initial parse, there's no way to recover it.

@jreback jreback added this to the 0.14.1 milestone May 23, 2014
@klonuo
Contributor Author

klonuo commented May 23, 2014

@cpcloud Thanks for your prompt reply.

I think it's best if you could provide the fix, as I'm not familiar with pandas coding rules or internals. I just browsed the html module because I was getting errors, while parsing manually with lxml worked without problems.

As for 1, I reported what I find unfortunate. Again, I don't know libxml2 internals, but from what I've read the recovery mode should be a fairly advanced feature. IMHO disabling it shouldn't slow the parsing, nor should enabling it drop valuable data, but let's assume you ran tests before applying it. Still, a missing closing paragraph tag won't let me see the table ;)

@cpcloud
Member

cpcloud commented May 23, 2014

Yep we test everything and that is in fact how I discovered data were being dropped. I compared it with the results of the html5lib parser and the latter handles invalid HTML in a saner way than lxml. It was a conscious data-driven decision to have it behave this way.

@klonuo
Contributor Author

klonuo commented May 23, 2014

OK, your call. Maybe making it optional, defaulting to False, won't introduce too many arguments, while giving users an option to parse non-strict HTML instead of failing.

@cpcloud cpcloud self-assigned this Jun 3, 2014
@cpcloud
Member

cpcloud commented Jun 3, 2014

@klonuo Fair enough, simple tables don't work. Thanks for the report.

@cpcloud
Member

cpcloud commented Jun 3, 2014

In [48]: cat blah.html
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>0</th>
      <th>1</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>        Bãcòn ìpsum Ꮷòlor sít amët</td>
      <td>        Bãcòn ìpsum Ꮷòlor sít amët</td>
    </tr>
    <tr>
      <th>1</th>
      <td> Bɑɭl típ cɑpìcolã temρòr ρàrìàtur</td>
      <td> Bɑɭl típ cɑpìcolã temρòr ρàrìàtur</td>
    </tr>
  </tbody>
</table>
In [49]: pd.read_html('blah.html', index_col=0)
Out[49]:
[                                                   0  \
 0  Bãï½�òï½� ìï½�ï½�ï½�ï½� á�§Ã²ï½�ï½�ï½� ï½�Ã...
 1  B�ɭ� �í� ���ì���ã ��...

                                                    1
 0  Bãï½�òï½� ìï½�ï½�ï½�ï½� á�§Ã²ï½�ï½�ï½� ï½�Ã...
 1  B�ɭ� �í� ���ì���ã ��...  ]
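
The garbled output above is the kind of result produced when UTF-8 bytes are decoded with a single-byte encoding. The following standard-library sketch reproduces that mechanism in isolation; it is an illustration only, does not go through pandas or lxml, and the choice of latin-1 as the wrong encoding is an assumption.

```python
original = "Bãcòn ìpsum"

# Each non-ASCII character becomes two bytes in UTF-8.
raw = original.encode("utf-8")

# Decoding those bytes as latin-1 maps every byte to its own character,
# so each accented letter turns into two unrelated characters.
garbled = raw.decode("latin-1")

# The damage is reversible only while no bytes have been dropped or replaced.
repaired = garbled.encode("latin-1").decode("utf-8")

print(garbled)
print(repaired)
```

Once replacement characters appear, as in the output above, the original bytes are gone and the round trip no longer works, so the encoding has to be correct at parse time.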
