ENH: Added lxml-liberal html parsing flavor #5131


Closed
wants to merge 3 commits into from

Conversation

cancan101
Contributor

Closes #5130


Warning
-------
This parser can only handle HTTP, FTP, and FILE urls.
Member

you can put this in Notes

@cancan101
Contributor Author

@cpcloud Fixed


    dfs = self.read_html(banklist_data, flavor=['lxml-liberal'])
    for df in dfs:
        tm.assert_isinstance(df, DataFrame)
Contributor

this is a pretty minimal test case, don't you have expectations for results?

Contributor Author

@jtratner Well, TBH, the only guarantees that lxml makes with recover=True are:

"There is also no guarantee that the resulting tree will contain all data from the original document."

and

"It will not raise an exception on parser errors."

so I am not sure that I want to check anything other than no errors.
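Those two guarantees can be seen in a standalone sketch (illustrative only, not pandas code): with recover=True, lxml keeps whatever it can salvage from broken markup instead of raising.

```python
# Illustration of lxml's recover=True behavior quoted above: broken
# markup does not raise; whatever survived recovery is in the tree.
from lxml.etree import HTMLParser, fromstring

broken = "<table><tr><td>cell</foo>"  # unclosed table, mismatched close tag
tree = fromstring(broken, parser=HTMLParser(recover=True))

# No exception was raised; the salvaged <td> is still reachable.
cells = [td.text for td in tree.findall(".//td")]
```

There is, however, no promise about *which* elements survive, which is exactly the point being debated here.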

Member

The problem is that using non-strict lxml will be missing a row ... I don't remember exactly which one, but @cancan101 you could check for that ... a little annoying to figure out which one it is, but it shouldn't be too hard with this new functionality.

Contributor Author

@cpcloud What exactly should I put in the test? That the parse from bs4 != parse from lxml-liberal?

Contributor Author

@cpcloud bump

Member

I can take a look later tonight

Member

@cancan101

That the parse from bs4 != parse from lxml-liberal?

That's a possibility. I actually think that, because we don't know the specific reasons why lxml will remove things without delving deep into its parser, just testing that a non-empty DataFrame is returned is okay.

So instead of just tm.assert_isinstance(), do that plus add something like self.assertFalse(df.empty), then add the bs4 comparison test, and this should be okay.
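The suggested strengthening of the test can be sketched as follows (a hypothetical standalone helper, not the actual test-class code, which uses tm.assert_isinstance and self.assertFalse):

```python
import pandas as pd

# Hypothetical helper mirroring the suggested assertions: each parsed
# table must be a DataFrame AND must be non-empty.
def check_tables(dfs):
    for df in dfs:
        assert isinstance(df, pd.DataFrame)  # type check
        assert not df.empty                  # non-empty check

# Example usage with a stand-in parse result:
check_tables([pd.DataFrame({"Bank Name": ["First Federal"]})])
```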

Contributor Author

@cpcloud let me know if that looks okay

Contributor Author

@cpcloud Anything else to get this merged?

@jtratner
Contributor

jtratner commented Oct 6, 2013

Again, I think you can handle the whole thing with a lambda and without the long docstring (just add a note about what strict means to _LxmlFrameParser). What's the interface for the parsers? Are they all 3 attributes + potentially strict as a keyword, or can they have more?

@cancan101
Contributor Author

It seems like a good idea for there to be a general way of passing settings to the underlying parser.

@cpcloud
Member

cpcloud commented Oct 6, 2013

What other settings are there that you'd need? I'd like to keep this free of settings and have it "just work" for most tables ... @cancan101 Your tables happen to be very difficult (which is great for testing), but for the majority of cases, tweaking the parser is too low-level for most users IMHO

@cancan101
Contributor Author

What about changing the parse_tables call to take match and attrs and leaving just io and any other parser settings on the init method?

@cancan101
Contributor Author

@cpcloud As of now, there are no other settings that I can think of.

@cancan101
Contributor Author

@jtratner @cpcloud Heck, you could even move io, match and attrs all to parse_tables and leave the __init__ for JUST the parser settings (i.e. zero arguments for most classes).

@cancan101
Contributor Author

I am not sure what you get by saving off:

    self.io = io
    self.match = match
    self.attrs = attrs

in the constructor. It means that you have to remake the parser for each new io.

You could instead change this:

    p = parser(io, compiled_match, attrs)

    try:
        tables = p.parse_tables()

to:

    p = parser()

    try:
        tables = p.parse_tables(io, compiled_match, attrs)
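The proposed split can be sketched in miniature (hypothetical class and argument names, not the actual pandas internals):

```python
# Hypothetical sketch of the proposed API: parser *settings* live on the
# instance; the *data* to parse is passed per call, so one parser
# instance can be reused across many inputs.
class FrameParser:
    def __init__(self, strict=True):
        self.strict = strict  # a parser setting, fixed at construction

    def parse_tables(self, io, match, attrs):
        # A real implementation would parse `io` and search for tables
        # matching `match`/`attrs`; here we just echo the inputs.
        return [(io, match, attrs)]

p = FrameParser(strict=False)                    # built once, settings only
t1 = p.parse_tables("a.html", "Bank", {})        # first input
t2 = p.parse_tables("b.html", "Bank", {})        # parser reused, no rebuild
```

This makes the setting/data distinction explicit: anything in `__init__` configures the parser; anything in `parse_tables` is the work item.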

@cpcloud
Member

cpcloud commented Oct 6, 2013

not sure i totally follow you .... why don't you put up some code and we can take a look

@cancan101
Contributor Author

The idea was to move around when data is passed to the parser. The parser would then be constructed with settings, and the parse-specific data (for example io) would be passed on the call to parse_tables. This would make it clearer what is a setting for the parser and what is the data the parser is applied to.

@cancan101
Contributor Author

It is not central to the point about strict, just some musings on how to clean up the parser API.

@cpcloud
Member

cpcloud commented Oct 6, 2013

oh i c...fine with me ... just can't break any existing tests!

- Use partial for lxml-liberal rather than a new class
@cancan101
Contributor Author

@cpcloud Let me know what you think of the API change. Tests should pass (without any changes needed).

@ghost ghost assigned cpcloud Oct 8, 2013
@cancan101
Contributor Author

If this is good, I will squash and rebase.

@cancan101
Contributor Author

bump?

@cancan101
Contributor Author

@cpcloud Let me know if there is anything left to do here? Then I can squash and rebase.

@cancan101
Contributor Author

bump ?


@cancan101
Contributor Author

Any way to get this in 0.13?

@jreback
Contributor

jreback commented Oct 30, 2013

@cpcloud ?

@cpcloud
Member

cpcloud commented Oct 30, 2013

@cancan101 Sorry for the long delay. Been crazy busy.

After mulling this over I'm going to have to say I'm 👎 on accepting this.

I think that a parser that gives unpredictable output is limited in its utility, mainly because of the nature of the errors that are seen. Allowing that unpredictability is only going to create headaches for maintainers.

My reasoning is a bit hand-wavy because it boils down to how much I trust these libs. I trust html5lib + bs4 more than I trust lxml, simply from my experience using them. Keep in mind that it is up to the implementer(s) of an HTML parser to decide what happens when a parse fails.

For example, lxml drops tags without warning (by default, no less) from the FDIC bank list data set, where html5lib + bs4 does what you would expect. lxml works fine on HTML such as that from DataFrame.to_html() (technically not valid HTML), and on valid HTML.

I actually don't think lxml is useful unless you're using it to parse XML, which in my experience is more likely to be correct (FWIW, I used to think lxml was the bee's knees until I had to spend some time figuring out why it wasn't giving me sane output from admittedly insane input). I would be in favor of removing lxml entirely, but for back compat we can't.

I really really hope this does not discourage you from contributing more. You could certainly take the parser cleanup you've done here and open a separate PR!

@cancan101
Contributor Author

One of my concerns with bs4 is that on certain pages, linked to on either this PR or another issue, it enters an infinite loop.

As for unpredictability of results, I'm not sure that should be the deciding factor. I can only imagine when and if a PDF parser is written, in that regard.

If this PR does not get accepted, then at the very least I do want to make sure that I can use this same code from "user land", because I do need it to parse pages that the other parsers can't handle.

@jtratner
Contributor

Additionally, if we add something, we then have to support it for the foreseeable future and allowing something that has undefined behavior is a big compatibility headache. We don't want to be responsible for lxml-liberal's output not changing if we make other changes to the module or how we approach the parsers.

@cancan101
Contributor Author

Correct me if I'm wrong, but that unpredictability can be addressed in the docs. I know I have seen bs4 do unpredictable things (beyond the infinite loop; I can try to dig up an example).

@jreback
Contributor

jreback commented Oct 30, 2013

@cancan101 why don't u change this to allow flavor to be a class/callable, and if it is, then use that as the parser. Then in user land u can subclass the current parser, pass it in, and make it do what you want?
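That dispatch could look roughly like this (a hypothetical sketch of the suggestion, with stand-in names; not the actual read_html internals):

```python
# Hypothetical sketch: `flavor` may be a built-in flavor string or a
# user-supplied parser class, which is used directly.
_VALID_PARSERS = {"lxml": "LxmlParser", "bs4": "Bs4Parser"}  # stand-ins

def resolve_parser(flavor):
    if callable(flavor):            # user-land parser class passed directly
        return flavor
    return _VALID_PARSERS[flavor]   # look up a built-in flavor by name

class MyLiberalParser:
    """A user-land subclass standing in for a custom lxml parser."""

builtin = resolve_parser("lxml")          # string -> registered parser
custom = resolve_parser(MyLiberalParser)  # class  -> used as-is
```

With this in place, "lxml-liberal" would not need to live in pandas at all; users who need it could keep the subclass in their own code.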

@cancan101
Contributor Author

I actually think I have an open issue to do just that. That solution is fine as long as the current lxml parser class is reasonably extensible.

@cpcloud
Member

cpcloud commented Oct 30, 2013

👍 on allowing subclasses; then you can do whatever you want as long as you implement the interface.

@cancan101 It should be fairly clear how to overload from the internal docs and the way the class is structured. I'm happy to answer questions to that effect.

@cancan101
Contributor Author

@cpcloud The subclassing is easy (most of the work was done in this PR). The bulk of the remaining work is #4594.

@@ -469,6 +469,8 @@ class _LxmlFrameParser(_HtmlFrameParser):
     :class:`_HtmlFrameParser`.
     """
     def __init__(self, *args, **kwargs):
         self.strict = kwargs.pop('strict', True)
Contributor Author

@cpcloud Are you averse to my leaving this in? If you are then I can probably factor out HTMLParser(recover=False) to a new method like get_parser to make subclassing easier.

@cancan101 cancan101 mentioned this pull request Oct 31, 2013
@jtratner
Contributor

jtratner commented Nov 1, 2013

as discussed above, not merging this.

Successfully merging this pull request may close these issues.

Make _LxmlFrameParser more extensible
4 participants