ENH: Added lxml-liberal html parsing flavor #5131


Closed
wants to merge 3 commits into from

Conversation

cancan101
Contributor

Closes #5130


Warning
-------
This parser can only handle HTTP, FTP, and FILE urls.
Member

you can put this in Notes

@cancan101
Contributor Author

@cpcloud Fixed


    dfs = self.read_html(banklist_data, flavor=['lxml-liberal'])
    for df in dfs:
        tm.assert_isinstance(df, DataFrame)
Contributor

this is a pretty minimal test case, don't you have expectations for results?

Contributor Author

@jtratner Well, TBH, the only guarantees that lxml makes with recover=True are:

"There is also no guarantee that the resulting tree will contain all data from the original document."

and

"It will not raise an exception on parser errors."

so I am not sure that I want to check anything other than no errors.
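Those two guarantees can be seen in a standalone sketch (illustrative only, not pandas code): with recover=True, lxml keeps whatever it can salvage from broken markup instead of raising.

```python
# Illustration of lxml's recover=True behavior quoted above: broken
# markup does not raise; whatever survived recovery is in the tree.
from lxml.etree import HTMLParser, fromstring

broken = "<table><tr><td>cell</foo>"  # unclosed table, mismatched close tag
tree = fromstring(broken, parser=HTMLParser(recover=True))

# No exception was raised; the salvaged <td> is still reachable.
cells = [td.text for td in tree.findall(".//td")]
```

There is, however, no promise about *which* elements survive, which is exactly the point being debated here.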

Member

The problem is that using non-strict lxml will be missing a row ... I don't remember exactly which one, but @cancan101 you could check for that ... a little annoying to figure out which one it is, but it shouldn't be too hard with this new functionality.

Contributor Author

@cpcloud What exactly should I put in the test? That the parse from bs4 != parse from lxml-liberal?

Contributor Author

@cpcloud bump

Member

I can take a look later tonight

Member

@cancan101

That the parse from bs4 != parse from lxml-liberal?

That's a possibility. I actually think that, because we don't know the specific reasons why lxml will remove things without delving deep into its parser, just testing that a non-empty DataFrame is returned is okay.

So instead of just tm.assert_isinstance(), do that plus add something like self.assertFalse(df.empty), then add the bs4 comparison test, and this should be okay.
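The suggested strengthening of the test can be sketched as follows (a hypothetical standalone helper, not the actual test-class code, which uses tm.assert_isinstance and self.assertFalse):

```python
import pandas as pd

# Hypothetical helper mirroring the suggested assertions: each parsed
# table must be a DataFrame AND must be non-empty.
def check_tables(dfs):
    for df in dfs:
        assert isinstance(df, pd.DataFrame)  # type check
        assert not df.empty                  # non-empty check

# Example usage with a stand-in parse result:
check_tables([pd.DataFrame({"Bank Name": ["First Federal"]})])
```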

Contributor Author

@cpcloud let me know if that looks okay

Contributor Author

@cpcloud Anything else to get this merged?

@jtratner
Contributor

jtratner commented Oct 6, 2013

Again, I think you can handle the whole thing with a lambda and without the long docstring (just add a note about what strict means to _LxmlFrameParser). What's the interface for the parsers? Are they all 3 attributes + potentially strict as a keyword, or can they have more?

@cancan101
Contributor Author

It seems like a good idea for there to be a general way of passing settings to the underlying parser.

@cpcloud
Member

cpcloud commented Oct 6, 2013

What other settings are there that you'd need? I'd like to keep this free of settings and have it "just work" for most tables ... @cancan101 Your tables happen to be very difficult (which is great for testing), but for the majority of cases, tweaking the parser is too low-level for most users IMHO

@cancan101
Contributor Author

What about changing the parse_tables call to take match and attrs and leaving just io and any other parser settings on the init method?

@cancan101
Contributor Author

@cpcloud As of now, there are no other settings that I can think of.

@cancan101
Contributor Author

@jtratner @cpcloud Heck, you could even move io, match and attrs all to parse_tables and leave the __init__ for JUST the parser settings (i.e. zero arguments for most classes).

@cancan101
Contributor Author

I am not sure what you get by saving off:

    self.io = io
    self.match = match
    self.attrs = attrs

in the constructor. It means that you have to remake the parser for each new io.

You could instead change this:

    p = parser(io, compiled_match, attrs)

    try:
        tables = p.parse_tables()

to:

    p = parser()

    try:
        tables = p.parse_tables(io, compiled_match, attrs)
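The proposed split can be sketched in miniature (hypothetical class and argument names, not the actual pandas internals):

```python
# Hypothetical sketch of the proposed API: parser *settings* live on the
# instance; the *data* to parse is passed per call, so one parser
# instance can be reused across many inputs.
class FrameParser:
    def __init__(self, strict=True):
        self.strict = strict  # a parser setting, fixed at construction

    def parse_tables(self, io, match, attrs):
        # A real implementation would parse `io` and search for tables
        # matching `match`/`attrs`; here we just echo the inputs.
        return [(io, match, attrs)]

p = FrameParser(strict=False)                    # built once, settings only
t1 = p.parse_tables("a.html", "Bank", {})        # first input
t2 = p.parse_tables("b.html", "Bank", {})        # parser reused, no rebuild
```

This makes the setting/data distinction explicit: anything in `__init__` configures the parser; anything in `parse_tables` is the work item.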

@cpcloud
Member

cpcloud commented Oct 6, 2013

not sure i totally follow you .... why don't you put up some code and we can take a look

@cancan101
Contributor Author

The idea was to move around when data is passed to the parser. The parser would then be constructed with settings, and the parse-specific data (for example io) would be passed on the call to parse_tables. This would make it clearer what is a setting for the parser and what is the data the parser is applied to.

@cancan101
Contributor Author

It is not central to the point about strict, just some musings on how to clean up the parser API.

@cpcloud
Member

cpcloud commented Oct 6, 2013

oh i c...fine with me ... just can't break any existing tests!

- Use partial for lxml-liberal rather than a new class
@cancan101
Contributor Author

@cpcloud Let me know what you think of the API change. Tests should pass (without any changes needed).

@ghost ghost assigned cpcloud Oct 8, 2013
@cancan101
Contributor Author

If this is good, I will squash and rebase.

@cancan101
Contributor Author

bump?

@cancan101
Contributor Author

@cpcloud Let me know if there is anything left to do here? Then I can squash and rebase.

@cancan101
Contributor Author

bump ?


@cancan101
Contributor Author

Any way to get this in 0.13?

@jreback
Contributor

jreback commented Oct 30, 2013

@cpcloud ?

@cpcloud
Member

cpcloud commented Oct 30, 2013

@cancan101 Sorry for the long delay. Been crazy busy.

After mulling this over I'm going to have to say I'm 👎 on accepting this.

I think that a parser that gives unpredictable output is limited in its utility, mainly because of the nature of the errors that are seen. Allowing that unpredictability is only going to create headaches for maintainers.

My reasoning is a bit hand-wavy because it boils down to how much I trust these libs. I trust html5lib + bs4 more than I trust lxml, simply from my experience using them. Keep in mind that it is up to the implementer(s) of an HTML parser to decide what happens when a parse fails.

For example, lxml drops tags without warning (by default, no less) from the FDIC bank list data set, where html5lib + bs4 does what you would expect. lxml works fine on HTML such as that from DataFrame.to_html() (technically not valid HTML), and on valid HTML.

I actually don't think lxml is useful unless you're using it to parse XML, which in my experience is more likely to be correct (FWIW, I used to think lxml was the bee's knees until I had to spend some time figuring out why it wasn't giving me sane output from admittedly insane input). I would be in favor of removing lxml entirely, but for back compat we can't.

I really really hope this does not discourage you from contributing more. You could certainly take the parser cleanup you've done here and open a separate PR!

@cancan101
Contributor Author

One of my concerns with bs4 is that on certain pages, linked to on either this PR or another issue, it enters an infinite loop.

As for unpredictability of results, I'm not sure that should be the deciding factor. I can only imagine when and if a PDF parser is written, in that regard.

If this PR does not get accepted, then at the very least I do want to make sure that I can use this same code from "user land", because I do need it to parse pages that the other parsers can't handle.

@jtratner
Contributor

Additionally, if we add something, we then have to support it for the foreseeable future and allowing something that has undefined behavior is a big compatibility headache. We don't want to be responsible for lxml-liberal's output not changing if we make other changes to the module or how we approach the parsers.

@cancan101
Contributor Author

Correct me if I'm wrong, but that unpredictability can be addressed in the docs. I know I have seen bs4 do unpredictable things (beyond the infinite loop; I can try to dig up an example).

@jreback
Contributor

jreback commented Oct 30, 2013

@cancan101 why don't u change this to allow flavor to be a class/callable, and if it is, then use that as the parser. Then in user land u can subclass the current parser, pass it in, and make it do what you want?
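That dispatch could look roughly like this (a hypothetical sketch of the suggestion, with stand-in names; not the actual read_html internals):

```python
# Hypothetical sketch: `flavor` may be a built-in flavor string or a
# user-supplied parser class, which is used directly.
_VALID_PARSERS = {"lxml": "LxmlParser", "bs4": "Bs4Parser"}  # stand-ins

def resolve_parser(flavor):
    if callable(flavor):            # user-land parser class passed directly
        return flavor
    return _VALID_PARSERS[flavor]   # look up a built-in flavor by name

class MyLiberalParser:
    """A user-land subclass standing in for a custom lxml parser."""

builtin = resolve_parser("lxml")          # string -> registered parser
custom = resolve_parser(MyLiberalParser)  # class  -> used as-is
```

With this in place, "lxml-liberal" would not need to live in pandas at all; users who need it could keep the subclass in their own code.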

@cancan101
Contributor Author

I actually think I have an open issue to do just that. That solution is fine as long as the current lxml parser class is reasonably extensible.

@cpcloud
Member

cpcloud commented Oct 30, 2013

👍 on allowing subclasses; then you can do whatever you want as long as you implement the interface.

@cancan101 It should be fairly clear how to overload from the internal docs and the way the class is structured. I'm happy to answer questions to that effect.

@cancan101
Contributor Author

@cpcloud The subclassing is easy (most of the work was done in this PR). The bulk of the remaining work is #4594.

@@ -469,6 +469,8 @@ class _LxmlFrameParser(_HtmlFrameParser):
     :class:`_HtmlFrameParser`.
     """
     def __init__(self, *args, **kwargs):
         self.strict = kwargs.pop('strict', True)
Contributor Author

@cpcloud Are you averse to my leaving this in? If you are then I can probably factor out HTMLParser(recover=False) to a new method like get_parser to make subclassing easier.

@cancan101 cancan101 mentioned this pull request Oct 31, 2013
@jtratner
Contributor

jtratner commented Nov 1, 2013

as discussed above, not merging this.

Successfully merging this pull request may close these issues.

Make _LxmlFrameParser more extensible
4 participants