Skip to content

Make _LxmlFrameParser more extensible #5130

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
cancan101 opened this issue Oct 6, 2013 · 19 comments
Closed

Make _LxmlFrameParser more extensible #5130

cancan101 opened this issue Oct 6, 2013 · 19 comments
Labels
API Design IO Data IO issues that don't fit into a more specific label IO HTML read_html, to_html, Styler.apply, Styler.applymap
Milestone

Comments

@cancan101
Copy link
Contributor

Allow creating a _LiberalLxmlFrameParser (see conversation on: #4469 (comment))

For now I will factor out:

parser = HTMLParser(recover=False)

so that subclass can change the behavior.

Would there be an issue with adding _LiberalLxmlFrameParser to the Pandas codebase?

@cpcloud
Copy link
Member

cpcloud commented Oct 6, 2013

no issues... would be happy to have your contrib... i think you could have a parameter that controls this like strict=True

@cpcloud
Copy link
Member

cpcloud commented Oct 6, 2013

marking for 0.14

@cpcloud
Copy link
Member

cpcloud commented Oct 6, 2013

@cancan101 FYI i'm going to assign myself to this issue to follow it so that it doens't get lost in the mix

@ghost ghost assigned cpcloud Oct 6, 2013
@cpcloud
Copy link
Member

cpcloud commented Oct 6, 2013

not sure if that's visible to non repo collabs ... just letting you know

@cancan101
Copy link
Contributor Author

@cpcloud That's fine. Yes it does show up that it was assigned to you.

@cancan101
Copy link
Contributor Author

@cpcloud That reason to make another class (as opposed to just a parameter) is the valid parsers are specified as classes:

_valid_parsers = {'lxml': _LxmlFrameParser, None: _LxmlFrameParser,
                  'html5lib': _BeautifulSoupHtml5LibFrameParser,
                  'bs4': _BeautifulSoupHtml5LibFrameParser}

I suppose I could use a closure / partial (http://docs.python.org/2/library/functools.html#functools.partial) but that seems a bit messy.

@cpcloud
Copy link
Member

cpcloud commented Oct 6, 2013

class is fine ... i was thinking about the user facing API

@cancan101
Copy link
Contributor Author

ah. Actually I could create a method with signature func(io, match, attrs):

def _LiberalLxmlFrameParser(io, match, attrs):
   return _LxmlFrameParser(io, match, attrs, strict=False)

and then:

_valid_parsers = {'lxml': _LxmlFrameParser, None: _LxmlFrameParser,
                  'html5lib': _BeautifulSoupHtml5LibFrameParser,
                  'bs4': _BeautifulSoupHtml5LibFrameParser,
                  'lxml-liberal': _LiberalLxmlFrameParser,}

@cpcloud
Copy link
Member

cpcloud commented Oct 6, 2013

i would prefer if you didn't mix my class naming convention with function names

@jtratner
Copy link
Contributor

jtratner commented Oct 6, 2013

Much simpler: 'lxml-liberal': lambda io, match, attrs: _LxmlFrameParser(io, match, attrs, strict=False)

@cancan101
Copy link
Contributor Author

@jtratner I had initially justed used a method rather than a class (albeit it was not anonymous). That being said, having the class allows me to put notes / docs. See: cancan101@91e7e5c#diff-a768e0e688020658ae7fb0dc23609b78L576

@jtratner
Copy link
Contributor

jtratner commented Oct 6, 2013

These are all internal classes, nobody will see the docstring or documentation. Moreover, you can really just summarize it as:

lxml-liberal - uses lxml parser but allows errors to pass silently and then returns what it can from the parsed tables that lxml is able to find.

Also, I just learned you can set kewyord argument defaults on lambdas, so can do:

'lxml-liberal': lambda io, match, attrs, strict=False, **kwargs: _LxmlFrameParser(io, match, attrs, strict=strict, **kwargs)

@jtratner
Copy link
Contributor

jtratner commented Oct 6, 2013

heck, you could even do this:

'lxml-liberal': partial(_LxmlFrameParser, strict=True)

@cpcloud
Copy link
Member

cpcloud commented Oct 6, 2013

i think @cancan101 was going to do that originally...

@cpcloud
Copy link
Member

cpcloud commented Oct 6, 2013

that lambda expression is a bit large ... :)

@cpcloud
Copy link
Member

cpcloud commented Oct 6, 2013

that technique is useful for passing closures around too

@cancan101
Copy link
Contributor Author

Yep! #5130 (comment)

@cancan101
Copy link
Contributor Author

Ok how about the partial approach and then a note under flavor? Should I copy what I have now for notes or should I strip it down? Most of what I have came directly from lxml docs.

@cpcloud
Copy link
Member

cpcloud commented Jan 29, 2014

Closing ... getting a bit stale ... and again should be for maybe a more in-depth html-pandas-parsing library.

@cpcloud cpcloud closed this as completed Jan 29, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design IO Data IO issues that don't fit into a more specific label IO HTML read_html, to_html, Styler.apply, Styler.applymap
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants