-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
ENH: Added lxml-liberal html parsing flavor #5131
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
||
Warning | ||
------- | ||
This parser can only handle HTTP, FTP, and FILE urls. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can put this in Notes
@cpcloud Fixed |
|
||
dfs = self.read_html(banklist_data, flavor=['lxml-liberal']) | ||
for df in dfs: | ||
tm.assert_isinstance(df, DataFrame) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a pretty minimal test case, don't you have expectations for results?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jtratner Well, TBI, the only guarantees that lxml makes with recover=True
is that:
"There is also no guarantee that the resulting tree will
contain all data from the original document. "
and
"It will not raise an exception on parser errors. "
so I am not sure that I want to check anything other than no errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the problem is that using non-strict lxml
will be missing a row ... i don't remember exactly which one but @cancan101 you could check for that ... a little annoying to figure out which one it is but shouldn't be too hard with this new functionality
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cpcloud What exactly should I put in the test? That the parse from bs4 != parse from lxml-liberal?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cpcloud bump
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can take a look later tonight
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That the parse from bs4 != parse from lxml-liberal?
That's a possibility. I actually think because we don't know the specific reasons why lxml
will remove things without delving deep into its parser, that just testing that a non-empty DataFrame
is returned is okay.
So instead of just tm.assert_isinstance()
do that plus add something like self.assertFalse(df.empty)
and then add the bs4
comparison test and this should be okay
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cpcloud let me know if that looks okay
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cpcloud Anything else to get this merged?
Again, I think you can handle the whole thing with a lambda and without the long docstring (just add a note about what strict means to |
It seems like a good idea for there to be a general way of passing setting to the underlying parser. |
What other settings are there that you'd need? I'd like to keep this free of settings and have it "just work" for most tables ... @cancan101 Your tables happen to be very difficult (which is great for testing), but for the majority of cases, tweaking the parser is too low-level for most users IMHO |
What about changing the |
@cpcloud As of now, there are no other settings that I can think of. |
I am not sure what you get by saving off:
in the constructor. It means that you have to remake the parser for each new You could instead change this:
to:
|
not sure i totally follow you .... why don't you put up some code and we can take a look |
The idea was to move around when data is passed to the parser. The parser would then be constructed with settings and then the parse specific data (for example |
It is not central to the point about |
oh i c...fine with me ... just can't break any existing tests! |
- Use partial for lxml-liberal rather than a new class
@cpcloud Let me know what you think of the API change. Tests should pass (without any changes needed). |
If this is good, i will squash and rebase. |
bump? |
@cpcloud Let me know if there is anything left to do here? Then I can squash and rebase. |
bump ? |
1 similar comment
bump ? |
Any way to get this in 0.13? |
@cpcloud ? |
@cancan101 Sorry for the long delay. Been crazy busy. After mulling this over I'm going to have to say I'm 👎 on accepting this. I think that a parser that gives unpredictable output is limited in its utility, mainly because of the nature of the errors that are seen. Allowing that unpredictability is only going to create headaches for maintainers. My reasoning is a bit hand-wavy because it boils down to how much I trust these libs. I trust For example, I actually don't think I really really hope this does not discourage you from contributing more. You could certainly take the parser cleanup you've done here and open a separate PR! |
One of my concerns with bs4 is that on certainly pages, linked to on either As for unpredictability of results, I'm not sure that should be the If this pr does not get accepted than at the very least I do want to make
|
Additionally, if we add something, we then have to support it for the foreseeable future and allowing something that has undefined behavior is a big compatibility headache. We don't want to be responsible for lxml-liberal's output not changing if we make other changes to the module or how we approach the parsers. |
Correct me if I'm wrong, but that unpredictability can be addressed in the docs. I know I have seen bs4 do unpredictable things (beyond the infinite loop, I can try to dig up an example). |
@cancan101 why don't u change this to allow flavor to be a class/callable and if it is then use that as the parser. then in user land u can subclass the current parse, pass it in, r and make it do what I want ? |
I actually think I have an open issue to do just that. That solution is
|
👍 on allowing subclasses, then you can do whatever you want as long as you implement the interface @cancan101 It should be fairly clear how to overload from the internal docs and the way the class is structured. I'm happy to answer questions to that effect. |
@@ -469,6 +469,8 @@ class _LxmlFrameParser(_HtmlFrameParser): | |||
:class:`_HtmlFrameParser`. | |||
""" | |||
def __init__(self, *args, **kwargs): | |||
self.strict = kwargs.pop('strict', True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cpcloud Are you averse to my leaving this in? If you are then I can probably factor out HTMLParser(recover=False)
to a new method like get_parser
to make subclassing easier.
as discussed above, not merging this. |
Closes #5130