-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
ENH: Added lxml-liberal html parsing flavor #5131
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 1 commit
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -469,6 +469,8 @@ class _LxmlFrameParser(_HtmlFrameParser): | |
:class:`_HtmlFrameParser`. | ||
""" | ||
def __init__(self, *args, **kwargs): | ||
self.strict = kwargs.pop('strict', True) | ||
|
||
super(_LxmlFrameParser, self).__init__(*args, **kwargs) | ||
|
||
def _text_getter(self, obj): | ||
|
@@ -519,7 +521,7 @@ def _build_doc(self): | |
from lxml.html import parse, fromstring, HTMLParser | ||
from lxml.etree import XMLSyntaxError | ||
|
||
parser = HTMLParser(recover=False) | ||
parser = HTMLParser(recover=not self.strict) | ||
|
||
try: | ||
# try to parse the input in the simplest way | ||
|
@@ -572,8 +574,49 @@ def _parse_raw_tfoot(self, table): | |
expr = './/tfoot//th' | ||
return [_remove_whitespace(x.text_content()) for x in | ||
table.xpath(expr)] | ||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. even more minor point: can u add one more newline here bc i am a pep8 fanatic :) not really .. but might as well try to keep to it as much as possible |
||
|
||
class _LiberalLxmlFrameParser(_LxmlFrameParser): | ||
"""HTML to DataFrame parser that uses lxml under the hood. | ||
|
||
Tries hard to parse through broken XML. | ||
|
||
Warning | ||
------- | ||
This parser can only handle HTTP, FTP, and FILE urls. | ||
|
||
See Also | ||
-------- | ||
_LxmlFrameParser | ||
_HtmlFrameParser | ||
_BeautifulSoupLxmlFrameParser | ||
|
||
Notes | ||
----- | ||
It lets libxml2 try its best to return a valid HTML tree | ||
with all content it can manage to parse. | ||
It will not raise an exception on parser errors. | ||
You should use libxml2 version 2.6.21 or newer | ||
to take advantage of this feature. | ||
|
||
The support for parsing broken HTML depends entirely on libxml2's | ||
recovery algorithm. | ||
It is not the fault of lxml if you find documents that | ||
are so heavily broken that the parser cannot handle them. | ||
There is also no guarantee that the resulting tree will | ||
contain all data from the original document. | ||
The parser may have to drop seriously broken parts when | ||
struggling to keep parsing. | ||
Especially misplaced meta tags can suffer from this, | ||
which may lead to encoding problems. | ||
|
||
Documentation strings for this class are in the base class | ||
:class:`_HtmlFrameParser`. | ||
""" | ||
|
||
def __init__(self, *args, **kwargs): | ||
super(_LiberalLxmlFrameParser, self).__init__(*args, strict=False, **kwargs) | ||
|
||
def _expand_elements(body): | ||
lens = Series(lmap(len, body)) | ||
lens_max = lens.max() | ||
|
@@ -611,7 +654,8 @@ def _data_to_frame(data, header, index_col, skiprows, infer_types, | |
|
||
_valid_parsers = {'lxml': _LxmlFrameParser, None: _LxmlFrameParser, | ||
'html5lib': _BeautifulSoupHtml5LibFrameParser, | ||
'bs4': _BeautifulSoupHtml5LibFrameParser} | ||
'bs4': _BeautifulSoupHtml5LibFrameParser, | ||
'lxml-liberal': _LiberalLxmlFrameParser,} | ||
|
||
|
||
def _parser_dispatch(flavor): | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -606,6 +606,13 @@ def test_data_fail(self): | |
with tm.assertRaises(XMLSyntaxError): | ||
self.read_html(banklist_data, flavor=['lxml']) | ||
|
||
def test_lxml_liberal(self): | ||
banklist_data = os.path.join(DATA_PATH, 'banklist.html') | ||
|
||
dfs = self.read_html(banklist_data, flavor=['lxml-liberal']) | ||
for df in dfs: | ||
tm.assert_isinstance(df, DataFrame) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this is a pretty minimal test case, don't you have expectations for results? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @jtratner Well, TBI, the only guarantees that lxml makes with "There is also no guarantee that the resulting tree will and "It will not raise an exception on parser errors. " so I am not sure that I want to check anything other than no errors. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. the problem is that using non-strict There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @cpcloud What exactly should I put in the test? That the parse from bs4 != parse from lxml-liberal? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @cpcloud bump There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I can take a look later tonight There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
That's a possibility. I actually think because we don't know the specific reasons why So instead of just There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @cpcloud let me know if that looks okay There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @cpcloud Anything else to get this merged? |
||
|
||
def test_works_on_valid_markup(self): | ||
filename = os.path.join(DATA_PATH, 'valid_markup.html') | ||
dfs = self.read_html(filename, index_col=0, flavor=['lxml']) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cpcloud Are you averse to my leaving this in? If you are then I can probably factor out
HTMLParser(recover=False)
to a new method likeget_parser
to make subclassing easier.