Skip to content

ENH: Added lxml-liberal html parsing flavor #5131

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from
Closed
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 2 additions & 1 deletion doc/source/release.rst
Original file line number Diff line number Diff line change
Expand Up @@ -172,7 +172,8 @@ Improvements to existing features
- :meth:`~pandas.io.json.json_normalize` is a new method to allow you to create a flat table
from semi-structured JSON data. :ref:`See the docs<io.json_normalize>` (:issue:`1067`)
- ``DataFrame.from_records()`` will now accept generators (:issue:`4910`)

- Added ``lxml-liberal`` html parsing flavor (:issue:`5130`)

API Changes
~~~~~~~~~~~

Expand Down
48 changes: 46 additions & 2 deletions pandas/io/html.py
Original file line number Diff line number Diff line change
Expand Up @@ -469,6 +469,8 @@ class _LxmlFrameParser(_HtmlFrameParser):
:class:`_HtmlFrameParser`.
"""
def __init__(self, *args, **kwargs):
self.strict = kwargs.pop('strict', True)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cpcloud Are you averse to my leaving this in? If you are then I can probably factor out HTMLParser(recover=False) to a new method like get_parser to make subclassing easier.


super(_LxmlFrameParser, self).__init__(*args, **kwargs)

def _text_getter(self, obj):
Expand Down Expand Up @@ -519,7 +521,7 @@ def _build_doc(self):
from lxml.html import parse, fromstring, HTMLParser
from lxml.etree import XMLSyntaxError

parser = HTMLParser(recover=False)
parser = HTMLParser(recover=not self.strict)

try:
# try to parse the input in the simplest way
Expand Down Expand Up @@ -572,8 +574,49 @@ def _parse_raw_tfoot(self, table):
expr = './/tfoot//th'
return [_remove_whitespace(x.text_content()) for x in
table.xpath(expr)]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

even more minor point: can u add one more newline here bc i am a pep8 fanatic :) not really .. but might as well try to keep to it as much as possible


class _LiberalLxmlFrameParser(_LxmlFrameParser):
"""HTML to DataFrame parser that uses lxml under the hood.

Tries hard to parse through broken XML.

Warning
-------
This parser can only handle HTTP, FTP, and FILE urls.

See Also
--------
_LxmlFrameParser
_HtmlFrameParser
_BeautifulSoupLxmlFrameParser

Notes
-----
It lets libxml2 try its best to return a valid HTML tree
with all content it can manage to parse.
It will not raise an exception on parser errors.
You should use libxml2 version 2.6.21 or newer
to take advantage of this feature.

The support for parsing broken HTML depends entirely on libxml2's
recovery algorithm.
It is not the fault of lxml if you find documents that
are so heavily broken that the parser cannot handle them.
There is also no guarantee that the resulting tree will
contain all data from the original document.
The parser may have to drop seriously broken parts when
struggling to keep parsing.
Especially misplaced meta tags can suffer from this,
which may lead to encoding problems.

Documentation strings for this class are in the base class
:class:`_HtmlFrameParser`.
"""

def __init__(self, *args, **kwargs):
super(_LiberalLxmlFrameParser, self).__init__(*args, strict=False, **kwargs)

def _expand_elements(body):
lens = Series(lmap(len, body))
lens_max = lens.max()
Expand Down Expand Up @@ -611,7 +654,8 @@ def _data_to_frame(data, header, index_col, skiprows, infer_types,

_valid_parsers = {'lxml': _LxmlFrameParser, None: _LxmlFrameParser,
'html5lib': _BeautifulSoupHtml5LibFrameParser,
'bs4': _BeautifulSoupHtml5LibFrameParser}
'bs4': _BeautifulSoupHtml5LibFrameParser,
'lxml-liberal': _LiberalLxmlFrameParser,}


def _parser_dispatch(flavor):
Expand Down
7 changes: 7 additions & 0 deletions pandas/io/tests/test_html.py
Original file line number Diff line number Diff line change
Expand Up @@ -606,6 +606,13 @@ def test_data_fail(self):
with tm.assertRaises(XMLSyntaxError):
self.read_html(banklist_data, flavor=['lxml'])

def test_lxml_liberal(self):
banklist_data = os.path.join(DATA_PATH, 'banklist.html')

dfs = self.read_html(banklist_data, flavor=['lxml-liberal'])
for df in dfs:
tm.assert_isinstance(df, DataFrame)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a pretty minimal test case, don't you have expectations for results?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jtratner Well, TBI, the only guarantees that lxml makes with recover=True is that:

"There is also no guarantee that the resulting tree will
contain all data from the original document. "

and

"It will not raise an exception on parser errors. "

so I am not sure that I want to check anything other than no errors.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the problem is that using non-strict lxml will be missing a row ... i don't remember exactly which one but @cancan101 you could check for that ... a little annoying to figure out which one it is but shouldn't be too hard with this new functionality

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cpcloud What exactly should I put in the test? That the parse from bs4 != parse from lxml-liberal?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cpcloud bump

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can take a look later tonight

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cancan101

That the parse from bs4 != parse from lxml-liberal?

That's a possibility. I actually think because we don't know the specific reasons why lxml will remove things without delving deep into its parser, that just testing that a non-empty DataFrame is returned is okay.

So instead of just tm.assert_isinstance() do that plus add something like self.assertFalse(df.empty) and then add the bs4 comparison test and this should be okay

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cpcloud let me know if that looks okay

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cpcloud Anything else to get this merged?


def test_works_on_valid_markup(self):
filename = os.path.join(DATA_PATH, 'valid_markup.html')
dfs = self.read_html(filename, index_col=0, flavor=['lxml'])
Expand Down