-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
ENH: Added lxml-liberal html parsing flavor #5131
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -606,6 +606,33 @@ def test_data_fail(self): | |
with tm.assertRaises(XMLSyntaxError): | ||
self.read_html(banklist_data, flavor=['lxml']) | ||
|
||
def test_lxml_liberal(self): | ||
banklist_data = os.path.join(DATA_PATH, 'banklist.html') | ||
|
||
dfs = self.read_html(banklist_data, flavor=['lxml-liberal']) | ||
for df in dfs: | ||
tm.assert_isinstance(df, DataFrame) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this is a pretty minimal test case, don't you have expectations for results? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @jtratner Well, TBI, the only guarantees that lxml makes with "There is also no guarantee that the resulting tree will and "It will not raise an exception on parser errors. " so I am not sure that I want to check anything other than no errors. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. the problem is that using non-strict There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @cpcloud What exactly should I put in the test? That the parse from bs4 != parse from lxml-liberal? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @cpcloud bump There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I can take a look later tonight There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
That's a possibility. I actually think because we don't know the specific reasons why So instead of just There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @cpcloud let me know if that looks okay There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @cpcloud Anything else to get this merged? |
||
self.assertFalse(df.empty) | ||
|
||
@slow | ||
def test_lxml_liberal2(self): | ||
_skip_if_no('bs4') | ||
banklist_data = os.path.join(DATA_PATH, 'banklist.html') | ||
|
||
dfs_lxml = self.read_html(banklist_data, flavor=['lxml-liberal']) | ||
dfs_bs4 = self.read_html(banklist_data, flavor=['bs4']) | ||
|
||
if len(dfs_lxml) != len(dfs_bs4): | ||
return | ||
|
||
for df_lxml,df_bs4 in zip(dfs_lxml, dfs_bs4): | ||
try: | ||
tm.assert_frame_equal(df_lxml,df_bs4) | ||
except AssertionError: | ||
return | ||
|
||
self.fail() | ||
|
||
def test_works_on_valid_markup(self): | ||
filename = os.path.join(DATA_PATH, 'valid_markup.html') | ||
dfs = self.read_html(filename, index_col=0, flavor=['lxml']) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cpcloud Are you averse to my leaving this in? If you are then I can probably factor out
HTMLParser(recover=False)
to a new method likeget_parser
to make subclassing easier.