
io.html.read_html support XPath expressions for table selection #5416


Closed
phaebz wants to merge 6 commits

Conversation

@phaebz (Contributor) commented Nov 2, 2013

First tentative PR as discussed in #5389

@TomAugspurger (Contributor):

Could you hook up Travis? There are some instructions here.

Just holler if you run into any issues.

@phaebz (Contributor Author) commented Nov 2, 2013

I have a question concerning the first travis-ci build. The docs here say that the first build has to be triggered by a push to the repo. Is there any other way? A dummy commit and push just for CI testing would be ugly, and re-forking, patching, and re-opening the PR would be tedious...

@TomAugspurger (Contributor):

Can you just push this commit again? `git commit -C HEAD --amend` and then force push to this branch.

@phaebz (Contributor Author) commented Nov 2, 2013

Build is running, thanks!

@jtratner (Contributor) commented Nov 2, 2013

Just make sure you cover at least: an invalid XPath, a non-existent XPath, an XPath matching multiple tables, a valid XPath matching nothing, an XPath that accidentally references nodes within a table (which should probably fail), and an XPath referencing a table directly. Looks like you've covered some of these cases.
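
For illustration, that coverage might look roughly like the sketch below. This is not code from the PR: read_html_with_xpath and SAMPLE_HTML are placeholder names for whatever helper and fixture the real test module would use, and the choice of ValueError for the failure cases is an assumption.

import unittest

from lxml.etree import XPathEvalError  # guarded behind a skip in the real test module


class TestXPathTableSelection(unittest.TestCase):
    # Hypothetical sketch of the cases listed above.

    def test_invalid_xpath(self):
        # a syntactically broken expression should surface lxml's own error
        self.assertRaises(XPathEvalError, read_html_with_xpath,
                          SAMPLE_HTML, xpath='//table[')

    def test_xpath_matching_nothing(self):
        # a valid expression that selects no nodes at all
        self.assertRaises(ValueError, read_html_with_xpath,
                          SAMPLE_HTML, xpath='//table[@id="no-such-table"]')

    def test_xpath_matching_non_table_nodes(self):
        # selecting nodes inside a table should be rejected
        self.assertRaises(ValueError, read_html_with_xpath,
                          SAMPLE_HTML, xpath='//table//td')

    def test_xpath_matching_table_directly(self):
        # the happy path: the expression selects <table> elements only
        dfs = read_html_with_xpath(SAMPLE_HTML, xpath='//table[@id="spam"]')
        self.assertEqual(len(dfs), 1)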

@phaebz (Contributor Author) commented Nov 3, 2013

I covered the needed test cases. I thought about "XPath that accidentally references within a table" and generalized it to "the XPath does not return a node list containing only tables". Please review.

Also, the Travis build for Python 2.6 fails. Importing lxml.etree.XPathEvalError fails with

ImportError: No module named lxml.etree

I have never worked with lxml on Python 2.6.x, so how should this be solved?

Edit: direct link

@jtratner (Contributor) commented Nov 3, 2013

The 2.6 build doesn't have lxml; you need to guard your test cases to check whether lxml is installed (look at other test cases/modules to see how this is done).

@jtratner (Contributor) commented Nov 3, 2013

@phaebz so the contract for this is: "The XPath must return a set of table nodes and only table nodes"?

@phaebz (Contributor Author) commented Nov 3, 2013

@jtratner Yes, that was my idea.

Regarding lxml checking in tests, I will look into it.

@jtratner (Contributor) commented Nov 3, 2013

It's `_skip_if_no('lxml')`.

@phaebz (Contributor Author) commented Nov 3, 2013

In the setUpClass method of TestReadHtmlLxml it says exactly that. That is why I originally thought the lxml installed on the Travis Python 2.6 build had a different module organisation, i.e. no lxml.etree.
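
For reference, the guard being discussed looks roughly like this; a sketch only, where _skip_if_no is the existing helper in pandas' html test module rather than something defined here.

import unittest

class TestReadHtmlLxml(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # skip every test in this class when lxml is not importable
        _skip_if_no('lxml')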

@@ -18,6 +18,8 @@
from numpy.random import rand
from numpy.testing.decorators import slow

from lxml.etree import XPathEvalError
Review comment (Contributor):

this is the problem - need to guard this import.

Review reply (Contributor Author):

I did it with a

try:
    from lxml.etree import XPathEvalError
except ImportError:
    pass

thanks for pushing my nose into it.

Review reply (Contributor):

no problem - it can be confusing to figure out how everything should work with a big codebase like pandas.

@phaebz (Contributor Author) commented Nov 3, 2013

Finally, all builds pass. What do you think?

@jreback (Contributor) commented Nov 3, 2013

Don't you need to check whether _HAVE_LXML is False and xpath is not None (in the code, I mean) and raise some kind of ValueError?

@@ -31,6 +31,7 @@

try:
    import lxml
    from lxml.etree import XPathEvalError
Review comment (Contributor):

This import is not used

Review reply (Contributor Author):

Grr, this one sneaked in while working on the tests...

@cancan101 (Contributor):

You probably want to skip the xpath tests if import lxml fails

@phaebz (Contributor Author) commented Nov 3, 2013

@cancan101 See my post above, this should be taken care of in the setUpClass method of TestReadHtmlLxml.

@phaebz (Contributor Author) commented Nov 3, 2013

@jreback I am not sure I understand you correctly. In _parse I raise a NotImplementedError if xpath is not None and flavor is one of bs4 or html5lib - so as to indicate that someone could implement it for these flavors if needed. Is this too indirect?

@cancan101 (Contributor):

@phaebz He is saying that you never use or check _HAS_LXML

@cancan101 (Contributor):

And that if _HAS_LXML=False and the user has set an xpath, there should be an error to indicate that parsing will not succeed.

@jreback (Contributor) commented Nov 3, 2013

@phaebz what @cancan101 said!

@phaebz (Contributor Author) commented Nov 3, 2013

I put this check into the _parse function; is there a better place?
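
A sketch of roughly what that validation could look like inside _parse, combining the two errors discussed above. The surrounding argument names and the exact messages are assumptions, not the PR's actual diff.

# inside pandas.io.html._parse (sketch)
if xpath is not None:
    if not _HAS_LXML:
        # XPath selection needs lxml, so fail loudly rather than silently
        # falling back to a flavor that cannot honour the expression
        raise ValueError("xpath requires lxml to be installed")
    if flavor is not None and any(f in ('bs4', 'html5lib') for f in flavor):
        raise NotImplementedError("xpath is only implemented for the lxml "
                                  "flavor")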

@phaebz (Contributor Author) commented Nov 4, 2013

Just to be clear, in my previous post I was referring to the check you proposed (xpath given and _HAS_LXML=False), which I amended onto the last commit.

@cancan101 (Contributor):

@phaebz Now I see where _HAS_LXML is being checked, but where is _HAS_LXML being set?

@phaebz (Contributor Author) commented Nov 4, 2013

@cancan101 Isn't this done here?

@cancan101 (Contributor):

Yes, my bad.

@jreback (Contributor) commented Nov 5, 2013

@phaebz can you squash down to a single commit?

@cpcloud ok?


tables = doc.xpath(xpath_expr, namespaces=_re_namespace)
else:
Review comment (Member):

I think it would be nice to dump the below into a function...but not necessary for this PR.
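
Something along these lines is presumably what is meant; a sketch only, reusing the names visible in the diff above (doc, xpath_expr, _re_namespace) and enforcing the tables-only contract agreed on earlier. The helper name and error messages are illustrative.

def _tables_from_xpath(doc, xpath_expr):
    # hypothetical helper factored out of the lxml parser's table search
    tables = doc.xpath(xpath_expr, namespaces=_re_namespace)
    if not tables:
        raise ValueError("XPath expression %r matched no nodes" % xpath_expr)
    if any(getattr(el, 'tag', None) != 'table' for el in tables):
        raise ValueError("XPath expression must select only <table> elements")
    return tables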

@cpcloud (Member) commented Nov 7, 2013

@phaebz Needs release notes and an example or two. After that, good to go.

@jreback (Contributor) commented Nov 7, 2013

@phaebz ok...going to mark this for 0.13....

@phaebz (Contributor Author) commented Nov 8, 2013

@cpcloud Did the easy/fast bits (not yet pushed). Will have a look and finish on the weekend.
@jreback yesss

@phaebz (Contributor Author) commented Nov 9, 2013

One more thing: the fallback mechanism (lxml -> bs4) leads to a NotImplementedError when the HTML is too broken/invalid for the lxml parser. As long as xpath is not supported with the bs4 flavor, the xpath feature is pretty much useless given all the broken HTML out there (I noticed this issue while preparing the examples, incidentally with the URL I initially wanted to use the xpath on). That is, unless we use the recover=True argument when instantiating lxml.html.HTMLParser. What do you think? One side effect of this more robust lxml HTML parsing is that tests which assert that XMLSyntaxError has been raised (if called without fallback, i.e. flavor=['lxml']) will obviously fail.

@cpcloud Is there another important reason why you used the simple lxml parsing, i.e. recover=False, when you first wrote that part?
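
For anyone following along, the lenient parsing under discussion looks roughly like this minimal sketch; the sample HTML is made up.

from lxml.html import HTMLParser, fromstring

# recover=True asks libxml2 to salvage what it can from broken markup
# instead of failing on the first structural problem
broken_html = "<html><body><table><tr><td>1</td></tr></body>"  # unclosed <table>
doc = fromstring(broken_html, parser=HTMLParser(recover=True))
print(doc.xpath('//table'))  # XPath still works on the recovered tree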

@cancan101 (Contributor):

See #5131

@phaebz (Contributor Author) commented Nov 9, 2013

Bummer. It seems I will need to add the xpath feature after these changes...

One last thing that came to my mind: What do you think of using http://lxml.de/elementsoup.html in this context? This could then be used if a fallback is needed and also supports xpath expressions. OTOH, I think @cpcloud is against more lxml dependencies and usage.

@cancan101 (Contributor):

@phaebz What do you mean by "after these changes"? #5131 is a closed PR. It was replaced with #5395. I am not sure if that PR will be accepted for v13.

@phaebz (Contributor Author) commented Nov 9, 2013

I meant your API changes/cleanup. AFAICS, the xpath feature would then be added via a Flavor subclass that uses liberal lxml parsing.

@cancan101 (Contributor):

I don't think that has to be the case. I removed io as an argument to the constructor on this line and you added xpath: https://github.com/pydata/pandas/pull/5395/files#diff-a768e0e688020658ae7fb0dc23609b78L168

those are not mutually exclusive.

@phaebz (Contributor Author) commented Nov 9, 2013

My reasoning is: for now, the xpath feature only works with the lxml parser. That parser is used in recover=False mode and falls back to the bs4 flavor. So unless the lxml parser is somehow used in recover=True mode, there is no way to have the xpath feature for "wild" HTML. My conclusion is that a user who wants to use xpath expressions has to use a custom parser. Am I missing something?

@cancan101 (Contributor):

Yes, you are missing something. The lxml parser does NOT fall back to bs4. The default, which is the list of lxml and bs4, first tries lxml and then falls back to bs4. So if you are using xpath, just specify the flavor to be lxml only.

See: http://pandas.pydata.org/pandas-docs/dev/generated/pandas.read_html.html#pandas.read_html:

The default of None tries to use lxml to parse and if that fails it falls back on bs4 + html5lib.
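
In other words, the intended usage would have been something like the following. This is hypothetical, since the xpath keyword proposed in this PR was never merged; the URL and the expression are illustrative only.

import pandas as pd

# restrict the flavor to lxml so there is no bs4 fallback, which would not
# understand the xpath keyword proposed in this PR
dfs = pd.read_html('http://example.com/failed_banks.html',
                   flavor=['lxml'],
                   xpath='//table[@id="table"]')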

@phaebz (Contributor Author) commented Nov 9, 2013

OK, you have a point there; I was mixing up the terms flavor and parser in my hasty comment above.

More importantly: if xpath is given and flavor=['lxml'], then the parser still needs recover=True when dealing with malformed HTML - this was my point in #5416 (comment). Where should recover=True be specified? I was thinking it should be in your new Flavor subclass?

@cancan101 (Contributor):

The conclusion from #5131 is to make the lxml parser extensible but not to add a subclass of lxml parsing with recover=True to the pandas code base. Once PR #5395 is merged in, it should be very easy (~ 5 lines of code) to write your own lxml-liberal parser that you can use.

@phaebz (Contributor Author) commented Nov 9, 2013

Ok, thanks for clarification.

Once PR #5395 is merged in, it should be very easy (~ 5 lines of code) to write your own lxml-liberal parser that you can use.

This means that a user who wants the xpath feature must also specify his/her own parser?

@cancan101 (Contributor):

I don't see why that has to be the case. Assuming this PR goes through, all existing parsers will be passed an xpath. It's then up to the parser to support the xpath or not.

@phaebz (Contributor Author) commented Nov 9, 2013

Thanks for your patience, I think I get it now, but let's continue the discussion after #5395 has made it through.

So I am now working on examples and would like to see the rendered io.rst, but I ran into #5479.

@cpcloud (Member) commented Nov 10, 2013

@jreback @jtratner pushing to 0.14 since this PR depends on #5395.

@ghost commented Dec 30, 2013

I'm not convinced we should make read_html any more general; see #5389 (comment).

@ghost commented Jan 1, 2014

Closing as discussed in #5389.

@ghost ghost closed this Jan 1, 2014
This pull request was closed.