
io.html.read_html support XPath expressions for table selection #5416


Closed
phaebz wants to merge 6 commits

Conversation

@phaebz (Contributor) commented Nov 2, 2013

First tentative PR as discussed in #5389

@TomAugspurger (Contributor):

Could you hook up Travis? There are some instructions here.

Just holler if you run into any issues.

@phaebz (Contributor Author) commented Nov 2, 2013

I have a question concerning the first travis-ci build. The docs here say that the first build has to be triggered by a push to the repo. Is there any other way? A dummy commit and push just for CI testing would be ugly, and re-forking, patching, and re-opening the PR would be tedious...

@TomAugspurger (Contributor):

Can you just push this commit again? `git commit -C HEAD --amend` and then force push to this branch.

@phaebz (Contributor Author) commented Nov 2, 2013

Build is running, thanks!

@jtratner (Contributor) commented Nov 2, 2013

Just make sure you cover at least: an invalid XPath, a non-existent XPath, an XPath matching multiple tables, a valid XPath matching nothing, an XPath that accidentally references nodes within a table (which should probably fail), and an XPath referencing a table directly. Looks like you've covered some of these cases.
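
For illustration, that coverage might look roughly like the sketch below. This is not code from the PR: read_html_with_xpath and SAMPLE_HTML are placeholder names for whatever helper and fixture the real test module would use, and the choice of ValueError for the failure cases is an assumption.

import unittest

from lxml.etree import XPathEvalError  # guarded behind a skip in the real test module


class TestXPathTableSelection(unittest.TestCase):
    # Hypothetical sketch of the cases listed above.

    def test_invalid_xpath(self):
        # a syntactically broken expression should surface lxml's own error
        self.assertRaises(XPathEvalError, read_html_with_xpath,
                          SAMPLE_HTML, xpath='//table[')

    def test_xpath_matching_nothing(self):
        # a valid expression that selects no nodes at all
        self.assertRaises(ValueError, read_html_with_xpath,
                          SAMPLE_HTML, xpath='//table[@id="no-such-table"]')

    def test_xpath_matching_non_table_nodes(self):
        # selecting nodes inside a table should be rejected
        self.assertRaises(ValueError, read_html_with_xpath,
                          SAMPLE_HTML, xpath='//table//td')

    def test_xpath_matching_table_directly(self):
        # the happy path: the expression selects <table> elements only
        dfs = read_html_with_xpath(SAMPLE_HTML, xpath='//table[@id="spam"]')
        self.assertEqual(len(dfs), 1)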

@phaebz (Contributor Author) commented Nov 3, 2013

I covered the needed test cases. I thought about "XPath that accidentally references within a table" and generalized it to "the XPath does not return a node list containing only tables". Please review.

Also, the Travis build for Python 2.6 fails. Importing lxml.etree.XPathEvalError fails with

ImportError: No module named lxml.etree

I have never worked with lxml on Python 2.6.x, so how should this be solved?

Edit: direct link

@jtratner (Contributor) commented Nov 3, 2013

The 2.6 build doesn't have lxml; you need to guard your test cases to check whether lxml is installed (look at other test cases/modules to see how this is done).

@jtratner (Contributor) commented Nov 3, 2013

@phaebz so the contract for this is: "The XPath must return a set of table nodes and only table nodes"?

@phaebz (Contributor Author) commented Nov 3, 2013

@jtratner Yes, that was my idea.

Regarding lxml checking in tests, I will look into it.

@jtratner (Contributor) commented Nov 3, 2013

It's `_skip_if_no('lxml')`.

@phaebz (Contributor Author) commented Nov 3, 2013

In the setUpClass method of TestReadHtmlLxml it says exactly that. That is why I originally thought the lxml installed on the Travis Python 2.6 build had a different module organisation, i.e. no lxml.etree.
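
For reference, the guard being discussed looks roughly like this; a sketch only, where _skip_if_no is the existing helper in pandas' html test module rather than something defined here.

import unittest

class TestReadHtmlLxml(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # skip every test in this class when lxml is not importable
        _skip_if_no('lxml')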

@@ -18,6 +18,8 @@
from numpy.random import rand
from numpy.testing.decorators import slow

from lxml.etree import XPathEvalError
Review comment (Contributor):

this is the problem - need to guard this import.

Review reply (Contributor Author):

I did it with a

try:
    from lxml.etree import XPathEvalError
except ImportError:
    pass

thanks for pushing my nose into it.

Review reply (Contributor):

no problem - it can be confusing to figure out how everything should work with a big codebase like pandas.

@phaebz (Contributor Author) commented Nov 3, 2013

Finally, all builds pass. What do you think?

@jreback (Contributor) commented Nov 3, 2013

Don't you need to check whether _HAVE_LXML is False and xpath is not None (in the code, I mean) and raise some kind of ValueError?

@@ -31,6 +31,7 @@

try:
    import lxml
    from lxml.etree import XPathEvalError
Review comment (Contributor):

This import is not used

Review reply (Contributor Author):

Grr, this one sneaked in while working on the tests...

@cancan101 (Contributor):

You probably want to skip the xpath tests if import lxml fails

@phaebz (Contributor Author) commented Nov 3, 2013

@cancan101 See my post above, this should be taken care of in the setUpClass method of TestReadHtmlLxml.

@phaebz (Contributor Author) commented Nov 3, 2013

@jreback I am not sure I understand you correctly. In _parse I raise a NotImplementedError if xpath is not None and flavor is one of bs4 or html5lib - so as to indicate that someone could implement it for these flavors if needed. Is this too indirect?

@cancan101 (Contributor):

@phaebz He is saying that you never use or check _HAS_LXML

@cancan101 (Contributor):

And that if _HAS_LXML=False and the user has set an xpath, there should be an error to indicate that parsing will not succeed.

@jreback (Contributor) commented Nov 3, 2013

@phaebz what @cancan101 said!

@phaebz (Contributor Author) commented Nov 3, 2013

I put this check into the _parse function; is there a better place?
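
A sketch of roughly what that validation could look like inside _parse, combining the two errors discussed above. The surrounding argument names and the exact messages are assumptions, not the PR's actual diff.

# inside pandas.io.html._parse (sketch)
if xpath is not None:
    if not _HAS_LXML:
        # XPath selection needs lxml, so fail loudly rather than silently
        # falling back to a flavor that cannot honour the expression
        raise ValueError("xpath requires lxml to be installed")
    if flavor is not None and any(f in ('bs4', 'html5lib') for f in flavor):
        raise NotImplementedError("xpath is only implemented for the lxml "
                                  "flavor")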

@phaebz (Contributor Author) commented Nov 4, 2013

Just to be clear, in my previous post I was referring to the check you proposed (xpath given and _HAS_LXML=False), which I amended onto the last commit.

@cancan101 (Contributor):

@phaebz Now I see where _HAS_LXML is being checked, but where is _HAS_LXML being set?

@phaebz (Contributor Author) commented Nov 4, 2013

@cancan101 Isn't this done here?

@cancan101 (Contributor):

Yes, my bad.

@jreback (Contributor) commented Nov 5, 2013

@phaebz can you squash down to a single commit?

@cpcloud ok?


tables = doc.xpath(xpath_expr, namespaces=_re_namespace)
else:
Review comment (Member):

I think it would be nice to dump the below into a function...but not necessary for this PR.
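
Something along these lines is presumably what is meant; a sketch only, reusing the names visible in the diff above (doc, xpath_expr, _re_namespace) and enforcing the tables-only contract agreed on earlier. The helper name and error messages are illustrative.

def _tables_from_xpath(doc, xpath_expr):
    # hypothetical helper factored out of the lxml parser's table search
    tables = doc.xpath(xpath_expr, namespaces=_re_namespace)
    if not tables:
        raise ValueError("XPath expression %r matched no nodes" % xpath_expr)
    if any(getattr(el, 'tag', None) != 'table' for el in tables):
        raise ValueError("XPath expression must select only <table> elements")
    return tables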

@cpcloud (Member) commented Nov 7, 2013

@phaebz Needs release notes and an example or two. After that, good to go.

@jreback (Contributor) commented Nov 7, 2013

@phaebz ok...going to mark this for 0.13....

@phaebz (Contributor Author) commented Nov 8, 2013

@cpcloud Did the easy/fast bits (not yet pushed). Will have a look and finish on the weekend.
@jreback yesss

@phaebz (Contributor Author) commented Nov 9, 2013

One more thing: the fallback mechanism (lxml -> bs4) leads to a NotImplementedError when the HTML is too broken/invalid for the lxml parser. As long as xpath is not supported with the bs4 flavor, the xpath feature is pretty much useless given all the broken HTML out there (I noticed this issue while preparing the examples, incidentally with the URL I initially wanted to use the xpath on). That is, unless we use the recover=True argument when instantiating lxml.html.HTMLParser. What do you think? One side effect of this more robust lxml HTML parsing is that tests which assert that XMLSyntaxError has been raised (if called without fallback, i.e. flavor=['lxml']) will obviously fail.

@cpcloud Is there another important reason why you used the simple lxml parsing, i.e. recover=False, when you first wrote that part?
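
For anyone following along, the lenient parsing under discussion looks roughly like this minimal sketch; the sample HTML is made up.

from lxml.html import HTMLParser, fromstring

# recover=True asks libxml2 to salvage what it can from broken markup
# instead of failing on the first structural problem
broken_html = "<html><body><table><tr><td>1</td></tr></body>"  # unclosed <table>
doc = fromstring(broken_html, parser=HTMLParser(recover=True))
print(doc.xpath('//table'))  # XPath still works on the recovered tree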

@cancan101 (Contributor):

See #5131

@phaebz (Contributor Author) commented Nov 9, 2013

Bummer. It seems I will need to add the xpath feature after these changes...

One last thing that came to my mind: What do you think of using http://lxml.de/elementsoup.html in this context? This could then be used if a fallback is needed and also supports xpath expressions. OTOH, I think @cpcloud is against more lxml dependencies and usage.

@cancan101 (Contributor):

@phaebz What do you mean by "after these changes"? #5131 is a closed PR. It was replaced with #5395. I am not sure if that PR will be accepted for v13.

@phaebz (Contributor Author) commented Nov 9, 2013

I meant your API changes/cleanup. AFAICS, the xpath feature would then be added via a Flavor subclass that uses liberal lxml parsing.

@cancan101 (Contributor):

I don't think that has to be the case. I removed io as an argument to the constructor on this line and you added xpath: https://github.com/pydata/pandas/pull/5395/files#diff-a768e0e688020658ae7fb0dc23609b78L168

those are not mutually exclusive.

@phaebz (Contributor Author) commented Nov 9, 2013

My reasoning is: for now, the xpath feature only works with the lxml parser. That parser is used in recover=False mode and falls back to the bs4 flavor. So unless the lxml parser is somehow used in recover=True mode, there is no way to have the xpath feature for "wild" HTML. My conclusion is that a user who wants to use xpath expressions has to use a custom parser. Am I missing something?

@cancan101 (Contributor):

Yes, you are missing something. The lxml parser does NOT fall back to bs4. The default, which is the list of lxml and bs4, first tries lxml and then falls back to bs4. So if you are using xpath, just specify the flavor to be lxml only.

See: http://pandas.pydata.org/pandas-docs/dev/generated/pandas.read_html.html#pandas.read_html:

The default of None tries to use lxml to parse and if that fails it falls back on bs4 + html5lib.
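
In other words, the intended usage would have been something like the following. This is hypothetical, since the xpath keyword proposed in this PR was never merged; the URL and the expression are illustrative only.

import pandas as pd

# restrict the flavor to lxml so there is no bs4 fallback, which would not
# understand the xpath keyword proposed in this PR
dfs = pd.read_html('http://example.com/failed_banks.html',
                   flavor=['lxml'],
                   xpath='//table[@id="table"]')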

@phaebz (Contributor Author) commented Nov 9, 2013

OK, you have a point there; I was mixing up the terms flavor and parser in my hasty comment above.

More importantly: if xpath is given and flavor=['lxml'], then the parser still needs recover=True when dealing with malformed HTML - this was my point in #5416 (comment). Where should recover=True be specified? I was thinking it should be in your new Flavor subclass?

@cancan101 (Contributor):

The conclusion from #5131 is to make the lxml parser extensible but not to add a subclass of lxml parsing with recover=True to the pandas code base. Once PR #5395 is merged in, it should be very easy (~ 5 lines of code) to write your own lxml-liberal parser that you can use.

@phaebz (Contributor Author) commented Nov 9, 2013

Ok, thanks for clarification.

Once PR #5395 is merged in, it should be very easy (~ 5 lines of code) to write your own lxml-liberal parser that you can use.

This means that a user who wants the xpath feature must also specify his/her own parser?

@cancan101 (Contributor):

I don't see why that has to be the case. Assuming this PR goes through, all existing parsers will be passed an xpath. It's then up to the parser to support the xpath or not.

@phaebz (Contributor Author) commented Nov 9, 2013

Thanks for your patience, I think I get it now, but let's continue the discussion after #5395 has made it through.

So I am now working on examples and would like to see the rendered io.rst, but I ran into #5479.

@cpcloud (Member) commented Nov 10, 2013

@jreback @jtratner pushing to 0.14 since this PR depends on #5395.

@ghost commented Dec 30, 2013

I'm not convinced we should make read_html any more general; see #5389 (comment).

@ghost commented Jan 1, 2014

Closing as discussed in #5389.

@ghost ghost closed this Jan 1, 2014
This pull request was closed.