-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
io.html.read_html support XPath expressions for table selection #5416
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Could you hook up Travis? There's some instructions here Just holler if you run into any issues. |
I have a question concerning the first travis-ci build. In the docs here it says that the first build has to be done with a push to the repo. Is there any other way? A dummy commit+push just for ci testing would be ugly and reforking, patching, PRing tedious... |
Can you just push this commit again? |
Build is running, thanks! |
Just make sure you cover at least: invalid xpath, non-existent xpath, xpath |
I covered the needed test cases. I thought about "Xpath that accidentally references within a table" and generalized to "XPath does not return node list only containing tables". Please review. Also, the travis build for python 2.6 fails. Importing
I never worked with 2.6.x Python lxml, so how to solve this? Edit: direct link |
The 2.6 build doesn't have lxml, you need to guard your test cases to check if lxml is installed (look at other test cases /modules to see how this is done). |
@phaebz so the contract for this is: "The XPath must return a set of table nodes and only table nodes"? |
@jtratner Yes, that was my idea. Regarding lxml checking in tests, I will look into it. |
it's |
In the |
@@ -18,6 +18,8 @@ | |||
from numpy.random import rand | |||
from numpy.testing.decorators import slow | |||
|
|||
from lxml.etree import XPathEvalError |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is the problem - need to guard this import.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did it with a
try:
from lxml.etree import XPathEvalError
except ImportError:
pass
thanks for pushing my nose into it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no problem - it can be confusing to figure out how everything should work
with a big codebase like pandas.
Finally, all builds pass. What do you think? |
don't u need to check if _HAVE_LXML is False and xpath is not None? (in the code I mean) and raise some kind of ValueError? |
@@ -31,6 +31,7 @@ | |||
|
|||
try: | |||
import lxml | |||
from lxml.etree import XPathEvalError |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This import is not used
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Grr, this one sneaked in while working on the tests...
You probably want to skip the xpath tests if |
@cancan101 See my post above, this should be taken care of in the |
@jreback I am not sure I understand you correctly. In _parse I raise a NotImplementedError if xpath is not None and flavor is one of bs4 or html5lib - so as to indicate that someone could implement it for these flavors if needed. Is this too indirect? |
@phaebz He is saying that you never use or check |
And that if |
@phaebz what @cancan101 said! |
I put this check into the _parse function, is there a better place? |
Just to be clear, in my previous post I was referring to the check you proposed (xpath and |
@phaebz Now I see where |
@cancan101 Isn't this done here? |
Yes, my bad. |
|
||
tables = doc.xpath(xpath_expr, namespaces=_re_namespace) | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be nice to dump the below into a function...but not necessary for this PR.
@phaebz Needs release notes and an example or two. After that, good to go. |
@phaebz ok...going to mark this for 0.13.... |
One more thing: The fallback mechanism (lxml -> bs4) leads to a @cpcloud Is there another important reason why you used the simple lxml parsing, i.e. |
See #5131
|
Bummer. It seems I will need to add the xpath feature after these changes... One last thing that came to my mind: What do you think of using http://lxml.de/elementsoup.html in this context? This could then be used if a fallback is needed and also supports xpath expressions. OTOH, I think @cpcloud is against more lxml dependencies and usage. |
I meant your API changes/cleanup. AFAICS, the xpath feature would then be added via a |
I don't think that has to be the case. I removed those are not mutually exclusive. |
My reasoning is: For now, the xpath feature only works with the lxml parser. This parser is used in |
Yes you are missing something. The See: http://pandas.pydata.org/pandas-docs/dev/generated/pandas.read_html.html#pandas.read_html: The default of None tries to use lxml to parse and if that fails it falls back on bs4 + html5lib. |
Ok, you have a point there, I was mixing the terms flavor and parser in my hasty comment above. More importantly: If |
Ok, thanks for clarification.
This means that a user who wants the |
I dont see why that has to be the case. Assuming this pr goes through, all existing parsers will be passed an xpath. It's then then up to the parser to support the xpath or not. |
I'm not convinced we should make read_html any more general. |
Closing as discussed in #5389. |
First tentative PR as discussed in #5389