Skip to content

io.html.read_html support XPath table identification #5389

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
phaebz opened this issue Oct 30, 2013 · 16 comments
Closed

io.html.read_html support XPath table identification #5389

phaebz opened this issue Oct 30, 2013 · 16 comments
Labels
Enhancement IO HTML read_html, to_html, Styler.apply, Styler.applymap

Comments

@phaebz
Copy link
Contributor

phaebz commented Oct 30, 2013

Feature request as in subject that came up on #5384. This would be for cases where table attrs alone don't cut it and manual lxml / bs4 could be avoided. What do you think?

@cpcloud
Copy link
Member

cpcloud commented Oct 30, 2013

I thought about this when I first wrote it and decided against it, mostly for time reasons and because exposing a user to XPath should be avoided, but I think this is probably a good idea for "power" users. @phaebz Want to take a crack at it? Happy to help you navigate the code, if you want.

@jtratner
Copy link
Contributor

You could add this as a separate kwarg (xpath=None), that if not None would
take precedence over match. I'm +1 on this too.

@phaebz
Copy link
Contributor Author

phaebz commented Oct 31, 2013

@cpcloud Sure, I will have a look into it and check back for navigation.

@jtratner Sounds good. Will do.

@cpcloud
Copy link
Member

cpcloud commented Oct 31, 2013

@phaebz BTW you can call read_html like this:

import pandas as pd
dfs = pd.read_html(match='Van Halen/Panama', attrs={'class': 'whatever'})

No need to type the fully qualified name since we import all IO functions into the top-level pandas namespace. 😄

@phaebz
Copy link
Contributor Author

phaebz commented Nov 1, 2013

@cpcloud Good to know plus I liked the Eddie just there :)

I had a look into the code and I will begin with adapting the _LxmlFrameParser. In there, I would need to change the _parse_tables method. I would then take the xpath string argument passed to read_html and essentially just do an if/else around the existing xpath_expr. What do you think? Do you have other ideas?

@phaebz
Copy link
Contributor Author

phaebz commented Nov 2, 2013

First attempt at implementation. I added the functionality to lxml parser only since AFAICS BeautifulSoup does not support XPath. It could be implemented for bs4 flavor with a fallback to http://lxml.de/elementsoup.html ? I also added some tests for good measure. Please review, I am looking forward to your feedback.

@jtratner
Copy link
Contributor

jtratner commented Nov 2, 2013

Fine to just add to lxml. Maybe the selenium / phantomjs parser would work
with this too.

@cpcloud
Copy link
Member

cpcloud commented Nov 10, 2013

pushing to 0.14

@ghost
Copy link

ghost commented Dec 30, 2013

I agree with @cpcloud's original reasoning. read_html's doesn't need to be this general.
read_html let's you pass in a match argument or choose a table from all those on the page
by inspecting/choosing from the returned list of tables. That covers the #5384 case as well afaict.

I think xpath support is going too far.

@cpcloud @jtratner, I vote to close. Do you feel strongly about this?

@cpcloud
Copy link
Member

cpcloud commented Dec 30, 2013

No strong feelings here.

@cancan101
Copy link
Contributor

Currently there is no context given with the list of tables, so having xpath would be helpful.

@jtratner
Copy link
Contributor

jtratner commented Jan 1, 2014

I'm not interested in working on it and am favor of simplicity, so I vote
close.

@ghost
Copy link

ghost commented Jan 1, 2014

@cancan101That's true in that making any feature more powerful is potentially helpful.
But balance, scope and focus should take priority in the context of a larger package.

I think the current API that @cpcloud put together is just expressive enough
without trying to do too much and in this specific case it actually handles the
motivating case out of the box without extension.

@phaebz, thanks for putting in the time doing this. I'm sorry we're going to
pass on merging this. We'd like to keep growing pandas with features that make
a significant difference to it's users without gradually morphing pandas into
something else. That sometimes means saying no to little changes that seem
innocent in themselves, it's the cumulative effect we're trying to keep tabs on.

@ghost ghost closed this as completed Jan 1, 2014
@jtratner
Copy link
Contributor

jtratner commented Jan 1, 2014

@cancan101 or @phaebz - you could also create a separate package that
implements some of the features you've described regarding HTML --> pandas.
I'm guessing there would be some others interested too.

@ghost
Copy link

ghost commented Jan 1, 2014

Absolutely. and no one can get it in your way. :)

@phaebz
Copy link
Contributor Author

phaebz commented Jan 18, 2014

@y-p @jtratner It was a nice exercise to work on anyways. I totally agree with you in terms of simplicity.

This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement IO HTML read_html, to_html, Styler.apply, Styler.applymap
Projects
None yet
Development

No branches or pull requests

4 participants