io.html.read_html support XPath table identification #5389

phaebz · 2013-10-30T14:22:55Z

Feature request as in subject that came up on #5384. This would be for cases where table attrs alone don't cut it and manual lxml / bs4 could be avoided. What do you think?

cpcloud · 2013-10-30T17:06:51Z

I thought about this when I first wrote it and decided against it, mostly for time reasons and because exposing a user to XPath should be avoided, but I think this is probably a good idea for "power" users. @phaebz Want to take a crack at it? Happy to help you navigate the code, if you want.

jtratner · 2013-10-30T21:06:34Z

You could add this as a separate kwarg (xpath=None), that if not None would
take precedence over match. I'm +1 on this too.

phaebz · 2013-10-31T10:47:10Z

@cpcloud Sure, I will have a look into it and check back for navigation.

@jtratner Sounds good. Will do.

cpcloud · 2013-10-31T11:14:03Z

@phaebz BTW you can call read_html like this:

import pandas as pd
dfs = pd.read_html(match='Van Halen/Panama', attrs={'class': 'whatever'})

No need to type the fully qualified name since we import all IO functions into the top-level pandas namespace. 😄

phaebz · 2013-11-01T15:38:40Z

@cpcloud Good to know plus I liked the Eddie just there :)

I had a look into the code and I will begin with adapting the _LxmlFrameParser. In there, I would need to change the _parse_tables method. I would then take the xpath string argument passed to read_html and essentially just do an if/else around the existing xpath_expr. What do you think? Do you have other ideas?

phaebz · 2013-11-02T19:19:12Z

First attempt at implementation. I added the functionality to lxml parser only since AFAICS BeautifulSoup does not support XPath. It could be implemented for bs4 flavor with a fallback to http://lxml.de/elementsoup.html ? I also added some tests for good measure. Please review, I am looking forward to your feedback.

jtratner · 2013-11-02T21:27:10Z

Fine to just add to lxml. Maybe the selenium / phantomjs parser would work
with this too.

cpcloud · 2013-11-10T19:18:42Z

pushing to 0.14

ghost · 2013-12-30T20:51:44Z

I agree with @cpcloud's original reasoning. read_html's doesn't need to be this general.
read_html let's you pass in a match argument or choose a table from all those on the page
by inspecting/choosing from the returned list of tables. That covers the #5384 case as well afaict.

I think xpath support is going too far.

@cpcloud @jtratner, I vote to close. Do you feel strongly about this?

cpcloud · 2013-12-30T21:13:53Z

No strong feelings here.

cancan101 · 2014-01-01T01:35:46Z

Currently there is no context given with the list of tables, so having xpath would be helpful.

jtratner · 2014-01-01T01:56:59Z

I'm not interested in working on it and am favor of simplicity, so I vote
close.

ghost · 2014-01-01T02:38:25Z

@cancan101That's true in that making any feature more powerful is potentially helpful.
But balance, scope and focus should take priority in the context of a larger package.

I think the current API that @cpcloud put together is just expressive enough
without trying to do too much and in this specific case it actually handles the
motivating case out of the box without extension.

@phaebz, thanks for putting in the time doing this. I'm sorry we're going to
pass on merging this. We'd like to keep growing pandas with features that make
a significant difference to it's users without gradually morphing pandas into
something else. That sometimes means saying no to little changes that seem
innocent in themselves, it's the cumulative effect we're trying to keep tabs on.

jtratner · 2014-01-01T03:38:44Z

@cancan101 or @phaebz - you could also create a separate package that
implements some of the features you've described regarding HTML --> pandas.
I'm guessing there would be some others interested too.

ghost · 2014-01-01T03:48:21Z

Absolutely. and no one can get it in your way. :)

phaebz · 2014-01-18T16:28:41Z

@y-p @jtratner It was a nice exercise to work on anyways. I totally agree with you in terms of simplicity.

phaebz mentioned this issue Oct 30, 2013

io.html.read_html returning table "twice" #5384

Closed

phaebz mentioned this issue Nov 2, 2013

io.html.read_html support XPath expressions for table selection #5416

Closed

ghost closed this as completed Jan 1, 2014

This issue was closed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

io.html.read_html support XPath table identification #5389

io.html.read_html support XPath table identification #5389

phaebz commented Oct 30, 2013

cpcloud commented Oct 30, 2013

jtratner commented Oct 30, 2013

phaebz commented Oct 31, 2013

cpcloud commented Oct 31, 2013

phaebz commented Nov 1, 2013

phaebz commented Nov 2, 2013

jtratner commented Nov 2, 2013

cpcloud commented Nov 10, 2013

ghost commented Dec 30, 2013

cpcloud commented Dec 30, 2013

cancan101 commented Jan 1, 2014

jtratner commented Jan 1, 2014

ghost commented Jan 1, 2014

jtratner commented Jan 1, 2014

ghost commented Jan 1, 2014

phaebz commented Jan 18, 2014

io.html.read_html support XPath table identification #5389

io.html.read_html support XPath table identification #5389

Comments

phaebz commented Oct 30, 2013

cpcloud commented Oct 30, 2013

jtratner commented Oct 30, 2013

phaebz commented Oct 31, 2013

cpcloud commented Oct 31, 2013

phaebz commented Nov 1, 2013

phaebz commented Nov 2, 2013

jtratner commented Nov 2, 2013

cpcloud commented Nov 10, 2013

ghost commented Dec 30, 2013

cpcloud commented Dec 30, 2013

cancan101 commented Jan 1, 2014

jtratner commented Jan 1, 2014

ghost commented Jan 1, 2014

jtratner commented Jan 1, 2014

ghost commented Jan 1, 2014

phaebz commented Jan 18, 2014