DOC: document bs4/lxml/html5lib issues #3751

cpcloud · 2013-06-03T22:24:22Z

to summarize this mess of deps and check my thoughts here

if a user insists on using `lxml` (either with or without `bs4`)

warning about its inability to deal with the modern web
warning saying that the user should install html5lib and bs4 so that a page will parse even if lxml barfs
test coverage for failing and passing pages (things that would parse "correctly" before will now fail since the parser will be extremely strict), so only pages that validate against the DTD will even try to parse

(to be fair I was really enthusiastic about lxml because of how fast it is but now i'm sort of against it)

what users should really do

install bs4
install html5lib
happily parse things into DataFrames with a low amount of stress

anaconda + `lxml` (no `bs4`)

no problems (modulo the above warnings)

@wesm maybe you could chime in about what (if anything) you did to libxml2/libxslt i wasn't clear on the details from the mailing list.

anaconda + `bs4` + `lxml`

make sure that you're using bs4==4.2.1
make sure that you're using lxml==3.2.1
work out the details of how to do this with conda (i did this already, but it was 2 or 3 AM so I'm a little foggy on the details)
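
A minimal sketch of how the two pins above could be checked from Python, rather than by eye. The helper names (`installed_version`, `check_pins`) are made up for illustration, and it assumes the PyPI distribution name `beautifulsoup4` for bs4; it does not cover the conda side at all.

```python
from importlib import metadata

# Pins taken from the list above. The distribution name for bs4 on PyPI is
# assumed to be "beautifulsoup4" ("bs4" is only the import name).
PINS = {"beautifulsoup4": "4.2.1", "lxml": "3.2.1"}


def installed_version(dist_name):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def check_pins(pins, lookup=installed_version):
    """Return {name: (wanted, found)} for every distribution that is off-pin."""
    mismatches = {}
    for name, wanted in pins.items():
        found = lookup(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in check_pins(PINS).items():
        print(f"{name}: want {wanted}, found {found}")
```

The `lookup` argument is only there so the check can be exercised without touching the real environment.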

anaconda + `bs4` + `html5lib` (no `lxml`)

happy parsing of HTML tables

this will be in a gotcha that will be linked to from a warning at the top of the read html section of io.rst

jreback · 2013-06-04T00:48:50Z

this looks good...are you going to add something about the valid installations (and invalid ones that we know about)? (or is that already in install?)

cpcloud · 2013-06-04T00:52:02Z

Yeah I just wanted to put it up to remind myself. Should be done by
tomorrow.

cpcloud · 2013-06-04T04:56:50Z

I'll put a link to the read HTML gotchas in the install and a mention of the anaconda issues, so that I can give a bit more detail in the gotcha.

jreback · 2013-06-04T15:02:46Z

just let me know when you want to merge this

cpcloud · 2013-06-04T16:59:13Z

this warning is going to be gigantic. i hope this doesn't discourage people from trying this :(

jreback · 2013-06-04T17:01:20Z

hahah

jreback · 2013-06-04T17:01:36Z

they would try it w/o the warning then go WTF!

cpcloud · 2013-06-04T17:02:35Z

jreback · 2013-06-04T17:10:12Z

I like it! +1

also update README.rst (unfort the warnings don't look quite as good there)

cpcloud · 2013-06-04T17:12:40Z

ah yes. thanks.

cpcloud · 2013-06-04T17:15:44Z

now to write a tome about anaconda...

list install proc for apt-get enabled systems

cpcloud · 2013-06-04T17:42:55Z

should probably wait 2 merge this until i've actually implemented the mentioned fallback to html5lib if lxml fails to parse

jreback · 2013-06-04T17:47:34Z

html parsing is amazingly complicated (because the standard is not followed)

jreback · 2013-06-04T17:52:01Z

web sites should be REQUIRED to post csv of their table data!

cpcloud · 2013-06-04T17:52:47Z

when i first started doing this i was like: "i'm going to implement this in CYTHON". um yeah way too optimistic.

cpcloud · 2013-06-04T17:54:33Z

the majority of tables are not semantic, they are used for formatting. UGH

cpcloud · 2013-06-04T17:57:12Z

what i really need now is to operate my 4 different vagrant boxes thru tmux all at the same time so i can test this beast without repeating myself all the time!

jreback · 2013-06-04T18:01:27Z

you ARE a maniac!

cpcloud · 2013-06-04T18:07:26Z

i should have the fallback by tonight...or some time early in the am

cpcloud · 2013-06-04T18:07:42Z

shall i squash here?

jreback · 2013-06-04T18:19:45Z

yes....I almost always squash down to a few commits

jreback · 2013-06-04T19:11:08Z

if you want I can merge this within say 1/2 hour...that way it will be in today's docs.....up 2 u

cpcloud · 2013-06-04T19:26:50Z

when is the doc build? i'm about to participate in an eye movement experiment that will take 1/2 hour so i can't have these changes for another 45 min

jreback · 2013-06-04T19:30:10Z

make sure your eyes are open!

i think sometime between 4-5 pm (changes show up after 5 est)

it's no biggie.....can do whenever...when i make a big doc change i like to see it on the main site to read over again.....

DOC: change formatting
DOC: more formatting
DOC: add bold substitutions
DOC: fill out bold links and rephrase
DOC: fill out link to gotchas
DOC: add gigantic install.rst warning
DOC: move version note about html5lib to first mention of it
DOC: add same to readme and add boto to install.rst
DOC: add anaconda note
DOC: add note about debian based system installation
DOC: add correct lexer for pygments formatting of code snippets
DOC: move boto up

cpcloud · 2013-06-04T20:01:30Z

it's there if u want to merge

DOC: document bs4/lxml/html5lib issues

jreback · 2013-06-04T20:02:58Z

thanks!

cpcloud · 2013-06-04T21:07:41Z

@jreback what do you think about removing bs4 + lxml support and have either lxml alone, or bs4 + html5lib. the point being why go through bs4 if you don't have to, if lxml can parse the page it should do it in the fastest way possible and going thru bs4 makes no sense. too much redundancy

cpcloud · 2013-06-04T21:10:12Z

more doc changes, but i don't mind making them shorter, and a whole lot of extra code surrounding imports and other cruft in html.py can be removed if i do this

jreback · 2013-06-04T21:10:28Z

@cpcloud docs are updated (small issue on Read HTML with attributes ?)

jreback · 2013-06-04T21:11:25Z

I thought you do lxml, then fall back to bs4 + html5lib only, right?

cpcloud · 2013-06-04T21:12:14Z

oh ok, yeah duh...

cpcloud · 2013-06-04T21:16:53Z

should there even be a choice then?

jreback · 2013-06-04T21:22:36Z

I would make your flavor arg control this

e.g. flavor={ lxml, bs_html5lib }, so user could specify the parser/backend as one, simpler that way I guess

then if no flavor is specified (e.g. make the default flavor=None) then you can do lxml, then bs_html5lib
(always subject to if these flavors are installed)
(and if they specify a flavor and it's not installed, then raise)
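
The flavor rules above could be sketched roughly as follows. This is only an illustration of the proposal, not the code in html.py: `FLAVOR_MODULES`, `is_installed`, and `resolve_flavors` are hypothetical names, and `bs_html5lib` is just the flavor name floated in this comment.

```python
from importlib.util import find_spec

# Hypothetical mapping from a flavor name to the modules it needs.
# Order matters: with flavor=None, lxml is tried before bs_html5lib.
FLAVOR_MODULES = {
    "lxml": ("lxml",),
    "bs_html5lib": ("bs4", "html5lib"),
}


def is_installed(module_name):
    """True if the top-level module can be found on the import path."""
    return find_spec(module_name) is not None


def resolve_flavors(flavor=None, installed=is_installed):
    """Return the list of flavors to try, in order."""
    if flavor is None:
        # Default: every known flavor whose modules are all installed.
        usable = [
            name
            for name, modules in FLAVOR_MODULES.items()
            if all(installed(m) for m in modules)
        ]
        if not usable:
            raise ImportError("no HTML parser flavor is installed")
        return usable
    if flavor not in FLAVOR_MODULES:
        raise ValueError(f"unknown flavor: {flavor!r}")
    # An explicitly requested flavor that is not installed is an error.
    missing = [m for m in FLAVOR_MODULES[flavor] if not installed(m)]
    if missing:
        raise ImportError(f"flavor {flavor!r} needs: {', '.join(missing)}")
    return [flavor]
```

Passing `installed` in as an argument keeps the availability check separate from the ordering logic, so the ordering can be tested without installing or removing anything.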

cpcloud · 2013-06-04T21:25:39Z

yep doing all that except the flavor=None bit

cpcloud · 2013-06-04T21:25:45Z

thanks

cpcloud · 2013-06-04T21:59:18Z

@y-p really starting to feel your sentiment about not doing stuff "just because".

cpcloud · 2013-06-05T01:03:24Z

i think it's best to make html5lib a requirement, since 1) it generates valid markup and 2) both lxml and bs4 can use it, so that way you only have to install one or the other

cpcloud · 2013-06-05T01:03:40Z

also drastically simplifies error handling

cpcloud · 2013-06-05T01:07:34Z

in fact in that case it's best to just install lxml + html5lib and be done with it since that yields the best of both worlds, but for user ease of use i will keep bs4 since the install is easier

jreback · 2013-06-05T01:25:42Z

so if i understand this correctly

lxml: fast but only works on valid markup, install hard
html5lib: slower but works in lots of cases, install easy
bs: slower but works in lots of cases, install med hard

as a user I think you generally just want to say parse it (unless you specify)

maybe accept flavor=None (default), flavor=parser, flavor=[list of parsers]

so if you are given a parser (lxml,html5lib,bs) you will try it, otherwise raise
a list of parsers, you can try them in turn
default = try say: lxml,bs,html5lib?

of course if a parser is not installed need to skip

?
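
One way to read the proposal above, as a rough sketch. The `parsers` dict stands in for "what is installed" and maps a flavor name to a callable; that dict and the function names are assumptions made for the example, and a parse failure is assumed to surface as a `ValueError`.

```python
def normalize_flavor(flavor, default=("lxml", "bs", "html5lib")):
    """Turn flavor=None / "name" / [names] into a list of names to try."""
    if flavor is None:
        return list(default)
    if isinstance(flavor, str):
        return [flavor]
    return list(flavor)


def parse_with_fallback(text, parsers, flavor=None):
    """Try each requested parser in turn and return (name, result).

    `parsers` maps a flavor name to a callable taking the text; a name that
    is absent from the dict is treated as "not installed".
    """
    single = isinstance(flavor, str)
    errors = {}
    for name in normalize_flavor(flavor):
        parse = parsers.get(name)
        if parse is None:
            if single:
                # The user asked for exactly this parser, so do not skip it.
                raise ImportError(f"parser {name!r} is not installed")
            continue  # in a list (or the default), a missing parser is skipped
        try:
            return name, parse(text)
        except ValueError as exc:
            errors[name] = exc  # remember the failure and try the next one
    raise ValueError(f"no parser succeeded; failures: {errors}")
```

With a single flavor the loop has one entry, so a failure there raises straight away, which matches "you will try it, otherwise raise".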

cpcloud · 2013-06-05T01:33:52Z

i was a little preemptive here, turns out lxml still sucks with html5lib

cpcloud · 2013-06-05T01:35:15Z

u got the list right except i would change to:

lxml: same as above
html5lib: slower but works in all the cases i've thrown at it, install easy
bs4: same as html5lib

cpcloud · 2013-06-05T01:36:38Z

bs4 and html5lib are inextricably bound together, since bs4 + lxml at this point is just wasted code: the results will still be as invalid as lxml results if you're not strict, plus html5lib is pretty low level.

cpcloud · 2013-06-05T01:37:11Z

your original "fail lxml -> bs4 + html5lib" is what i'm thinking now, that passes all tests, assuming everything is installed.

cpcloud · 2013-06-05T01:39:08Z

of course because lxml is failing out most of the time, often the tests get run twice which means it takes about 48s to run test_html.py

cpcloud · 2013-06-05T01:39:22Z

but it does the right thing

jreback · 2013-06-05T01:59:05Z

I guess you could have some of the tests skip lxml, and just mark them slow (so they get run in at least 1 travis)

cpcloud · 2013-06-05T02:06:08Z

i will clean up the included html with tidy that the lxml and other tests get run on (they are subclassed right now since i want to run the tests over the same input), then all the network tests will be invalid in the eyes of lxml so that will fallback on bs4 + html5lib

jreback added a commit that referenced this pull request Jun 4, 2013

Merge pull request #3751 from cpcloud/read-html-bs4-install-docs

6653bc0

DOC: document bs4/lxml/html5lib issues

jreback merged commit 6653bc0 into pandas-dev:master Jun 4, 2013

cpcloud deleted the read-html-bs4-install-docs branch June 4, 2013 20:37
