Skip to content

DOC: document bs4/lxml/html5lib issues #3751

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jun 4, 2013
Merged

DOC: document bs4/lxml/html5lib issues #3751

merged 1 commit into from
Jun 4, 2013

Conversation

cpcloud
Copy link
Member

@cpcloud cpcloud commented Jun 3, 2013

to summarize this mess of deps and check my thoughts here

if a user insists on using lxml (either with or without bs4)

  • warning about its inability to deal with the modern web
  • warning saying that the user should install html5lib and bs4 so that a page will parse even if lxml barfs
  • test coverage for failing and passing pages (things that would parse "correctly" before will now fail since the parser will be extremely strict) thus only pages validated by the DTD will even try to parse

(to be fair I was really enthusiastic about lxml because of how fast it is but now i'm sort of against it)

what users should really do

  • install bs4
  • install html5lib
  • happily parse things into DataFrames with a low amount of stress

anaconda + lxml (no bs4)

  • no problems (modulo the above warnings)

@wesm maybe you could chime in about what (if anything) you did to libxml2/libxslt i wasn't clear on the details from the mailing list.

anaconda + bs4 + lxml

  • make sure that you're using bs4==4.2.1
  • make sure that you're using lxml==3.2.1
  • workout the details of how to do this with conda (i did this already, but it was 2 or 3 AM so I'm a little foggy on the details)

anaconda + bs4 + html5lib (no lxml)

  • happy parsing of HTML tables

this will be in a gotcha that will be linked to from a warning at the top of the read html section of io.rst

@jreback
Copy link
Contributor

jreback commented Jun 4, 2013

this looks good...are you going to add something about the valid installations (and invalid ones that we know about)? (or is that already in install?)

@cpcloud
Copy link
Member Author

cpcloud commented Jun 4, 2013

Yeah I just wanted to put it up to remind myself. Should be done by
tomorrow.
On Jun 3, 2013 8:49 PM, "jreback" [email protected] wrote:

this looks good...are you going to add something about the valid
installations (and invalid ones that we know about)? (or is that already in
install?)


Reply to this email directly or view it on GitHubhttps://github.com//pull/3751#issuecomment-18882997
.

@cpcloud
Copy link
Member Author

cpcloud commented Jun 4, 2013

I'll put a link to the read HTML gotchas in the install and a mention of the anaconda issues, so that I can give a bit more detail in the gotcha.

@jreback
Copy link
Contributor

jreback commented Jun 4, 2013

just let me know when you want to merge this

@cpcloud
Copy link
Member Author

cpcloud commented Jun 4, 2013

this warning is going to be gigantic. i hope this doesn't discourage people from trying this :(

@jreback
Copy link
Contributor

jreback commented Jun 4, 2013

hahah

@jreback
Copy link
Contributor

jreback commented Jun 4, 2013

they would try ti w/o the warning then go WTF!

@cpcloud
Copy link
Member Author

cpcloud commented Jun 4, 2013

huge-warning

@jreback
Copy link
Contributor

jreback commented Jun 4, 2013

I like it! +1

also update README.rst (unfort the warnings don't look quite as good there)

@cpcloud
Copy link
Member Author

cpcloud commented Jun 4, 2013

ah yes. thanks.

@cpcloud
Copy link
Member Author

cpcloud commented Jun 4, 2013

now to write a tome about anaconda...

  • list install proc for apt-get enabled systems

@cpcloud
Copy link
Member Author

cpcloud commented Jun 4, 2013

should probably wait 2 merge this until i've actually implemented the mentioned fallback to html5lib if lxml fails to parse

@jreback
Copy link
Contributor

jreback commented Jun 4, 2013

html parsing is amazing complicated (because the standard is not followed)

@jreback
Copy link
Contributor

jreback commented Jun 4, 2013

web sites should be REQUIRED to post csv of there table data!

@cpcloud
Copy link
Member Author

cpcloud commented Jun 4, 2013

when i first starting doing this i was like: "i'm going to implement this in CYTHON". um yeah way too optimistic.

@cpcloud
Copy link
Member Author

cpcloud commented Jun 4, 2013

majority of tables are not semantic they are used for formatting UGH

@cpcloud
Copy link
Member Author

cpcloud commented Jun 4, 2013

what i really need now is to operate my 4 different vagrant boxes thru tmux all at the same time so i can test this beast without repeating myself all the time!

@jreback
Copy link
Contributor

jreback commented Jun 4, 2013

you ARE a maniac!

@cpcloud
Copy link
Member Author

cpcloud commented Jun 4, 2013

i should have the fallback by tonight...or some time early in the am

@cpcloud
Copy link
Member Author

cpcloud commented Jun 4, 2013

shall i squash here?

@jreback
Copy link
Contributor

jreback commented Jun 4, 2013

yes....I almost always squash down to a few commits

@jreback
Copy link
Contributor

jreback commented Jun 4, 2013

if you want I can merge this with say 1/2 hour...that way it will be in today's docs.....up 2 u

@cpcloud
Copy link
Member Author

cpcloud commented Jun 4, 2013

when is the doc build? i'm about to participate in an eye movement experiment that will take 1/2 hour so i can't have these changes for another 45 min

@jreback
Copy link
Contributor

jreback commented Jun 4, 2013

make sure your eye are open!

i think sometime between 4-5 pm (changes show up after 5 est)

its no biggie.....can do whenever...when i make a big doc change i like to see it on the main site to read over again.....

DOC: change formatting

DOC: more formatting

DOC: add bold substitutions

DOC: fill out bold links and rephrase

DOC: fill out link to gotchas

DOC: add gigantic install.rst warning

DOC: move version note about html5lib to first mention of it

DOC: add same to readme and add boto to install.rst

DOC: add anaconda note

DOC: add note about debian based system installation

DOC: add correct lexer for pygments formatting of code snippets

DOC: move boto up
@cpcloud
Copy link
Member Author

cpcloud commented Jun 4, 2013

it's there if u want to merge

jreback added a commit that referenced this pull request Jun 4, 2013
DOC: document bs4/lxml/html5lib issues
@jreback jreback merged commit 6653bc0 into pandas-dev:master Jun 4, 2013
@jreback
Copy link
Contributor

jreback commented Jun 4, 2013

thansk!

@cpcloud cpcloud deleted the read-html-bs4-install-docs branch June 4, 2013 20:37
@cpcloud
Copy link
Member Author

cpcloud commented Jun 4, 2013

@jreback what do you think about removing bs4 + lxml support and have either lxml alone, or bs4 + html5lib. the point being why go through bs4 if you don't have to, if lxml can parse the page it should do it in the fastest way possible and going thru bs4 makes no sense. too much redundancy

@cpcloud
Copy link
Member Author

cpcloud commented Jun 4, 2013

more doc changes, but i don't mind making them shorter, and a whole lot of extra code surrounding imports and other cruft in html.py can be removed if i do this

@jreback
Copy link
Contributor

jreback commented Jun 4, 2013

@cpcloud docs are updated (small issue on Read HTML with attributes ?)

@jreback
Copy link
Contributor

jreback commented Jun 4, 2013

I thought you do lxml, then fallback to bs4 + html5lib ? only, right?

@cpcloud
Copy link
Member Author

cpcloud commented Jun 4, 2013

oh ok, yeah duh...

@cpcloud
Copy link
Member Author

cpcloud commented Jun 4, 2013

should there even be a choice then?

@jreback
Copy link
Contributor

jreback commented Jun 4, 2013

I would make your flavor arg control this

e.g. flavor={ lxml, bs_html5lib }, so user could specify the parser/backend as one, simpler that way I guess

then if no flavor is specified (e.g. make the default flavor=None) then you can do lxml, then bs_html5lib
(always subject to if these flavors are installed)
(and if they specify a flavor and its not installed, then raise)

@cpcloud
Copy link
Member Author

cpcloud commented Jun 4, 2013

yep doing all the except the flavor=None bit

@cpcloud
Copy link
Member Author

cpcloud commented Jun 4, 2013

thanks

@cpcloud
Copy link
Member Author

cpcloud commented Jun 4, 2013

@y-p really starting to feel your sentiment about not doing stuff "just because".

@cpcloud
Copy link
Member Author

cpcloud commented Jun 5, 2013

i think it's best to make html5lib a requirement, since 1) it generates valid markup 2) both lxml and bs4 can use it that way you only have to install one or the other

@cpcloud
Copy link
Member Author

cpcloud commented Jun 5, 2013

also drastically simplifies error handling

@cpcloud
Copy link
Member Author

cpcloud commented Jun 5, 2013

in fact in that case it's best to just install lxml + html5lib and be done with it since that yields the best of both worlds but user ease of use i will keep bs4 since the install is easier

@jreback
Copy link
Contributor

jreback commented Jun 5, 2013

so if i understand this correctly

lxml: fast but only works on valid markup, install hard
html5lib: slower but works in lots of cases,install easy
bs: slower but works in lots of cases, install med hard

as a user I think you generally just want to say parse it (unless you specify)

maybe accept flavor=None (default), flavor=parser, flavor=[list of parsers]

so if you are given a parsers (lxml,html5lib,bs) you will try it, otherwise raise
a list of parsers, you can try them in term
default = try say: lxml,bs,html5lib?

of course if a parser is not installed need to skip

?

@cpcloud
Copy link
Member Author

cpcloud commented Jun 5, 2013

i was a little preemptive here turns out lxml still sucks with html5lib

@cpcloud
Copy link
Member Author

cpcloud commented Jun 5, 2013

u got the list right except i would change to:

  • lxml: same as above
  • html5lib: slower but works in all the cases i've thrown at it, install easy
  • bs4: same as html5lib

@cpcloud
Copy link
Member Author

cpcloud commented Jun 5, 2013

bs4 and html5lib are inextricably bound together since bs4 + lxml at this point is just wasting code since the results will still as invalid as lxml results if you're not strict, plus html5lib is pretty low level.

@cpcloud
Copy link
Member Author

cpcloud commented Jun 5, 2013

your original "fail lxml -> bs4 + html5lib" is what i'm thinking now, that passes all tests, assuming everything is installed.

@cpcloud
Copy link
Member Author

cpcloud commented Jun 5, 2013

of course because lxml is failing out most of the time, often the tests get run twice which means it takes about 48s to run test_html.py

@cpcloud
Copy link
Member Author

cpcloud commented Jun 5, 2013

but it does the right thing

@jreback
Copy link
Contributor

jreback commented Jun 5, 2013

I guess you could have some of the tests skip lxml, and just mark them slow (so they get run in at least 1 travis)

@cpcloud
Copy link
Member Author

cpcloud commented Jun 5, 2013

i will clean up the included html with tidy that the lxml and other tests get run on (they are subclassed right now since i want to run the tests over the same input), then all the network tests will be invalid in the eyes of lxml so that will fallback on bs4 + html5lib

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants