-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
DOC: document bs4/lxml/html5lib issues #3751
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
this looks good...are you going to add something about the valid installations (and invalid ones that we know about)? (or is that already in install?) |
Yeah I just wanted to put it up to remind myself. Should be done by
|
I'll put a link to the read HTML gotchas in the install and a mention of the anaconda issues, so that I can give a bit more detail in the gotcha. |
just let me know when you want to merge this |
this warning is going to be gigantic. i hope this doesn't discourage people from trying this :( |
hahah |
they would try ti w/o the warning then go WTF! |
I like it! +1 also update README.rst (unfort the warnings don't look quite as good there) |
ah yes. thanks. |
now to write a tome about anaconda...
|
should probably wait 2 merge this until i've actually implemented the mentioned fallback to html5lib if lxml fails to parse |
html parsing is amazing complicated (because the standard is not followed) |
web sites should be REQUIRED to post csv of there table data! |
when i first starting doing this i was like: "i'm going to implement this in CYTHON". um yeah way too optimistic. |
majority of tables are not semantic they are used for formatting UGH |
what i really need now is to operate my 4 different vagrant boxes thru tmux all at the same time so i can test this beast without repeating myself all the time! |
you ARE a maniac! |
i should have the fallback by tonight...or some time early in the am |
shall i squash here? |
yes....I almost always squash down to a few commits |
if you want I can merge this with say 1/2 hour...that way it will be in today's docs.....up 2 u |
when is the doc build? i'm about to participate in an eye movement experiment that will take 1/2 hour so i can't have these changes for another 45 min |
make sure your eye are open! i think sometime between 4-5 pm (changes show up after 5 est) its no biggie.....can do whenever...when i make a big doc change i like to see it on the main site to read over again..... |
DOC: change formatting DOC: more formatting DOC: add bold substitutions DOC: fill out bold links and rephrase DOC: fill out link to gotchas DOC: add gigantic install.rst warning DOC: move version note about html5lib to first mention of it DOC: add same to readme and add boto to install.rst DOC: add anaconda note DOC: add note about debian based system installation DOC: add correct lexer for pygments formatting of code snippets DOC: move boto up
it's there if u want to merge |
DOC: document bs4/lxml/html5lib issues
thansk! |
@jreback what do you think about removing bs4 + lxml support and have either lxml alone, or bs4 + html5lib. the point being why go through bs4 if you don't have to, if lxml can parse the page it should do it in the fastest way possible and going thru bs4 makes no sense. too much redundancy |
more doc changes, but i don't mind making them shorter, and a whole lot of extra code surrounding imports and other cruft in |
@cpcloud docs are updated (small issue on Read HTML with attributes ?) |
I thought you do lxml, then fallback to bs4 + html5lib ? only, right? |
oh ok, yeah duh... |
should there even be a choice then? |
I would make your flavor arg control this e.g. then if no flavor is specified (e.g. make the default |
yep doing all the except the |
thanks |
@y-p really starting to feel your sentiment about not doing stuff "just because". |
i think it's best to make |
also drastically simplifies error handling |
in fact in that case it's best to just install lxml + html5lib and be done with it since that yields the best of both worlds but user ease of use i will keep bs4 since the install is easier |
so if i understand this correctly lxml: fast but only works on valid markup, install hard as a user I think you generally just want to say parse it (unless you specify) maybe accept so if you are given a parsers (lxml,html5lib,bs) you will try it, otherwise raise of course if a parser is not installed need to skip ? |
i was a little preemptive here turns out lxml still sucks with html5lib |
u got the list right except i would change to:
|
bs4 and html5lib are inextricably bound together since bs4 + lxml at this point is just wasting code since the results will still as invalid as lxml results if you're not strict, plus html5lib is pretty low level. |
your original "fail lxml -> bs4 + html5lib" is what i'm thinking now, that passes all tests, assuming everything is installed. |
of course because lxml is failing out most of the time, often the tests get run twice which means it takes about 48s to run |
but it does the right thing |
I guess you could have some of the tests skip lxml, and just mark them slow (so they get run in at least 1 travis) |
i will clean up the included html with |
to summarize this mess of deps and check my thoughts here
if a user insists on using
lxml
(either with or withoutbs4
)html5lib
andbs4
so that a page will parse even iflxml
barfs(to be fair I was really enthusiastic about
lxml
because of how fast it is but now i'm sort of against it)what users should really do
bs4
html5lib
DataFrame
s with a low amount of stressanaconda +
lxml
(nobs4
)@wesm maybe you could chime in about what (if anything) you did to
libxml2
/libxslt
i wasn't clear on the details from the mailing list.anaconda +
bs4
+lxml
bs4==4.2.1
lxml==3.2.1
conda
(i did this already, but it was 2 or 3 AM so I'm a little foggy on the details)anaconda +
bs4
+html5lib
(nolxml
)this will be in a gotcha that will be linked to from a warning at the top of the read html section of io.rst