Skip to content

read_html(): rewinding [wip] #18017

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Nov 1, 2017
Merged

Conversation

erinzm
Copy link

@erinzm erinzm commented Oct 28, 2017

@erinzm erinzm changed the title read_html(): rewinding read_html(): rewinding [wip] Oct 28, 2017
@jreback
Copy link
Contributor

jreback commented Oct 28, 2017

tests!

@erinzm
Copy link
Author

erinzm commented Oct 28, 2017

Unfortunately, non-seekable io objects typically still have a stub .seek() implementation which raises io.UnsupportedOperation. I'll trap that instead.

Thankfully, Python has a built-in solution for this, io.IOBase.seekable()!

@erinzm erinzm force-pushed the read-html-rewinding branch 2 times, most recently from 83c8dfc to 2d03b15 Compare October 28, 2017 23:24
# and try to rewind it before trying the next parser
if hasattr(io, 'seekable') and io.seekable():
io.seek(0)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make an if else

@pep8speaks
Copy link

pep8speaks commented Oct 29, 2017

Hello @LiamIm! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on October 29, 2017 at 01:08 Hours UTC

@erinzm
Copy link
Author

erinzm commented Oct 29, 2017

How should I test rewinding on file objects? Just add a malformed HTML file to pandas/tests/io/data?

@jreback
Copy link
Contributor

jreback commented Oct 29, 2017

there are several examples in the issue
you can also create a class that is has a read and seek method that wraps a url

io.seek(0)
elif hasattr(io, 'seekable') and not io.seekable():
# if we couldn't rewind it, let the user know
raise ValueError('The favor {} failed to parse your input. '
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

favor -> flavor

@jreback jreback added Error Reporting Incorrect or improved errors from pandas IO HTML read_html, to_html, Styler.apply, Styler.applymap labels Oct 29, 2017
@erinzm erinzm force-pushed the read-html-rewinding branch 2 times, most recently from 9e3b9cb to eb9d525 Compare October 29, 2017 00:57
If lxml has read to the end of a file and then errored,
bs4/html5lib won't rewind it before trying to parse again,
and will throw a `ValueError: No text parsed from document`.

This patch fixes this issue, by rewinding the file object
when a parser fails. If the object was IO-ish but not seekable,
we throw an error notifying the user and asking them to try
a different flavor.
@erinzm erinzm force-pushed the read-html-rewinding branch from eb9d525 to a85cd42 Compare October 29, 2017 01:08
@codecov
Copy link

codecov bot commented Oct 29, 2017

Codecov Report

Merging #18017 into master will increase coverage by 0.01%.
The diff coverage is 75%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #18017      +/-   ##
==========================================
+ Coverage   91.23%   91.25%   +0.01%     
==========================================
  Files         163      163              
  Lines       50091    50095       +4     
==========================================
+ Hits        45703    45712       +9     
+ Misses       4388     4383       -5
Flag Coverage Δ
#multiple 89.05% <75%> (+0.02%) ⬆️
#single 40.24% <0%> (-0.07%) ⬇️
Impacted Files Coverage Δ
pandas/io/html.py 85.66% <75%> (+0.49%) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.75% <0%> (-0.1%) ⬇️
pandas/io/pytables.py 92.84% <0%> (+0.04%) ⬆️
pandas/plotting/_converter.py 65.2% <0%> (+1.81%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5959ee3...a85cd42. Read the comment docs.

@codecov
Copy link

codecov bot commented Oct 29, 2017

Codecov Report

Merging #18017 into master will increase coverage by 0.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #18017      +/-   ##
==========================================
+ Coverage   91.24%   91.25%   +0.01%     
==========================================
  Files         163      163              
  Lines       50091    50095       +4     
==========================================
+ Hits        45704    45713       +9     
+ Misses       4387     4382       -5
Flag Coverage Δ
#multiple 89.06% <100%> (+0.02%) ⬆️
#single 40.24% <0%> (-0.07%) ⬇️
Impacted Files Coverage Δ
pandas/io/html.py 85.98% <100%> (+0.8%) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.75% <0%> (-0.1%) ⬇️
pandas/plotting/_converter.py 65.2% <0%> (+1.81%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5959ee3...a85cd42. Read the comment docs.

@jreback jreback added this to the 0.22.0 milestone Nov 1, 2017
@jreback jreback merged commit 15fa4bd into pandas-dev:master Nov 1, 2017
@jreback
Copy link
Contributor

jreback commented Nov 1, 2017

thanks @LiamIm nice PR!

1kastner pushed a commit to 1kastner/pandas that referenced this pull request Nov 5, 2017
No-Stream pushed a commit to No-Stream/pandas that referenced this pull request Nov 28, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Error Reporting Incorrect or improved errors from pandas IO HTML read_html, to_html, Styler.apply, Styler.applymap
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: error in read_html when parsing badly-escaped HTML from an io object
4 participants