BUG: error in read_html when parsing badly-escaped HTML from an io object #17975
Comments
Interestingly, when adding a test for the patch, I noticed that stuffing that test document into a I'll investigate that further tomorrow. |
Same issue: Cannot
>>> from urllib.request import urlopen
>>> import pandas as pd
>>> url = 'http://en.wikipedia.org/wiki/Matplotlib'
>>> assert pd.read_html(urlopen(url), 'Advantages', 'bs4') # works with bs4 alone
>>> pd.read_html(urlopen(url), 'Advantages')
Traceback (most recent call last):
File "<pyshell#7>", line 1, in <module>
pd.read_html(urlopen(url), 'Advantages')
File "C:\Program Files\Python36\lib\site-packages\pandas\io\html.py", line 915, in read_html
keep_default_na=keep_default_na)
File "C:\Program Files\Python36\lib\site-packages\pandas\io\html.py", line 749, in _parse
raise_with_traceback(retained)
File "C:\Program Files\Python36\lib\site-packages\pandas\compat\__init__.py", line 367, in raise_with_traceback
raise exc.with_traceback(traceback)
ValueError: No text parsed from document: <http.client.HTTPResponse object at 0x0000000005621358>
Note that one cannot do I think
>>> import pandas as pd
>>> from mock import Mock
>>> def mock_urlopen(data, url='http://spam'):
...     return Mock(**{'geturl.return_value': url, 'read.side_effect': [data, '', '']})
>>> good = mock_urlopen('<table><tr><td>spam<br />eggs</td></tr></table>')
>>> bad = mock_urlopen('<table><tr><td>spam<wbr />eggs</td></tr></table>')
>>> assert pd.read_html(good)
>>> assert pd.read_html(bad, flavor='bs4')
>>> bad.reset_mock()
>>> pd.read_html(bad)
Traceback (most recent call last):
...
ValueError: No text parsed from document: <Mock id='85948960'>
>>> bad.mock_calls
[call.geturl(),
call.tell(),
call.read(4000),
call.decode('ascii', 'strict'),
call.decode().decode('ascii', 'strict'),
call.decode().decode().find(':'),
call.read()]
The second |
Minimal amendment:
>>> bad = mock_urlopen('<table><tr><td>spam<wbr />eggs</td></tr></table>')
>>> pd.read_html(bad)
Traceback (most recent call last):
...
ValueError: No text parsed from document: <Mock id='50837656'>
>>> bad.mock_calls
[call.geturl(),
call.tell(),
call.read(4000),
call.read(3952),
call.decode('ascii', 'strict'),
call.decode().decode('ascii', 'strict'),
call.decode().decode().find(':'),
call.read()]
Again, the last |
The only way to rewind a urlopen is re-requesting it or buffering it, unfortunately. This becomes a much more complex patch, then 😦 |
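For illustration, the buffering option could look roughly like the sketch below. It uses only the standard library; `OneShotStream` and `buffered` are hypothetical names invented here (not pandas or urllib APIs), and `OneShotStream` merely stands in for a response object that can be read once and not rewound.

```python
import io


class OneShotStream:
    """Hypothetical stand-in for a non-seekable response (read once, no rewind)."""

    def __init__(self, data):
        self._buf = io.BytesIO(data)

    def read(self, size=-1):
        return self._buf.read(size)


def buffered(stream):
    """Read the whole stream once and return a seekable in-memory copy."""
    return io.BytesIO(stream.read())


html = b"<table><tr><td>spam<wbr />eggs</td></tr></table>"
raw = OneShotStream(html)
buf = buffered(raw)

first = buf.read()      # a first parser consumes everything
buf.seek(0)             # the copy can be rewound
second = buf.read()     # a second parser sees the same bytes
exhausted = raw.read()  # the original stream remains at EOF
```

The cost is that the whole document is held in memory, which is presumably part of why this makes the patch more complex.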
So I suppose that the try-next-parser fallback should raise if we only have a file handle (and not a path). Would take that as a PR. |
We can seek for some IO handles, though. I don't see any reason not to add something like
if hasattr(io, 'seek'):
    io.seek(0)
and raise a warning if hasattr(io, 'read') and not hasattr(io, 'seek') |
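A minimal sketch of that suggestion, written as a standalone helper rather than as the actual pandas patch. The name `rewind_if_possible` and the warning text are assumptions made for this example. One caveat: objects deriving from `io.IOBase` define `seek` even when they cannot seek, so the sketch also catches the error raised in that case instead of relying on `hasattr` alone.

```python
import io
import warnings


def rewind_if_possible(handle):
    """Hypothetical helper: rewind `handle` to the start, or warn if it cannot be rewound.

    Returns True when the handle was rewound, False otherwise.
    """
    if hasattr(handle, "seek"):
        try:
            handle.seek(0)
            return True
        except (io.UnsupportedOperation, OSError):
            # The attribute exists but the stream is not actually seekable.
            pass
    if hasattr(handle, "read"):
        warnings.warn(
            "handle is not seekable; a fallback parser would read from EOF",
            UserWarning,
        )
    return False


class ReadOnly:
    """Hypothetical read-only object with no seek() at all."""

    def read(self, size=-1):
        return b""


buf = io.BytesIO(b"<table></table>")
buf.read()                          # simulate a first parser consuming the input
rewound = rewind_if_possible(buf)   # seekable: cursor is back at 0
```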
Sounds good to me. I think @jreback means that the |
ah, you're talking about ditching the fallthrough to the next parser entirely? |
I thought for io handles (possibly only non-seekable ones). Does not occur with file names, right? |
Yep, since _read() reopens the file for each parser if you're passing in filenames.
…On Sat, Oct 28, 2017 at 4:35 PM, Sebastian Bank ***@***.***> wrote:
I thought for io handles (possibly only non-seekable ones). Does not occur
with file names, right?
|
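To make the path-versus-handle difference discussed above concrete, here is a toy model of the fallthrough. It is not pandas code: `read_source`, `failing_parser`, `fallback_parser` and `parse` are hypothetical names, and the only behaviour assumed from the thread is that a path is reopened for each parser while a handle is shared.

```python
import io
import os
import tempfile


def read_source(source):
    """Return bytes from a file-like handle, or from a path reopened on every call."""
    if hasattr(source, "read"):
        return source.read()
    with open(source, "rb") as fh:
        return fh.read()


def failing_parser(source):
    read_source(source)  # consumes the input, then gives up
    raise ValueError("parser could not handle the document")


def fallback_parser(source):
    data = read_source(source)
    if not data:
        raise ValueError("No text parsed from document")
    return data


def parse(source):
    """Try each parser in turn and re-raise the last error if all of them fail."""
    last_error = None
    for parser in (failing_parser, fallback_parser):
        try:
            return parser(source)
        except ValueError as err:
            last_error = err
    raise last_error


html = b"<table><tr><td>spam<wbr />eggs</td></tr></table>"

# Path: every parser gets a fresh file object, so the fallback succeeds.
with tempfile.NamedTemporaryFile(suffix=".html", delete=False) as tmp:
    tmp.write(html)
path = tmp.name
try:
    from_path = parse(path)
finally:
    os.remove(path)

# Handle: the first parser leaves the cursor at EOF, so the fallback reads nothing.
try:
    parse(io.BytesIO(html))
    handle_error = None
except ValueError as err:
    handle_error = str(err)
```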
Code Sample, a copy-pastable example if possible
Create test.html, with the contents:
Problem description
Pandas attempts to invoke a series of parsers on HTML documents, returning when one produces a result, and continuing to the next on error. This works fine when passing a path or entire document to read_html(), but when an IO object is passed, the subsequent parsers will be reading from a file whose read cursor is at EOF, producing an inscrutable 'no text parsed from document' error.
This can easily be fixed by rewinding the file with seek(0) before continuing to the next parser (will add PR shortly).
Expected Output
Output of pd.show_versions()