BUG: error in read_html when parsing badly-escaped HTML from an io object #17975

Closed
erinzm opened this issue Oct 25, 2017 · 10 comments · Fixed by #18017
Labels: Error Reporting, IO HTML
Milestone: Next Major Release

Comments

erinzm commented Oct 25, 2017

Code Sample, a copy-pastable example if possible

Create test.html, with the contents:

<!doctype html>
<html>
<body>
<table>
	<tr><td>poorly-escaped cell with an & oh noes</td></tr>
</table>
</body>
</html>
>>> import pandas as pd
>>> pd.__version__
'0.20.3'
>>> f = open('./test.html')
>>> pd.read_html(f)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
    pd.read_html(f)
  File "/usr/lib/python3.6/site-packages/pandas/io/html.py", line 906, in read_html
    keep_default_na=keep_default_na)
  File "/usr/lib/python3.6/site-packages/pandas/io/html.py", line 743, in _parse
    raise_with_traceback(retained)
  File "/usr/lib/python3.6/site-packages/pandas/compat/__init__.py", line 344, in raise_with_traceback
    raise exc.with_traceback(traceback)
ValueError: No text parsed from document: <_io.TextIOWrapper name='/home/liam/test.html' mode='r' encoding='UTF-8'>

Problem description

Pandas invokes a series of parsers on an HTML document, returning as soon as one produces a result and falling through to the next on error. This works fine when read_html() is given a path or a whole document string, but when an IO object is passed, every parser after the first reads from a handle whose cursor is already at EOF, producing the inscrutable 'No text parsed from document' error above.

This can easily be fixed by rewinding the file with seek(0) before falling through to the next parser (I will open a PR shortly).
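
A minimal standalone sketch of the idea (try_parsers is an illustrative helper, not the actual _parse loop in pandas/io/html.py):

def try_parsers(doc, parsers):
    # Illustrative fallthrough loop: try each parser in turn,
    # rewinding the handle between attempts so that later parsers
    # do not start reading at EOF.
    error = None
    for parse in parsers:
        try:
            return parse(doc)
        except ValueError as exc:
            error = exc
            if hasattr(doc, 'seek'):
                doc.seek(0)  # the proposed fix
    raise error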

Expected Output

[                                       0
0  poorly-escaped cell with an & oh noes]

Output of pd.show_versions()

>>> pd.show_versions()
INSTALLED VERSIONS
------------------
commit: e1dabf37645f0fcabeed1d845a0ada7b32415606
python: 3.6.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.13.6-1-ARCH
machine: x86_64
processor: 
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.21.0rc1+36.ge1dabf376.dirty
pytest: 3.2.3
pip: 9.0.1
setuptools: 36.6.0
Cython: 0.27.2
numpy: 1.13.3
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: 4.1.0
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
erinzm (Author) commented Oct 25, 2017

Interestingly, when adding a test for the patch, I noticed that stuffing that test document into a StringIO seems to work: lxml still fails, but the fallback to html5lib/bs4 rewinds properly.
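
For reference, a rough transcript of that StringIO variant (output approximated from the Expected Output above):

>>> import io
>>> import pandas as pd
>>> doc = ('<table><tr><td>poorly-escaped cell with '
...        'an & oh noes</td></tr></table>')
>>> pd.read_html(io.StringIO(doc))  # lxml fails, the bs4 fallback succeeds
[                                       0
 0  poorly-escaped cell with an & oh noes]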

I'll investigate that further tomorrow.

xflr6 (Contributor) commented Oct 28, 2017

Same issue: one cannot read_html() a webpage directly from an urlopen() result when lxml does not like it:

>>> from urllib.request import urlopen
>>> import pandas as pd
>>> url = 'http://en.wikipedia.org/wiki/Matplotlib'
>>> assert pd.read_html(urlopen(url), 'Advantages', 'bs4')  # works with bs4 alone
>>> pd.read_html(urlopen(url), 'Advantages')
Traceback (most recent call last):
  File "<pyshell#7>", line 1, in <module>
    pd.read_html(urlopen(url), 'Advantages')
  File "C:\Program Files\Python36\lib\site-packages\pandas\io\html.py", line 915, in read_html
    keep_default_na=keep_default_na)
  File "C:\Program Files\Python36\lib\site-packages\pandas\io\html.py", line 749, in _parse
    raise_with_traceback(retained)
  File "C:\Program Files\Python36\lib\site-packages\pandas\compat\__init__.py", line 367, in raise_with_traceback
    raise exc.with_traceback(traceback)
ValueError: No text parsed from document: <http.client.HTTPResponse object at 0x0000000005621358>

Note that one cannot do .seek(0) on the urlopen return value (so the fix needs to be more complex).
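
For illustration (a quick check, not from the original thread): the HTTPResponse object reports itself as non-seekable:

>>> from urllib.request import urlopen
>>> urlopen('http://en.wikipedia.org/wiki/Matplotlib').seekable()
False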

I think lxml does something slightly different with StringIOs. So here is a self-contained test case:

>>> import pandas as pd
>>> from mock import Mock
>>> def mock_urlopen(data, url='http://spam'):
...     return Mock(**{'geturl.return_value': url,
...                    'read.side_effect': [data, '', '']})

>>> good = mock_urlopen('<table><tr><td>spam<br />eggs</td></tr></table>')
>>> bad = mock_urlopen('<table><tr><td>spam<wbr />eggs</td></tr></table>')
>>> assert pd.read_html(good)
>>> assert pd.read_html(bad, flavor='bs4')
>>> bad.reset_mock()
>>> pd.read_html(bad)
Traceback (most recent call last):
...
ValueError: No text parsed from document: <Mock id='85948960'>
>>> bad.mock_calls
[call.geturl(),
 call.tell(),
 call.read(4000),
 call.decode('ascii', 'strict'),
 call.decode().decode('ascii', 'strict'),
 call.decode().decode().find(':'),
 call.read()]

The second .read() call is the one where bs4 takes over and fails to parse the empty string.

xflr6 (Contributor) commented Oct 28, 2017

Minimal amendment: reset_mock() does not rewind read.side_effect, so here is the same with a fresh mock:

>>> bad = mock_urlopen('<table><tr><td>spam<wbr />eggs</td></tr></table>')
>>> pd.read_html(bad)
Traceback (most recent call last):
...
ValueError: No text parsed from document: <Mock id='50837656'>
>>> bad.mock_calls
[call.geturl(),
 call.tell(),
 call.read(4000),
 call.read(3952),
 call.decode('ascii', 'strict'),
 call.decode().decode('ascii', 'strict'),
 call.decode().decode().find(':'),
 call.read()]

Again, the last .read() call is from bs4.

erinzm (Author) commented Oct 28, 2017

Unfortunately, the only way to rewind a urlopen response is to re-request it or to buffer it. This becomes a much more complex patch, then 😦
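
A rough sketch of the buffering route (read_html_buffered is a hypothetical wrapper, not a pandas API):

import io
from urllib.request import urlopen

import pandas as pd

def read_html_buffered(url, **kwargs):
    # Slurp the non-seekable HTTP response into an in-memory,
    # seekable buffer so read_html's parser fallback can rewind it.
    with urlopen(url) as resp:
        buf = io.BytesIO(resp.read())
    return pd.read_html(buf, **kwargs)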

jreback added the IO HTML label on Oct 28, 2017

jreback (Contributor) commented Oct 28, 2017

So I suppose that the try-next-parser step should raise if we only have a file handle (and not a path). Would take that as a PR.

jreback added the Error Reporting and Difficulty Novice labels on Oct 28, 2017
jreback added this to the Next Major Release milestone on Oct 28, 2017

erinzm (Author) commented Oct 28, 2017

We can seek on some IO handles, though. I don't see any reason not to add something like

if hasattr(io, 'seek'):
    io.seek(0)

and raise a warning if

hasattr(io, 'read') and not hasattr(io, 'seek')
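
Putting the two checks together, a minimal sketch (helper name hypothetical):

import warnings

def _rewind_if_possible(handle):
    # Rewind seekable handles between parser attempts; warn when a
    # read-only stream cannot be rewound and retrying is hopeless.
    if hasattr(handle, 'seek'):
        handle.seek(0)
    elif hasattr(handle, 'read'):
        warnings.warn('stream cannot be rewound for the next parser; '
                      'parsing is likely to fail')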

xflr6 (Contributor) commented Oct 28, 2017

Sounds good to me. I think @jreback means that the raise (possibly after checking for seek) should occur in the branch after the first parser fails, making the current behaviour more official/transparent (i.e. giving a better error message). The user can then select/try a different flavor (maybe the error message can hint at that).
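
In that reading, the non-seekable branch would raise instead of warn; a sketch with hypothetical wording:

def _rewind_or_raise(handle, exc):
    # Fail loudly instead of retrying against an exhausted,
    # non-seekable stream.
    if hasattr(handle, 'seek'):
        handle.seek(0)
    elif hasattr(handle, 'read'):
        raise ValueError(
            'parser failed and the underlying stream cannot be '
            'rewound; buffer the document or try a different '
            "flavor, e.g. flavor='bs4'") from exc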

erinzm (Author) commented Oct 28, 2017

Ah, you're talking about ditching the fallthrough to the next parser entirely?

xflr6 (Contributor) commented Oct 28, 2017

I thought only for IO handles (possibly just non-seekable ones). It does not occur with file names, right?

erinzm (Author) commented Oct 28, 2017 via email
