Commit 2d03b15

io.html.read_html(): rewind seekable io objects when parsers fail
If lxml has read to the end of a file and then errored, bs4/html5lib won't rewind it before trying to parse again, and will throw a `ValueError: No text parsed from document`. This patch fixes the issue by rewinding the file object when a parser fails. If the object is file-like but not seekable, we raise an error notifying the user and asking them to try a different flavor.
1 parent e1dabf3 commit 2d03b15

1 file changed, +13 -0 lines changed

pandas/io/html.py

@@ -742,6 +742,19 @@ def _parse(flavor, io, match, attrs, encoding, **kwargs):
         try:
             tables = p.parse_tables()
         except Exception as caught:
+            # if `io` is an io-like object, check if it's seekable
+            # and try to rewind it before trying the next parser
+            if hasattr(io, 'seekable') and io.seekable():
+                io.seek(0)
+
+            # if we couldn't rewind it, let the user know
+            if hasattr(io, 'seekable') and not io.seekable():
+                raise ValueError('The flavor {} failed to parse your input. '
+                                 'Since you passed a non-rewindable file '
+                                 'object, we can\'t rewind it to try '
+                                 'another parser. Try read_html() with a '
+                                 'different flavor.'.format(flav))
+
             retained = caught
         else:
             break
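
The rewind-and-retry behavior in this patch can be illustrated outside pandas with a minimal, self-contained sketch. Everything named here (`parse_with_fallback`, `strict_parser`, `lenient_parser`, `OneShotStream`) is hypothetical and written only for illustration; none of it is part of pandas or of this commit. The sketch assumes each parser is a plain callable that reads from a file-like object and raises `ValueError` on failure.

```python
import io


class OneShotStream(io.StringIO):
    """Hypothetical file-like object that reports itself as non-seekable."""

    def seekable(self):
        return False


def strict_parser(stream):
    """Hypothetical parser: consumes the whole stream, then fails."""
    stream.read()
    raise ValueError("strict parser gave up")


def lenient_parser(stream):
    """Hypothetical parser: fails only if there is nothing left to read."""
    text = stream.read()
    if not text:
        raise ValueError("No text parsed from document")
    return text.upper()


def parse_with_fallback(source, parsers):
    """Try each (name, parser) pair in order, rewinding `source` after a failure."""
    retained = None
    for name, parser in parsers:
        try:
            return parser(source)
        except ValueError as caught:
            # Only file-like objects expose `seekable`; strings and paths are skipped.
            if hasattr(source, "seekable"):
                if source.seekable():
                    # Rewind so the next parser sees the full input again.
                    source.seek(0)
                else:
                    # The input is already consumed and cannot be replayed.
                    raise ValueError(
                        "The parser {} failed and the input is not "
                        "rewindable; try a different parser.".format(name)
                    ) from caught
            retained = caught
    if retained is None:
        raise ValueError("no parsers were supplied")
    raise retained
```

With a seekable `io.StringIO`, `strict_parser` exhausts the stream and fails, the stream is rewound, and `lenient_parser` then succeeds on the full text. With `OneShotStream`, the first failure is reported immediately, because a second parser would only ever see an empty stream.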
