fix for read_html with bs4 failing on table with header and one column #12975
Should be OK now.
"""
Don't fail with bs4 when there is a header and only one column
"""
data = StringIO('''<html>
add the issue number as a comment
small comments. ping when green.
Done. (I wasn't sure if I can use |
data = StringIO('''<html>
<body>
<table>
<thead>
this doesn't seem to replicate the error message though:
In [1]: data = StringIO('''<html>
...: <body>
...: <table>
...: <thead>
...: <tr>
...: <th>Header</th>
...: </tr>
...: </thead>
...: <tbody>
...: <tr>
...: <td>first</td>
...: </tr>
...: </tbody>
...: </table>
...: </body>
...: </html>''')
In [2]: pd.read_html(data)
Out[2]:
[ Header
0 first]
In [3]: pd.read_html(data)[0]
Out[3]:
Header
0 first
In [4]: pd.__version__
Out[4]: u'0.18.0'
Shouldn't this raise a similar error?
You forgot to add flavor='bs4'. When I do:
In [2]: pandas.__version__
Out[2]: '0.18.0'
In [3]: s = '''<html>
<body>
<table>
<thead>
<tr>
<th>Header</th>
</tr>
</thead>
<tbody>
<tr>
<td>first</td>
</tr>
</tbody>
</table>
</body>
</html>'''
In [4]: pandas.read_html(s, flavor="bs4")
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-4-5f8f3ea79c02> in <module>()
----> 1 pandas.read_html(s, flavor="bs4")
/home/dan/.local/opt/miniconda3/envs/mathbs/lib/python3.5/site-packages/pandas/io/html.py in read_html(io, match, flavor, header, index_col, skiprows, attrs, parse_dates, tupleize_cols, thousands, encoding)
868 _validate_header_arg(header)
869 return _parse(flavor, io, match, header, index_col, skiprows,
--> 870 parse_dates, tupleize_cols, thousands, attrs, encoding)
/home/dan/.local/opt/miniconda3/envs/mathbs/lib/python3.5/site-packages/pandas/io/html.py in _parse(flavor, io, match, header, index_col, skiprows, parse_dates, tupleize_cols, thousands, attrs, encoding)
741 parse_dates=parse_dates,
742 tupleize_cols=tupleize_cols,
--> 743 thousands=thousands))
744 except StopIteration: # empty table
745 continue
/home/dan/.local/opt/miniconda3/envs/mathbs/lib/python3.5/site-packages/pandas/io/html.py in _data_to_frame(data, header, index_col, skiprows, parse_dates, tupleize_cols, thousands)
622
623 # fill out elements of body that are "ragged"
--> 624 _expand_elements(body)
625
626 tp = TextParser(body, header=header, index_col=index_col,
/home/dan/.local/opt/miniconda3/envs/mathbs/lib/python3.5/site-packages/pandas/io/html.py in _expand_elements(body)
599
600 def _expand_elements(body):
--> 601 lens = Series(lmap(len, body))
602 lens_max = lens.max()
603 not_max = lens[lens != lens_max]
/home/dan/.local/opt/miniconda3/envs/mathbs/lib/python3.5/site-packages/pandas/compat/__init__.py in lmap(*args, **kwargs)
116
117 def lmap(*args, **kwargs):
--> 118 return list(map(*args, **kwargs))
119
120 def lfilter(*args, **kwargs):
TypeError: len() of unsized object
whereas with the patched version it works:
In [3]: import pandas
In [4]: pandas.__version__
Out[4]: '0.18.0+145.g9b6f9f2'
In [5]: s = '''<html>
<body>
<table>
<thead>
<tr>
<th>Header</th>
</tr>
</thead>
<tbody>
<tr>
<td>first</td>
</tr>
</tbody>
</table>
</body>
</html>'''
In [6]: pandas.read_html(s, flavor="bs4")
Out[6]:
[ Header
0 first]
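The `TypeError: len() of unsized object` at the bottom of the traceback above is the message NumPy raises when `len()` is called on a zero-dimensional array. Below is a minimal sketch of how a single header cell could end up in that state. It assumes (this is not shown in the thread) that the bs4 header parsing squeezes a one-row, one-column result down to a 0-d array. `header_rows` is a hypothetical stand-in, and `np.atleast_1d` is only an illustrative remedy, not necessarily the exact change made in this PR.

```python
import numpy as np

# Assumption: a one-row, one-column header is held as a nested list like this.
header_rows = [["Header"]]

# squeeze() drops every length-1 axis, so a 1x1 array collapses to 0-d.
squeezed = np.array(header_rows).squeeze()

try:
    len(squeezed)
    message = None
except TypeError as exc:
    # A 0-d array has no length, which is what the traceback reports.
    message = str(exc)

# atleast_1d turns the 0-d array back into a 1-d array, so len() works again.
repaired = np.atleast_1d(squeezed)
```

With two or more header cells the squeezed array stays one-dimensional, which would explain why only the single-column table triggers the failure.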
@hnykda ahh I see. we are testing with multiple flavors. I think we default to |
Exactly. Everything is green.
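Since the failure only shows up with an explicit `flavor`, a quick way to compare parsers is to run the same one-column table through each of them. This is a rough sketch rather than the test added in this PR: `check_flavors` and `HTML` are hypothetical names, and flavors whose parser libraries are not installed are reported as skipped instead of failing.

```python
from io import StringIO

import pandas as pd

# The single-column table with a header used in the sessions above.
HTML = """<html>
<body>
<table>
<thead>
<tr>
<th>Header</th>
</tr>
</thead>
<tbody>
<tr>
<td>first</td>
</tr>
</tbody>
</table>
</body>
</html>"""


def check_flavors(flavors=("bs4", "lxml")):
    """Parse HTML with each flavor and collect (columns, values) per flavor."""
    results = {}
    for flavor in flavors:
        try:
            frame = pd.read_html(StringIO(HTML), flavor=flavor)[0]
        except ImportError:
            # The parser backing this flavor is not installed.
            results[flavor] = "skipped"
            continue
        results[flavor] = (list(frame.columns), frame.iloc[:, 0].tolist())
    return results


results = check_flavors()
```

On a version that includes the fix, every flavor that is not skipped should yield the column `Header` with the single value `first`.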
thanks @hnykda
git diff upstream/master | flake8 --diff
This is the fix proposed in PR 9194, which was closed because it lacked tests. The tests are added here.