fix for read_html with bs4 failing on table with header and one column #12975

Closed
wants to merge 3 commits into from
2 changes: 2 additions & 0 deletions doc/source/whatsnew/v0.18.1.txt
@@ -344,3 +344,5 @@ Bug Fixes


 - Bug in ``fill_value`` is ignored if the argument to a binary operator is a constant (:issue `12723`)
+
+- Bug in ``pd.read_html`` when using bs4 flavor and parsing table with a header and only one column (:issue `9178`)
6 changes: 4 additions & 2 deletions pandas/io/html.py
@@ -356,14 +356,16 @@ def _parse_raw_thead(self, table):
         res = []
         if thead:
             res = lmap(self._text_getter, self._parse_th(thead[0]))
-        return np.array(res).squeeze() if res and len(res) == 1 else res
+        return np.atleast_1d(
+            np.array(res).squeeze()) if res and len(res) == 1 else res

     def _parse_raw_tfoot(self, table):
         tfoot = self._parse_tfoot(table)
         res = []
         if tfoot:
             res = lmap(self._text_getter, self._parse_td(tfoot[0]))
-        return np.array(res).squeeze() if res and len(res) == 1 else res
+        return np.atleast_1d(
+            np.array(res).squeeze()) if res and len(res) == 1 else res

     def _parse_raw_tbody(self, table):
         tbody = self._parse_tbody(table)
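To see why the extra `np.atleast_1d` call matters, the two return expressions can be compared in isolation. This is a minimal sketch, not pandas code: `old_result` and `new_result` are hypothetical helper names that copy only the return expression from the hunk above, and `res` stands in for the list of header cell texts.

```python
import numpy as np


def old_result(res):
    # Return expression before the patch (hypothetical wrapper).
    return np.array(res).squeeze() if res and len(res) == 1 else res


def new_result(res):
    # Return expression after the patch (hypothetical wrapper).
    return np.atleast_1d(
        np.array(res).squeeze()) if res and len(res) == 1 else res


single = ["Header"]

# squeeze() drops the only axis of a one-element array, leaving a 0-d array.
print(old_result(single).shape)  # ()

# atleast_1d restores one axis, so the result has a length again.
print(new_result(single).shape)  # (1,)

# Lists with zero or several items take the else branch and come back unchanged.
print(new_result(["a", "b"]))
print(new_result([]))
```

Only the single-element case changes; every other input follows the same path as before the patch.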
25 changes: 25 additions & 0 deletions pandas/io/tests/test_html.py
@@ -416,6 +416,31 @@ def test_empty_tables(self):
         res2 = self.read_html(StringIO(data2))
         assert_framelist_equal(res1, res2)

+    def test_header_and_one_column(self):
+        """
+        Don't fail with bs4 when there is a header and only one column
+        as described in issue #9178
+        """
+        data = StringIO('''<html>
Contributor

add the issue number as a comment

+            <body>
+             <table>
+                <thead>
Contributor

this doesn't seem to replicate the error message though:

In [1]:         data = StringIO('''<html>
   ...:             <body>
   ...:              <table>
   ...:                 <thead>
   ...:                     <tr>
   ...:                         <th>Header</th>
   ...:                     </tr>
   ...:                 </thead>
   ...:                 <tbody>
   ...:                     <tr>
   ...:                         <td>first</td>
   ...:                     </tr>
   ...:                 </tbody>
   ...:             </table>
   ...:             </body>
   ...:         </html>''')

In [2]: pd.read_html(data)
Out[2]: 
[  Header
 0  first]

In [3]: pd.read_html(data)[0]
Out[3]: 
  Header
0  first

In [4]: pd.__version__
Out[4]: u'0.18.0'

Contributor

shouldn't this raise a similar error?

Author

@hnykda hnykda Apr 25, 2016

You forgot to add flavor='bs4'. When I do:

In [2]: pandas.__version__
Out[2]: '0.18.0'

In [3]: s = '''<html>
            <body>
             <table>
                <thead>
                    <tr>
                        <th>Header</th>
                    </tr>
                </thead>
                <tbody>
                    <tr>
                        <td>first</td>
                    </tr>
                </tbody>
            </table>
            </body>
        </html>'''

In [4]: pandas.read_html(s, flavor="bs4")
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-5f8f3ea79c02> in <module>()
----> 1 pandas.read_html(s, flavor="bs4")

/home/dan/.local/opt/miniconda3/envs/mathbs/lib/python3.5/site-packages/pandas/io/html.py in read_html(io, match, flavor, header, index_col, skiprows, attrs, parse_dates, tupleize_cols, thousands, encoding)
    868     _validate_header_arg(header)
    869     return _parse(flavor, io, match, header, index_col, skiprows,
--> 870                   parse_dates, tupleize_cols, thousands, attrs, encoding)

/home/dan/.local/opt/miniconda3/envs/mathbs/lib/python3.5/site-packages/pandas/io/html.py in _parse(flavor, io, match, header, index_col, skiprows, parse_dates, tupleize_cols, thousands, attrs, encoding)
    741                                       parse_dates=parse_dates,
    742                                       tupleize_cols=tupleize_cols,
--> 743                                       thousands=thousands))
    744         except StopIteration:  # empty table
    745             continue

/home/dan/.local/opt/miniconda3/envs/mathbs/lib/python3.5/site-packages/pandas/io/html.py in _data_to_frame(data, header, index_col, skiprows, parse_dates, tupleize_cols, thousands)
    622 
    623     # fill out elements of body that are "ragged"
--> 624     _expand_elements(body)
    625 
    626     tp = TextParser(body, header=header, index_col=index_col,

/home/dan/.local/opt/miniconda3/envs/mathbs/lib/python3.5/site-packages/pandas/io/html.py in _expand_elements(body)
    599 
    600 def _expand_elements(body):
--> 601     lens = Series(lmap(len, body))
    602     lens_max = lens.max()
    603     not_max = lens[lens != lens_max]

/home/dan/.local/opt/miniconda3/envs/mathbs/lib/python3.5/site-packages/pandas/compat/__init__.py in lmap(*args, **kwargs)
    116 
    117     def lmap(*args, **kwargs):
--> 118         return list(map(*args, **kwargs))
    119 
    120     def lfilter(*args, **kwargs):

TypeError: len() of unsized object

With the patched version it works:

In [3]: import pandas

In [4]: pandas.__version__
Out[4]: '0.18.0+145.g9b6f9f2'

In [5]: s = '''<html>
            <body>
             <table>
                <thead>
                    <tr>
                        <th>Header</th>
                    </tr>
                </thead>
                <tbody>
                    <tr>
                        <td>first</td>
                    </tr>
                </tbody>
            </table>
            </body>
        </html>'''

In [6]: pandas.read_html(s, flavor="bs4")
Out[6]: 
[  Header
 0  first]
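
The traceback can be reproduced with NumPy alone, which may help show where the `TypeError` comes from. The sketch below is an illustration under assumptions: `body` is a simplified stand-in for the rows that reach `_expand_elements`, with the header row placed first, and is not the actual structure pandas builds.

```python
import numpy as np

# A single header cell squeezed to a 0-d array, as the unpatched expression produces.
header = np.array(["Header"]).squeeze()

# Assumed, simplified stand-in for the rows handed to _expand_elements.
body = [header, ["first"]]

raised = False
try:
    list(map(len, body))  # same call shape as lmap(len, body) in the traceback
except TypeError as exc:
    raised = True
    print("TypeError:", exc)

# Wrapping the header with np.atleast_1d gives it a length again.
lens = list(map(len, [np.atleast_1d(header), ["first"]]))
print(lens)  # [1, 1]
```

The lxml flavor is not exercised here; the sketch only mirrors the `len()` call that fails in the bs4 traceback.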

+                    <tr>
+                        <th>Header</th>
+                    </tr>
+                </thead>
+                <tbody>
+                    <tr>
+                        <td>first</td>
+                    </tr>
+                </tbody>
+            </table>
+            </body>
+        </html>''')
+        expected = DataFrame(data={'Header': 'first'}, index=[0])
+        result = self.read_html(data)[0]
+        tm.assert_frame_equal(result, expected)
+
     def test_tfoot_read(self):
         """
         Make sure that read_html reads tfoot, containing td or th.