Crash on read_html(url, flavor="bs4") if table has only one column #9178
I've been investigating this problem and narrowed it down. Basically, if you use BeautifulSoup4 as the backend and the table header has only one column, `_parse_raw_thead` causes the aforementioned error. In my case it happened because the table body contained multiple elements with the same id, which caused lxml to error out and pandas to fall back to bs4. A minimal table that triggers it:

```html
<table>
  <thead>
    <tr>
      <th>Header</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>first</td>
    </tr>
  </tbody>
</table>
```
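For reference, a minimal repro sketch of parsing that table with `pandas.read_html`. The crash described here required the `flavor="bs4"` path (which needs beautifulsoup4 and html5lib installed); with the default flavor, or on a pandas release that includes the fix, this parses cleanly into a one-column DataFrame:

```python
import io
import pandas as pd

# The one-column table from the comment above.
html = """
<table>
  <thead>
    <tr><th>Header</th></tr>
  </thead>
  <tbody>
    <tr><td>first</td></tr>
  </tbody>
</table>
"""

# On affected pandas versions, adding flavor="bs4" here crashed on
# one-column tables; a fixed pandas returns a one-column DataFrame.
tables = pd.read_html(io.StringIO(html))
df = tables[0]
print(df)
```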
Looks like a bug. Care to do a pull request?
I would, but I'm not familiar enough with the code to fix it. The author obviously had a reason to do
instead of simply returning
This could be fixed by adding `np.atleast_1d` after the squeeze call.
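A minimal sketch of the proposed fix in isolation (plain numpy, not the actual pandas internals): squeezing a one-element header array collapses it to a 0-d array, and `np.atleast_1d` restores the lost dimension so `len()` works again.

```python
import numpy as np

# A one-column header row, as the parser would collect it.
header = np.asarray(['Tapahtumat'])

# squeeze() drops the single dimension, leaving a 0-d array;
# len(squeezed) would raise TypeError.
squeezed = header.squeeze()

# np.atleast_1d brings it back to shape (1,), so len() is safe.
fixed = np.atleast_1d(squeezed)
print(len(fixed))
```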
Here's a pull request to fix this.
I was trying to read a package tracking table from the Finnish post office's website and I got
I isolated the offending table into this script:
https://gist.github.com/boarpig/de4044f4188fac700c68
The problem seems to be related to the `_parse_raw_thead` function, where `res` contains `['Tapahtumat']`, which comes out of the numpy array creation as `array('Tapahtumat', dtype='<U10')`. That then produces the previously mentioned error, because you cannot take `len()` of it.
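The behaviour described above can be reproduced with numpy alone: squeezing a single-element string array yields a 0-d array, and calling `len()` on it raises `TypeError` ("len() of unsized object").

```python
import numpy as np

# Single-column header value, dtype '<U10', shape (1,).
res = np.asarray(['Tapahtumat'])

# squeeze() produces the 0-d array('Tapahtumat', dtype='<U10').
squeezed = res.squeeze()

# len() of a 0-d array raises TypeError.
try:
    len(squeezed)
except TypeError as exc:
    err = exc
```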