Crash on read_html(url, flavor="bs4") if table has only one column #9178
I've been investigating this problem and narrowed it down. Basically, if you use BeautifulSoup4 as the backend and the table header has only one column, `_parse_raw_thead` causes the aforementioned error. In my case it happened because the table body contained multiple elements with the same id, which caused lxml to error out and pandas to fall back to bs4. A minimal table that triggers it:

```html
<table>
  <thead>
    <tr>
      <th>Header</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>first</td>
    </tr>
  </tbody>
</table>
```
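For reference, a minimal repro sketch of parsing that table with `pandas.read_html`. The crash described here required the `flavor="bs4"` path (which needs beautifulsoup4 and html5lib installed); with the default flavor, or on a pandas release that includes the fix, this parses cleanly into a one-column DataFrame:

```python
import io
import pandas as pd

# The one-column table from the comment above.
html = """
<table>
  <thead>
    <tr><th>Header</th></tr>
  </thead>
  <tbody>
    <tr><td>first</td></tr>
  </tbody>
</table>
"""

# On affected pandas versions, adding flavor="bs4" here crashed on
# one-column tables; a fixed pandas returns a one-column DataFrame.
tables = pd.read_html(io.StringIO(html))
df = tables[0]
print(df)
```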
Looks like a bug. Care to do a pull request?
I would, but I'm not familiar enough with the code to fix it. The author obviously had a reason to do
instead of simply returning
This could be fixed by adding `np.atleast_1d` after the squeeze call.
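A minimal sketch of the proposed fix in isolation (plain numpy, not the actual pandas internals): squeezing a one-element header array collapses it to a 0-d array, and `np.atleast_1d` restores the lost dimension so `len()` works again.

```python
import numpy as np

# A one-column header row, as the parser would collect it.
header = np.asarray(['Tapahtumat'])

# squeeze() drops the single dimension, leaving a 0-d array;
# len(squeezed) would raise TypeError.
squeezed = header.squeeze()

# np.atleast_1d brings it back to shape (1,), so len() is safe.
fixed = np.atleast_1d(squeezed)
print(len(fixed))
```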
Here's a pull request to fix this.
I was trying to read a package tracking table from the Finnish post office's website and I got
I isolated the offending table into this script:
https://gist.github.com/boarpig/de4044f4188fac700c68
The problem seems to be related to the `_parse_raw_thead` function, where `res` contains `['Tapahtumat']`, which comes out of the numpy array creation as `array('Tapahtumat', dtype='<U10')`. That then produces the previously mentioned error, because you cannot take `len()` of it.
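The behaviour described above can be reproduced with numpy alone: squeezing a single-element string array yields a 0-d array, and calling `len()` on it raises `TypeError` ("len() of unsized object").

```python
import numpy as np

# Single-column header value, dtype '<U10', shape (1,).
res = np.asarray(['Tapahtumat'])

# squeeze() produces the 0-d array('Tapahtumat', dtype='<U10').
squeezed = res.squeeze()

# len() of a 0-d array raises TypeError.
try:
    len(squeezed)
except TypeError as exc:
    err = exc
```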