Python Pandas read_html fails when reading tables from Wikipedia #21499

Open
dscience7 opened this issue Jun 15, 2018 · 7 comments
Labels: Bug, IO HTML (read_html, to_html, Styler.apply, Styler.applymap)

Comments

@dscience7

I am trying to read the tables from a Wikipedia page using the following code:

import pandas as pd
pd.read_html('https://en.wikipedia.org/wiki/2013–14_Premier_League')

Doing that generates the following error:

UnicodeEncodeError: 'ascii' codec can't encode character '\u2013' in position 14: ordinal not in range(128)

I have tried

pd.read_html('https://en.wikipedia.org/wiki/2013–14_Premier_League', encoding='utf-8')

But I still get the same error. The following works:

import requests
r = requests.get('https://en.wikipedia.org/wiki/2017–18_Premier_League')
c = r.content
dfs = pd.read_html(c)

What I want to know is how to get pd.read_html() to work directly on the url without requests. What is it that I don't understand about encoding or is this a problem with Pandas?

I am running an Anaconda distribution of Pandas 0.21.1 and Python 3.5.4. Thanks for any help.

@WillAyd
Member

WillAyd commented Jun 15, 2018

Hmm interesting. Looks like this is still an issue on master even specifying the encoding to be used:

>>> pd.read_html('https://en.wikipedia.org/wiki/2013–14_Premier_League', encoding='utf-8')
UnicodeEncodeError: 'ascii' codec can't encode character '\u2013' in position 14: ordinal not in range(128)

Investigation and PRs are always welcome

WillAyd added the Bug and IO HTML (read_html, to_html, Styler.apply, Styler.applymap) labels on Jun 15, 2018
@hannah2048

https://stackoverflow.com/questions/39229439/encoding-error-when-reading-url-with-urllib

As this similar question shows, urllib only accepts URLs made up of ASCII characters. As a workaround, I used the Requests library (http://docs.python-requests.org/en/master/).
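For anyone wondering where "position 14" in the traceback comes from, here is a minimal sketch that reproduces the message without any network access. It assumes that Python 3's http.client encodes the HTTP request line as ASCII before sending it; the `path`, `request_line`, and `error` names are only illustrative.

```python
# Assumption: http.client ASCII-encodes the request line ("GET <path> HTTP/1.1").
path = "/wiki/2013\u201314_Premier_League"  # \u2013 is the en dash in the page title
request_line = f"GET {path} HTTP/1.1"

try:
    request_line.encode("ascii")
    error = None
except UnicodeEncodeError as exc:
    error = exc  # keep a reference; "exc" is cleared when the except block ends

print(error)
# 'ascii' codec can't encode character '\u2013' in position 14: ordinal not in range(128)
```

The en dash is the 15th character of the request line (index 14), which would explain why the reported position does not match its index in the full URL.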

@Liam3851
Contributor

FWIW the sample call works fine under Python 2.7.15 but not Python 3.6.5. Choice of engine doesn't matter, however.

@StepanSushko

I used the following solution:

import pandas as pd
import requests

url = "https://ru.wikipedia.org/wiki/Города_России_с_населением_более_500_тысяч_человек"
# Placeholder credentials; public pages such as Wikipedia do not require auth.
r = requests.get(url, auth=('user', 'pass'))
website = r.text  # response body decoded to str by requests

tables = pd.read_html(website, encoding="UTF-8")

City_pop = tables[4]

@jadore801120

@StepanSushko 's solution works for me.

@kouml

kouml commented Jul 23, 2022

I investigated this error, and I personally think it should be fixed. Let me take it and send a PR.
urllib requires every character in a URL to be ASCII, so if the URL contains multibyte characters, all you need to do is pass the escaped URL.
Also, urllib is part of Python's standard library.

Reproduce

import urllib.parse  # import the submodule explicitly; a bare "import urllib" may not expose it
import pandas as pd

url = "https://en.wikipedia.org/wiki/2013–14_Premier_League"
url_escaped = urllib.parse.quote_plus(url, "/:?=&")  # need this line
print(f"url:{url}")
print(f"escaped:{url_escaped}")
tables = pd.read_html(url_escaped, encoding="UTF-8")  # no error
tables = pd.read_html(url, encoding="UTF-8")  # raises UnicodeEncodeError

Output:
url:https://en.wikipedia.org/wiki/2013–14_Premier_League
escaped:https://en.wikipedia.org/wiki/2013%E2%80%9314_Premier_League

Solution

url = urllib.parse.quote_plus(url, "/:?=&") # need this line
tables = pd.read_html(url, encoding="UTF-8") # non-error

So we can just use urllib instead of requests.
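A slightly more defensive variant of the same idea, as a sketch only: split the URL first and percent-encode each component separately, keeping "%" safe so that an already escaped URL is not escaped twice. The helper name `to_ascii_url` is hypothetical and is not part of pandas or urllib. The host is left untouched, so non-ASCII domain names would still need separate (IDNA) handling.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def to_ascii_url(url: str) -> str:
    """Percent-encode non-ASCII characters in the path, query, and fragment.

    Hypothetical helper for illustration; the scheme and host are passed
    through unchanged.
    """
    parts = urlsplit(url)
    path = quote(parts.path, safe="/%")           # keep separators and existing escapes
    query = quote(parts.query, safe="=&%+")       # keep key=value&key=value structure
    fragment = quote(parts.fragment, safe="%")
    return urlunsplit((parts.scheme, parts.netloc, path, query, fragment))


escaped = to_ascii_url("https://en.wikipedia.org/wiki/2013\u201314_Premier_League")
print(escaped)
# https://en.wikipedia.org/wiki/2013%E2%80%9314_Premier_League
```

The result could then be passed to pd.read_html() in place of the raw URL. Unlike quote_plus, quote does not turn spaces into "+", which is arguably the safer choice for the path part of a URL.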

@kouml

kouml commented Jan 19, 2023

I'm going to try to implement Unicode handling, but because of Python's RFC-based design, this is treated as specified behavior rather than a bug.

Details are written in #50259.


7 participants