[BUG]Fix read_html error when URL include Unicode #50259

kouml · 2022-12-14T15:48:26Z

closes BUG: read_html produce UnicodeEncodeError for multibyte URL #47899 and Python Pandas read_html fails when reading tables from Wikipedia #21499
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Let's tackle the last item (the whatsnew entry) after this PR is approved.
I fixed the Unicode URL issue in the read_html function by percent-encoding the non-ASCII characters in the URL.

# url:https://en.wikipedia.org/wiki/2013–14_Premier_League
# escaped:https://en.wikipedia.org/wiki/2013%E2%80%9314_Premier_League
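
A minimal sketch of the kind of conversion described above, using only the standard library. The helper name `percent_encode_url` is hypothetical and is not the code from this PR; it assumes only the path and query need escaping and leaves the host untouched (a non-ASCII host would need IDNA handling, which is out of scope here).

```python
from urllib.parse import quote, urlsplit, urlunsplit


def percent_encode_url(url: str) -> str:
    """Percent-encode non-ASCII characters in the path and query of a URL.

    Hypothetical helper for illustration only, not the implementation
    proposed in this PR.
    """
    parts = urlsplit(url)
    # Keep "/" and existing "%" escapes so an already-escaped URL is not
    # escaped a second time.
    path = quote(parts.path, safe="/%")
    # Keep the usual query delimiters as well.
    query = quote(parts.query, safe="=&%")
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


url = "https://en.wikipedia.org/wiki/2013–14_Premier_League"
print(percent_encode_url(url))
# https://en.wikipedia.org/wiki/2013%E2%80%9314_Premier_League
```

A caller could apply such a helper to the URL before passing it to `pd.read_html`, which keeps the workaround outside pandas.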

kouml · 2022-12-20T14:53:37Z

Sorry, I'll fix some tests first.

kouml · 2022-12-20T20:02:53Z

The CI failure is just a rate-limit-exceeded issue. I'll rerun it later.

requests.exceptions.HTTPError: 403 Client Error: rate limit exceeded for url: https://api.github.com/users/wesm

kouml · 2022-12-21T15:33:07Z

Now the 32-bit Linux test has been canceled due to a timeout, which I think is being debugged in the PR below.
#50376

However, the failed and canceled tests are unrelated to this PR, so once the above test is fixed, I think this PR is fine.
Please review the PR when you have time. Thanks in advance.

kouml · 2022-12-23T17:20:03Z

All tests pass now. Could someone help review it? @phofl, @mroeschke

WillAyd · 2022-12-29T23:59:30Z

I'm not sure this is something we should be special-casing. Looks like there has been some upstream conversation on this already in Python

https://bugs.python.org/issue3991

Not an expert on the RFCs but I think we would just want to defer to the language

FYI, since the characters included are non-printable, the test case is a bit deceiving. If you remove the non-printable characters, everything works fine.

kouml · 2023-01-19T16:04:40Z

@WillAyd Sorry for the late reply; I was away for the New Year holidays.
Thanks for the RFC pointer. I hadn't noticed that, and it makes sense to me.
I agree with your suggestion: it's better to defer to the Python language.

Let's close the related issues and pull requests, and clearly state that this is the specified behavior and not a bug.

kouml added 2 commits December 15, 2022 00:16

Fix unicode error

0cba21b

fix order

c3a0780

kouml changed the title ~~[BUG]Fix unicode error~~ [BUG]Fix read_html error when URL include Unicode Dec 14, 2022

kouml and others added 9 commits December 15, 2022 00:55

fix E501

e16efa9

fix pre-commit

153a75b

fix mypy

a922bbf

add cast for mypy

def7e41

fix order

fdff414

fix type

78dde66

remove unused type

e7ff0a8

Merge branch 'main' into fix_bug_unicode_read_html

5cdf08d

Merge branch 'main' into fix_bug_unicode_read_html

e39f612

kouml marked this pull request as draft December 20, 2022 14:53

kouml added 2 commits December 21, 2022 02:24

fix Union type

8a0f56a

Merge branch 'main' into fix_bug_unicode_read_html

1e5ae6a

kouml marked this pull request as ready for review December 20, 2022 18:10

Merge branch 'main' into fix_bug_unicode_read_html

0bd3b2d

mroeschke added the IO HTML read_html, to_html, Styler.apply, Styler.applymap label Dec 20, 2022

Merge branch 'main' into fix_bug_unicode_read_html

189ee40

Merge branch 'main' into fix_bug_unicode_read_html

3e78b9d

kouml closed this Jan 19, 2023

This was referenced Jan 19, 2023

BUG: read_html produce UnicodeEncodeError for multibyte URL #47899

Closed

Python Pandas read_html fails when reading tables from Wikipedia #21499

Open
