Skip to content

[BUG]Fix read_html error when URL include Unicode #50259

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 16 commits into from

Conversation

kouml
Copy link

@kouml kouml commented Dec 14, 2022

Let's tackle the last one after getting the approval for this PR.
I fixed the unicode URL issue for read_html function by converting the unicode-style URL.

# url:https://en.wikipedia.org/wiki/2013–14_Premier_League
# escaped:https://en.wikipedia.org/wiki/2013%E2%80%9314_Premier_League

@kouml kouml changed the title [BUG]Fix unicode error [BUG]Fix read_html error when URL include Unicode Dec 14, 2022
@kouml kouml marked this pull request as draft December 20, 2022 14:53
@kouml
Copy link
Author

kouml commented Dec 20, 2022

Sorry, I'll fix some tests first.

@kouml kouml marked this pull request as ready for review December 20, 2022 18:10
@mroeschke mroeschke added the IO HTML read_html, to_html, Styler.apply, Styler.applymap label Dec 20, 2022
@kouml
Copy link
Author

kouml commented Dec 20, 2022

CI failure is just rate limit exceeded issue. I'll rerun later.

requests.exceptions.HTTPError: 403 Client Error: rate limit exceeded for url: https://api.github.com/users/wesm

@kouml
Copy link
Author

kouml commented Dec 21, 2022

Now, 32 Bit Linux test is canceled due to a timeout issue, and I think it is debugging in the below PR.
#50376

however, failed and canceled tests are irrelevant issues for this PR, so after fixing the above test, I think this PR is fine.
so, Please review the PR when you have time. Thanks in advance.

@kouml
Copy link
Author

kouml commented Dec 23, 2022

All tests are passed now. Could someone help me to review it?@phofl, @mroeschke

@WillAyd
Copy link
Member

WillAyd commented Dec 29, 2022

I'm not sure this is something we should be special-casing. Looks like there has been some upstream conversation on this already in Python

https://bugs.python.org/issue3991

Not an expert on the RFCs but I think we would just want to defer to the language

FYI since the characters included are non-printable the test case is a bit deceiving. If you remove the non-printable characters everything works fine

@kouml
Copy link
Author

kouml commented Jan 19, 2023

@WillAyd Sorry for the late reply due to my new year holidays.
Thanks for the RFC confirmation, I didn't notice that. however, That makes sense to me.
I agreed with your suggestion, so It's better to defer to the Python language.

Let's close the related issues or pull requests, and Let's clearly state this is the spec and not the bug.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO HTML read_html, to_html, Styler.apply, Styler.applymap
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: read_html produce UnicodeEncodeError for multibyte URL
4 participants