ENH: Add support to read_json to encode character escape hex codes to utf-8 characters #41521
Is this a pandas issue? The raw data itself already contains these bracketed sequences:

In [26]: from requests import get

In [27]: data = get("http://public-r-data.s3-website-us-east-1.amazonaws.com/ghcran.json").json()

In [28]: data[884]['Title']
Out[28]: 'Misc Functions of Eduard Sz<c3><b6>cs'

In [29]: data[213]['Author']
Out[29]: 'Kirill M<c3><bc>ller [aut, cre]'

In [30]: data[336]['Maintainer']
Out[30]: 'H<c3><a9>l<c3><a8>ne Morlon <[email protected]>'
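As a quick sanity check (a minimal sketch, not part of the original comment; variable names are illustrative), the literal "<hh>" sequences can be found in the raw response text before any JSON parsing:

from requests import get

# Fetch the payload as plain text, without any JSON decoding
raw_text = get(
    "http://public-r-data.s3-website-us-east-1.amazonaws.com/ghcran.json"
).text

# The bracketed hex placeholders are literal characters in the source file
print('<c3><b6>' in raw_text)  # True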
Hmmm... good point @asishm. Confirmed with the standard library:

import urllib.request
import json

rdata_url = "http://public-r-data.s3-website-us-east-1.amazonaws.com/ghcran.json"
with urllib.request.urlopen(rdata_url) as req:
    rdata = json.loads(req.read())

print(rdata[884]['Title'])
# Misc Functions of Eduard Sz<c3><b6>cs

print(rdata[213]['Author'])
# Kirill M<c3><bc>ller [aut, cre]

print(rdata[336]['Maintainer'])
# H<c3><a9>l<c3><a8>ne Morlon <[email protected]>

Given we now know this is not a pandas bug but general JSON handling in Python, should we consider an enhancement in read_json?
(The issue title was changed from "read_json renders angle bracketed hex codes for accent utf-8 characters" to "ENH: Add support to read_json to encode character escape hex codes to utf-8 characters".)
Is your feature request related to a problem?

See above problem description. Currently, read_json renders the angle-bracketed hex codes as-is rather than the accented characters they represent.

Describe the solution you'd like

See above expected output. Ideally, read_json would render the true accented characters directly.

API breaking implications

No new extra arguments or wrapper methods in IO JSON, but simply an additional re-encoding step in the underlying parsing to DataFrame that converts accented characters so they appear as they should. Possibly, too, the counterpart for to_json.

Describe alternatives you've considered

The current method (from the posted solution in the StackOverflow post) uses a user-defined function applied column-wise with apply:

def reencode(string):
    def is_hex(i):
        # ASCII codes for digits 0-9 (48-57) and lowercase a-f (97-102)
        return (48 <= i <= 57) or (97 <= i <= 102)

    def to_bytes(arr):
        # Detect a literal 4-character "<hh>" token: '<' (60), two hex
        # characters, '>' (62); return the single byte the token names
        if len(arr) != 4:
            return None
        a, b, c, d = arr
        if a == 60 and is_hex(b) and is_hex(c) and d == 62:
            return bytes.fromhex(chr(b) + chr(c))
        return None

    old = string.encode('ascii')
    new = bytearray()
    i = 0
    while i < len(old):
        b = to_bytes(old[i:i+4])
        if b:
            # Replace the 4-character "<hh>" token with its byte
            new.extend(b)
            i += 4
        else:
            # Copy ordinary characters through unchanged
            new.append(old[i])
            i += 1
    try:
        # Decode the rebuilt byte sequence as UTF-8
        retvalue = new.decode('utf8')
    except UnicodeDecodeError:
        # Fall back to the original string if the bytes are not valid UTF-8
        retvalue = string
    return retvalue
json_df["Title"] = json_df["Title"].apply(reencode)
json_df["Author"] = json_df["Author"].apply(reencode)
json_df["Maintainer"] = json_df["Maintainer"].apply(reencode)
json_df.loc[884, "Title"]
# Misc Functions of Eduard Szöcs
json_df.loc[213, "Author"]
# Kirill Müller [aut, cre]
json_df.loc[336, "Maintainer"]
# Hélène Morlon <[email protected]>
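A more compact variant of the same workaround (a sketch, not from the original issue; it assumes the placeholders are always exactly "<" plus two lowercase hex digits plus ">") replaces each token with a regex and re-encodes every object column at once:

import re

# Matches a literal "<hh>" placeholder: '<', two lowercase hex digits, '>'
_HEX_TOKEN = re.compile(r'<([0-9a-f]{2})>')

def reencode_regex(value):
    # Leave non-strings and strings without placeholders untouched
    if not isinstance(value, str) or '<' not in value:
        return value
    try:
        # Swap each "<hh>" token for the single byte it names, then decode
        # the rebuilt byte sequence as UTF-8
        raw = _HEX_TOKEN.sub(lambda m: chr(int(m.group(1), 16)), value).encode('latin-1')
        return raw.decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return value

for col in json_df.select_dtypes(include='object').columns:
    json_df[col] = json_df[col].map(reencode_regex)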
Since it appears to be an issue in urllib, it would be best if this were addressed upstream in Python, so closing.
Code Sample
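(Sketch of a reproduction, reconstructed from the examples used elsewhere in this issue; the json_df name matches the workaround shown in the comments above.)

import pandas as pd

rdata_url = "http://public-r-data.s3-website-us-east-1.amazonaws.com/ghcran.json"
json_df = pd.read_json(rdata_url)

json_df.loc[884, "Title"]
# Misc Functions of Eduard Sz<c3><b6>cs
json_df.loc[213, "Author"]
# Kirill M<c3><bc>ller [aut, cre]
json_df.loc[336, "Maintainer"]
# H<c3><a9>l<c3><a8>ne Morlon <[email protected]>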
Problem description
All UTF-8 encoded text should render in its true character form, not as a hex-code representation. Users need reliable, consistent text values when moving from JSON into pandas containers. Looking at some source code, IO JSON appears to use the ujson C extension, which may be causing the behavior above. I posted the same question on StackOverflow and received an interesting workaround answer.

For reference, see https://www.utf8-chartable.de/ for a table of hex codes and their corresponding characters. For the c2-c9 codes, use the drop-down values (a quick check of the mapping follows the list):
U+0000 ... U+007F: Basic Latin
U+0080 ... U+00FF: Latin-1 Supplement
U+0100 ... U+017F: Latin Extended-A
U+0180 ... U+024F: Latin Extended-B
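A quick check of the mapping referenced above (a sketch, not part of the original report): the byte pairs named by the placeholders decode under UTF-8 to exactly the accented characters expected in the data.

print(bytes.fromhex('c3b6').decode('utf-8'))  # ö  (as in Sz<c3><b6>cs)
print(bytes.fromhex('c3bc').decode('utf-8'))  # ü  (as in M<c3><bc>ller)
print(bytes.fromhex('c3a9').decode('utf-8'))  # é  (as in H<c3><a9>l<c3><a8>ne)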
Expected Output

All angle-bracketed hex codes rendered as their true accented or diacritical characters.
Output of pd.show_versions()
INSTALLED VERSIONS
commit : fa3dbc1
python : 3.9.2.final.0
python-bits : 64
OS : Linux
OS-release : 5.8.0-50-generic
Version : #56~20.04.1-Ubuntu SMP Mon Apr 12 21:46:35 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.3.0.dev0+1672.gfa3dbc117f
numpy : 1.20.2
pytz : 2021.1
dateutil : 2.8.1
pip : 21.1.1
setuptools : 49.6.0.post20210108
Cython : 0.29.23
pytest : 6.2.3
hypothesis : 6.10.1
sphinx : 3.5.4
blosc : None
feather : None
xlsxwriter : 1.4.0
lxml.etree : 4.6.3
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.3
IPython : 7.23.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : 2021.04.0
fastparquet : 0.5.0
gcsfs : 2021.04.0
matplotlib : 3.4.1
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : 4.0.0
pyxlsb : None
s3fs : 2021.04.0
scipy : 1.6.3
sqlalchemy : 1.4.12
tables : 3.6.1
tabulate : 0.8.9
xarray : 0.17.0
xlrd : 2.0.1
xlwt : 1.3.0
numba : 0.53.1