Skip to content

Bug with read_json from str reporting "Protocol not known" #43594

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Peterl777 opened this issue Sep 16, 2021 · 7 comments
Closed

Bug with read_json from str reporting "Protocol not known" #43594

Peterl777 opened this issue Sep 16, 2021 · 7 comments
Labels
Bug Duplicate Report Duplicate issue or pull request IO JSON read_json, to_json, json_normalize

Comments

@Peterl777
Copy link

read_json is supposed to parse from a str or from a file-like object. If parsing from a value string has the text http: then pandas crashes with a ValueError

Steps to reproduce:

>>> import json
>>> import pandas as pd
>>> s = '["http://www.example.com"]'
>>> json.loads(s)
['http://www.example.com']          # Proves the JSON is valid
>>> pd.read_json(s)
....
ValueError: Protocol not known: ["http

Work around 1:

Write the string to a file.

>>> with open('test.json', 'w') as f:
...     f.write(s)
>>> pd.read_json('test.json') 
                        0
0  http://www.example.com

Work around 2:

Wrap the string into a file-like object.

>>> import io
>>> pd.read_json(io.StringIO(s))
                        0
0  http://www.example.com

Versions

pandas 1.3.3 on Windows

>>> pd.show_versions()
INSTALLED VERSIONS
------------------
commit           : 73c68257545b5f8530b7044f56647bd2db92e2ba
python           : 3.9.7.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
Version          : 10.0.19043
machine          : AMD64
processor        : AMD64 Family 25 Model 33 Stepping 0, AuthenticAMD
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : English_Australia.1252

pandas           : 1.3.3
numpy            : 1.21.2
pytz             : 2021.1
dateutil         : 2.8.2
pip              : 21.2.4
setuptools       : 58.0.4
Cython           : 0.29.24
pytest           : None
hypothesis       : None
sphinx           : 4.2.0
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 3.0.1
IPython          : 7.27.0
pandas_datareader: None
bs4              : None
bottleneck       : None
fsspec           : 2021.08.1
fastparquet      : None
gcsfs            : None
matplotlib       : 3.4.3
numexpr          : 2.7.3
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pyxlsb           : None
s3fs             : None
scipy            : 1.7.1
sqlalchemy       : None
tables           : 3.6.1
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
numba            : None
@mzeitlin11
Copy link
Member

Thanks for reporting this @Peterl777. Does this fall under the same umbrella as #36271?

@mzeitlin11 mzeitlin11 added Bug Duplicate Report Duplicate issue or pull request IO JSON read_json, to_json, json_normalize labels Sep 16, 2021
@Peterl777
Copy link
Author

Ah I hadn't seen that one. I searched, but didn't come across #36271. Probable duplicate. Appears to be the same.

But I don't think @steve-mavens proposed fix would fix my issue, as my string is in fact a valid URL.
So there are perhaps two issues - a string that happens to contain ://, and a string that contains a valid URL.

I note the error message in that report has changed - that report shows "ImportError: Missing optional dependency 'fsspec'". But that code and my code now both report "ValueError: Protocol not known".

I haven't delved into pandas internal, but I'm not clear why is_fsspec_url is even being called here?

@steve-mavens
Copy link

steve-mavens commented Sep 16, 2021

@Peterl777: I don't think your string is a valid URL. It is ["http://www.example.com"], so if it were a URL then its scheme would be ["http. Neither [ nor " is permitted in the scheme part of a URL (or in general a URI), per RFC 3986. My concern on the other ticket was just that I don't know fsspec, and so I don't know whether it accepts some things that strictly aren't valid URLs. Clearly it doesn't accept this one!

The different error message is because you happen not to have fsspec installed. It is an optional dependency of pandas, so it isn't installed automatically, but people who have it installed will see the "protocol not known" error because their string is passed to fsspec, which sees the scheme ["http (or whatever it might be for their input) and rejects it.

The reason is_fsspec_url is called, is so that any URL that fsspec knows how to download can be passed into read_json. So, if you pass in http://www.example.com then pandas can handle that without fsspec (not that it points to json data, but pandas will try). If you pass in s3://example-bucket/my_key.json then it uses fsspec to download it. The problem is that the current code is wildly optimistic about fsspec's ability to download stuff that cannot possibly be a URL! The check to identify (or in this case mis-identify) URLs is performed before pandas JSON-decodes anything.

@Peterl777
Copy link
Author

@steve-mavens Yes, true, but this is a JSON string, so it's an array (list) containing a string (JSON is always double-quoted), that contains a URL. So the actual URL is http://www.example.com. So pandas checks URLS before it decodes the JSON? I would be surprised that pandas would download or attempt to download anything that looked like a URL. For me, it's just text. (My real-world data had the URL in the middle of a bit of HTML which was the string in the JSON.)

Oh I didn't know that about the optional module. Will have to look into that.

But it doesn't explain the different behaviour if the JSON is coming from a file? (Or from a StringIO pretending to look like a file). That is - it works fine if it's in a file.

@steve-mavens
Copy link

steve-mavens commented Sep 16, 2021

read_json accepts either JSON data or a URL, and it tries (unsuccessfully) to auto-detect which you have passed. It also accepts a file-like object, in which case there is no ambiguity and it reads JSON data from it. The string that it is trying to identify as a url or not is ["http://www.example.com"], not http://www.example.com as you claim. If you don't trust me then check for yourself in a debugger ;-)

Possibly it is surprising, but the behaviour is documented: "The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. A local file could be: file://localhost/path/to/table.json.". The details of how it decides whether you've passed it a URL or JSON to parse are not documented.

@Peterl777
Copy link
Author

Peterl777 commented Sep 16, 2021

Ah! I get it. Wow, yes, it is getting the guessing of URL pretty wrong! And that explains why it's different when given a file (or file-like object) containing that string. And so yes, you're proposal appears to be likely to work! (with the caveat I haven't tried it, and like you don't know of any possible repercussions).

I'll mark this as a duplicate then, and make a note over at #36271.

FYI, the actual JSON string had a URL buried many levels deep, in the middle of a long string that contained HTML that had an <a href:

{
    "date": "2021-09-16",
    "time": "9:45am",
    "title": "Venues",
    "data": {
        "monitor": [
            {"Venue": "Alpine Hotel","Address":
             ...(5 more tags)... 
             "HealthAdviceHTML": "Anyone who attended this venue is a <a href='https://www.
             ...

@Peterl777
Copy link
Author

Duplicate of #36271

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Duplicate Report Duplicate issue or pull request IO JSON read_json, to_json, json_normalize
Projects
None yet
Development

No branches or pull requests

3 participants