BUG: since pandas==1.1.0 pd.read_json() fails for strings that look similar to fsspec_url #36271
Comments
pd.read_json() currently fails for strings that look like an fsspec_url, i.e. that contain "://". This adds another condition to fix that, at least in most cases.
This problem is a pain when you have URLs in your data. That's not happening with big files with one JSON object per row, and the file is not enclosed with
You can do:
The issue still persists in Pandas 1.2.5. Downgrading to 1.0.5 fixes it, because that version has no fsspec_url check like the newer versions do.
The offending code is in pandas/io/common.py:
Compare this with the more conservative:
So perhaps a fix would be to use However, I don't know why
The issue is still persistent in Pandas 1.3.3. Possible short-term workarounds are at #43594. Perhaps a safer change would be to change
    and "://" in url
    and not url.startswith(("http://", "https://"))
with
    and bool(re.match(r'\w+://', url))
    and not url.startswith(("http://", "https://"))
This would cater for I was thinking of something like:
    and not url.startswith(tuple(_VALID_URLS))
but that would disallow Some doctests in
    """
    Returns true if the given URL looks like
    something fsspec can handle.
    >>> is_fsspec_url('http://www.example.com/file.ext')
    False
    >>> is_fsspec_url('s3://example.com/file.ext')
    True
    >>> is_fsspec_url('file://file.ext')
    True
    """
might be a good idea?
I think Pandas isn't supposed to need to know what schemes fsspec handles, because fsspec might add support for new things in future and pandas wants to inherit that automatically. So any kind of test There is a PR buried in the history above, #36273 which tried to be clever about this auto-detection, but it was abandoned after taking a couple of different directions. That's the reason I suggested an extremely trivial (non-clever) option, based on the reasoning that if the
The current
    and "://" in url
by making that more restrictive. We could use
    and bool(re.match(r'\w+://', url))
but there's no
    # There's a :// in the URL, and if there's a double-quote, it's after the protocol
    and '://' in url and (url.find('"') > url.find('://')) if '"' in url else True
I also haven't submitted any patches to pandas, but am happy to, or to work with you to.
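The quote-position idea can be sketched as a standalone function. The name `looks_like_url_before_quote` is hypothetical, and the http(s) exclusion is carried over from the check quoted earlier in the thread. One detail worth noting: a conditional expression (`x if c else y`) binds more loosely than `and`, so in this sketch it is computed on its own line instead of being chained inline.

```python
def looks_like_url_before_quote(url):
    # Hypothetical helper, not pandas code.
    if not isinstance(url, str) or "://" not in url:
        return False
    # If there is a double quote at all, it must come after the "://".
    # Evaluated separately so the if/else does not swallow the other conditions.
    quote_ok = (url.find('"') > url.find("://")) if '"' in url else True
    return quote_ok and not url.startswith(("http://", "https://"))


print(looks_like_url_before_quote('{"link": "s3://bucket/key"}'))
print(looks_like_url_before_quote("s3://bucket/key"))
```

This heuristic rejects JSON text whose first quote precedes the URL, but it would still accept unquoted text that merely contains "://" somewhere in the middle.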
We could perhaps submit two PRs, one using For the more precise technique with a regex: '\w' isn't exactly right, because the BNF from RFC 3986 is I think in my own code I'll probably just stick in the
I can resurrect a PR; I'd love this fixed.
Yes, the
    and bool(re.match(r'[A-Za-z0-9+.\-]+://', url))
(The dash doesn't need escaping, but I do like to do it for safety)
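For comparison, RFC 3986 defines a scheme as a letter followed by any number of letters, digits, "+", "-" or ".". A slightly stricter pattern that also enforces the leading letter could look like the sketch below. The names `SCHEME_PREFIX`, `LOOSE_PREFIX` and `has_rfc3986_scheme` are invented for this example and are not taken from pandas.

```python
import re

# Strict form: first character must be a letter, per the RFC 3986 scheme rule.
SCHEME_PREFIX = re.compile(r"[A-Za-z][A-Za-z0-9+.\-]*://")

# The character-class pattern from the comment above, for comparison.
LOOSE_PREFIX = re.compile(r"[A-Za-z0-9+.\-]+://")


def has_rfc3986_scheme(url):
    # Hypothetical helper: True when the string starts with "<scheme>://".
    return isinstance(url, str) and bool(SCHEME_PREFIX.match(url))


for candidate in ["s3://bucket/key", "git+ssh://host/repo",
                  '{"a": "s3://bucket/key"}', "1abc://x"]:
    print(candidate, has_rfc3986_scheme(candidate))
```

The two patterns only differ on inputs whose would-be scheme starts with a digit, "+", "." or "-". Both reject JSON text, since it cannot start with a scheme character followed directly by "://".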
This issue is impacting us right now. I've opened a PR that checks the scheme according to the RFC mentioned by @steve-mavens
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
output:
Problem description
The method pd.read_json() is widely used and accepts either a path or a JSON string. Since pandas==1.1.0, passing a string containing JSON input is often interpreted as an fsspec_url and results in an error.
Expected Output
Output of pd.show_versions()