
BUG: since pandas==1.1.0 pd.read_json() fails for strings that look similar to fsspec_url #36271


Closed · 3 tasks done
tbachlechner opened this issue Sep 10, 2020 · 10 comments · Fixed by #44619
Labels: Bug, IO JSON (read_json, to_json, json_normalize)

@tbachlechner

tbachlechner commented Sep 10, 2020

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import pandas as pd

input_json = '[{"0":"this is a string ://"}]'

print('Input json string: {}'.format(input_json))
print('URL? {}'.format(str(pd.io.common.is_url(input_json))))
print('fsspec? {}'.format(str(pd.io.common.is_fsspec_url(input_json))))

print(pd.read_json(input_json))

output:

Input json string: [{"0":"this is a string ://"}]
URL? False
fsspec? True
---------------------------------------------------------------------------

ImportError                               Traceback (most recent call last)

<ipython-input-1-f3bc94ba133f> in <module>
      7 print('fsspec? {}'.format(str(pd.io.common.is_fsspec_url(input_json))))
      8 
----> 9 pd.read_json(input_json)

~/miniconda3/envs/eltest/lib/python3.7/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    197                 else:
    198                     kwargs[new_arg_name] = new_arg_value
--> 199             return func(*args, **kwargs)
    200 
    201         return cast(F, wrapper)

~/miniconda3/envs/eltest/lib/python3.7/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    294                 )
    295                 warnings.warn(msg, FutureWarning, stacklevel=stacklevel)
--> 296             return func(*args, **kwargs)
    297 
    298         return wrapper

~/miniconda3/envs/eltest/lib/python3.7/site-packages/pandas/io/json/_json.py in read_json(path_or_buf, orient, typ, dtype, convert_axes, convert_dates, keep_default_dates, numpy, precise_float, date_unit, encoding, lines, chunksize, compression, nrows)
    592     compression = infer_compression(path_or_buf, compression)
    593     filepath_or_buffer, _, compression, should_close = get_filepath_or_buffer(
--> 594         path_or_buf, encoding=encoding, compression=compression
    595     )
    596 

~/miniconda3/envs/eltest/lib/python3.7/site-packages/pandas/io/common.py in get_filepath_or_buffer(filepath_or_buffer, encoding, compression, mode, storage_options)
    201         if filepath_or_buffer.startswith("s3n://"):
    202             filepath_or_buffer = filepath_or_buffer.replace("s3n://", "s3://")
--> 203         fsspec = import_optional_dependency("fsspec")
    204 
    205         # If botocore is installed we fallback to reading with anon=True

~/miniconda3/envs/eltest/lib/python3.7/site-packages/pandas/compat/_optional.py in import_optional_dependency(name, extra, raise_on_missing, on_version)
    108     except ImportError:
    109         if raise_on_missing:
--> 110             raise ImportError(msg) from None
    111         else:
    112             return None

ImportError: Missing optional dependency 'fsspec'.  Use pip or conda to install fsspec.

Problem description

The method pd.read_json() is widely used and accepts either a path or a JSON string. Since pandas==1.1.0, passing a string containing JSON data is often misinterpreted as an fsspec URL, which results in an error.

Expected Output

Input json string: [{"0":"this is a string ://"}]
URL? False
fsspec? False
                       0
0  this is a string ://

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : 2a7d3326dee660824a8433ffd01065f8ac37f7d6
python           : 3.7.7.final.0
python-bits      : 64
OS               : Linux
OS-release       : 4.15.0-112-generic
Version          : #113-Ubuntu SMP Thu Jul 9 23:41:39 UTC 2020
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.1.2
numpy            : 1.19.1
pytz             : 2020.1
dateutil         : 2.8.1
pip              : 20.2.2
setuptools       : 49.6.0.post20200814
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : 2.8.5 (dt dec pq3 ext lo64)
jinja2           : 3.0.0a1
IPython          : 7.18.1
pandas_datareader: None
bs4              : None
bottleneck       : None
fsspec           : None
fastparquet      : None
gcsfs            : None
matplotlib       : 3.3.1
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pytables         : None
pyxlsb           : None
s3fs             : None
scipy            : 1.5.2
sqlalchemy       : 1.3.13
tables           : None
tabulate         : 0.8.7
xarray           : None
xlrd             : None
xlwt             : None
numba            : None
tbachlechner added the Bug and Needs Triage labels Sep 10, 2020
tbachlechner added a commit to tbachlechner/pandas that referenced this issue Sep 10, 2020
pd.read_json() fails currently for strings that look like fsspec_url and contain "://". adding another condition to fix this at least in most cases
rhshadrach added the IO JSON label and removed the Needs Triage label Sep 11, 2020
artdgn added a commit to artdgn/movies-notifier that referenced this issue Sep 17, 2020
@Rutrus

Rutrus commented Feb 17, 2021

This problem is a pain when there are URLs in your data. It doesn't happen with big files that have one JSON object per row and are not enclosed in [ ] or {[ ]}, so if the file looks like:

{ row 1 }
{ row 2 }
{ row 3 }

You can do:
pandas.read_json("big_file.json", chunksize=200, lines=True)

@timothek

timothek commented Jun 24, 2021

The issue is still present in pandas 1.2.5. Downgrading to 1.0.5 fixes it, because that version has no fsspec URL handling like the newer versions do.

@steve-mavens

steve-mavens commented Sep 14, 2021

The offending code is in pandas/io/common.py:

def is_fsspec_url(url: FilePathOrBuffer) -> bool:
    """
    Returns true if the given URL looks like
    something fsspec can handle
    """
    return (
        isinstance(url, str)
        and "://" in url
        and not url.startswith(("http://", "https://"))
    )

Compare this with the more conservative:

def is_url(url) -> bool:
    if not isinstance(url, str):
        return False
    return parse_url(url).scheme in _VALID_URLS

parse_url is a synonym for urllib.parse.urlparse, so parse_url('[{"0":"this is a string ://"}]').scheme is the empty string.
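For concreteness, a quick check of those scheme values (the s3 URL below is just an illustrative example):

from urllib.parse import urlparse

print(urlparse('[{"0":"this is a string ://"}]').scheme)  # '' -> no scheme detected
print(urlparse('s3://bucket/data.json').scheme)           # 's3'
print(urlparse('https://example.com/data.json').scheme)   # 'https'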

So perhaps a fix would be to use parse_url in is_fsspec_url. Or, if that ends up rejecting things that fsspec allows, then only exclude inputs where the double-quote character appears in the part before the "://" — the part that is_fsspec_url assumes is a URL scheme fsspec might handle.

However, I don't know why is_fsspec_url was written that way in the first place, so I can't be 100% sure doing that wouldn't break something else (I have no idea the full range of what fsspec permits as a URL scheme). I'm also not (yet) a pandas contributor, so I'm not yet set up to do whatever the contribution guidelines would want me to do before submitting a PR (correct incantations in the commit message, maybe run some tests or something).

@Peterl777

Peterl777 commented Sep 16, 2021

The issue is still present in pandas 1.3.3.

Possible short-term workarounds at #43594

Perhaps a safer change would be to replace

and "://" in url
and not url.startswith(("http://", "https://"))

with

and bool(re.match(r'\w+://', url))
and not url.startswith(("http://", "https://"))

This would cater for s3:// and others I haven't thought of, and address your good idea to "exclude inputs where the double-quote character appears in the part before the ://". But a quick look at fsspec seems to show that it can handle s3, http and many more, so I'm not sure why http and https are excluded.

I was thinking of something like:

and not url.startswith(tuple(_VALID_URLS))

but that would disallow file://, which (apparently) is needed.

Some doctests in is_fsspec_url wouldn't go astray. Perhaps starting with:

    """
    Returns true if the given URL looks like
    something fsspec can handle.

    >>> is_fsspec_url('http://www.example.com/file.ext')
    False
 
    >>> is_fsspec_url('s3://example.com/file.ext')
    True
 
    >>> is_fsspec_url('file://file.ext')
    True
    """

might be a good idea?
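For reference, a self-contained sketch of the regex-based variant discussed above (this is not the actual pandas implementation, just an illustration of the proposed check):

import re

def is_fsspec_url_sketch(url) -> bool:
    # Require a plausible scheme made of word characters directly before "://",
    # so JSON payloads where "://" appears inside a quoted string are rejected.
    return (
        isinstance(url, str)
        and bool(re.match(r"\w+://", url))
        and not url.startswith(("http://", "https://"))
    )

print(is_fsspec_url_sketch('[{"0":"this is a string ://"}]'))   # False -> parsed as JSON data
print(is_fsspec_url_sketch("s3://example.com/file.ext"))        # True
print(is_fsspec_url_sketch("http://www.example.com/file.ext"))  # False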

@steve-mavens

steve-mavens commented Sep 16, 2021

I think _VALID_URLS, as used in is_url, are the ones that pandas handles without needing fsspec. Anything else with :// is passed to fsspec. So perhaps unintuitively, for anything pandas handles via fsspec, is_url is False but is_fsspec_url is True. Probably the names are for historical reasons, but that's why s3:// is excluded from is_url.

Pandas isn't supposed to need to know what schemes fsspec handles, because fsspec might add support for new things in the future and pandas wants to inherit that automatically. So any kind of test in is_fsspec_url based on inclusion in a known list is a non-starter, but for is_url it works because they know what protocols they want to download directly.

There is a PR buried in the history above, #36273, which tried to be clever about this auto-detection, but it was abandoned after taking a couple of different directions. That's the reason I suggested an extremely trivial (non-clever) option, based on the reasoning that if the :// appears inside a JSON string then necessarily there must have been a " earlier, and " definitely isn't valid in a URL scheme. So unless fsspec supports some highly funky and non-standards-compliant URL scheme, anything matching ".*:// is definitely JSON data.
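To illustrate that reasoning, a hypothetical helper (not part of pandas) applying the "no double-quote before ://" rule:

def scheme_part_is_plausible(url: str) -> bool:
    # If a '"' appears before the first '://', the '://' sits inside a JSON
    # string literal, so the prefix cannot be a URL scheme fsspec would handle.
    prefix = url.split("://", 1)[0]
    return '"' not in prefix

print(scheme_part_is_plausible('[{"0":"this is a string ://"}]'))  # False
print(scheme_part_is_plausible("s3://bucket/data.json"))           # True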

@Peterl777

The current is_fsspec_url seems to be True for anything_but_http_or_https://extra. So _VALID_URLS would catch much more.
It might be safer (less likely to have unintended consequences) to just go for "fixing" the

and "://" in url

by making that more restrictive. We could use re:

and bool(re.match(r'\w+://', url))

but there's no re in common.py at the moment and some people might not like it. I like your idea of looking for double-quotes, so a low-impact change might be:

# There's a :// in the URL, and if there's a double-quote, it comes after the protocol
and '://' in url
and ((url.find('"') > url.find('://')) if '"' in url else True)

I also haven't submitted any patches to pandas, but am happy to or to work with you to.
What do you think?

@steve-mavens

We could perhaps submit two PRs, one using re and one without, and then whoever comes along to look at the PRs can express a preference!

For the more precise technique with a regex: '\w' isn't exactly right, because the BNF from RFC 3986 is scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ), where ALPHA and DIGIT refer to ASCII ranges. What we most care about is what fsspec considers to be a viable URL, which probably matches the standard but not necessarily, so I think one of us would need to dig into that and check. I won't be able to spend any time on this until (at the earliest) the week after next, but I'm happy to either do something then, or for you to press on without me.

I think in my own code I'll probably just stick in the StringIO wrapper to avoid the whole issue. It's slightly surprising to me that although the short-form documentation for read_json says it accepts JSON data, in practice it seems like very few people actually pass JSON data into it as a string. At least, not JSON data that contains URLs anywhere.
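For anyone landing here, a minimal sketch of that StringIO workaround (it sidesteps the path/URL detection entirely):

import io
import pandas as pd

input_json = '[{"0":"this is a string ://"}]'
df = pd.read_json(io.StringIO(input_json))  # read from a buffer, not a path/URL
print(df)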

@tbachlechner
Author

I can resurrect a PR; I'd love to see this fixed.

@Peterl777

Peterl777 commented Sep 18, 2021

Yes, the StringIO workaround is the one I'm using.
For the re solution, with @steve-mavens' great input on the RFC, it would need to be:

and bool(re.match(r'[A-Za-z0-9+.\-]+://', url))

(The dash doesn't need escaping, but I like to do it for safety.)
I've got that running on my production code successfully, but obviously needs proper testing.
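For reference, here is how that pattern behaves on the inputs discussed in this thread (note that strict RFC 3986 would additionally require the first character of the scheme to be a letter):

import re

pattern = re.compile(r"[A-Za-z0-9+.\-]+://")

print(bool(pattern.match('[{"0":"this is a string ://"}]')))  # False -> treated as JSON data
print(bool(pattern.match("s3://bucket/data.json")))           # True
print(bool(pattern.match("file://file.ext")))                 # True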

@mntss
Contributor

mntss commented Nov 25, 2021

This issue is impacting us right now. I've opened a PR that checks the scheme according to the RFC mentioned by @steve-mavens.
