Skip to content

Fix issue #36271 to disambiguate json string #36273

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 10 commits into from

Conversation

tbachlechner
Copy link

@tbachlechner tbachlechner commented Sep 10, 2020

pd.read_json() fails currently for strings that look like fsspec_url and contain "://". adding another condition to fix this at least in most cases

pd.read_json() fails currently for strings that look like fsspec_url and contain "://". adding another condition to fix this at least in most cases
@pep8speaks
Copy link

pep8speaks commented Sep 10, 2020

Hello @tbachlechner! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-11-11 15:02:53 UTC

@WillAyd
Copy link
Member

WillAyd commented Sep 10, 2020

Does this close an issue? Can you add a test?

@WillAyd WillAyd added the IO JSON read_json, to_json, json_normalize label Sep 10, 2020
@@ -161,6 +161,7 @@ def is_fsspec_url(url: FilePathOrBuffer) -> bool:
return (
isinstance(url, str)
and "://" in url
and not " " in url
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A space is allowed in URLs for at least some storage systems

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, any other ideas how to fix #36271 ? All my json strings with URLs in them receive errors and I downgraded to pandas==1.0.5

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, a space isn't necessary for valid JSON either. Perhaps the only way to tell is to try to parse the string?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the PR to reflect what you say, I'm now trying to parse the string as json and if it succeeds, it's not a URL. How about that?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds reasonable to me - I don't expect any URL is also valid JSON. You could conceivable make a regex to check for JSON-ness ("^\s*[\[{]" maybe?), but probably calling JSON decode is faster and more correct.

(I simply didn't know that you can pass a string directly to read_json, as opposed to a StringIO file-like)

@@ -161,6 +176,7 @@ def is_fsspec_url(url: FilePathOrBuffer) -> bool:
return (
isinstance(url, str)
and "://" in url
and not is_json(url)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For pandas maintainers: should this check only be done within read_json (which will presumably call json.loads anyway, repeating the work) ? Are there any other similar cases that the string passed might be handled as a path, or decoded directly, that I might have missed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

read_json doesn't call json.loads

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tbachlechner if you really think t his out to be caught then i suppose a 'better' check instead of and "://" in url could be done (e.g. a regex)

@@ -161,6 +176,7 @@ def is_fsspec_url(url: FilePathOrBuffer) -> bool:
return (
isinstance(url, str)
and "://" in url
and not is_json(url)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

read_json doesn't call json.loads

def is_json(url: FilePathOrBuffer) -> bool:
"""
Returns true if the given string looks like
something json.loads can handle
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we cannot do this generally as its completely non-performant

@@ -161,6 +176,7 @@ def is_fsspec_url(url: FilePathOrBuffer) -> bool:
return (
isinstance(url, str)
and "://" in url
and not is_json(url)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tbachlechner if you really think t his out to be caught then i suppose a 'better' check instead of and "://" in url could be done (e.g. a regex)

@tbachlechner
Copy link
Author

@jreback yes, this ought to be caught, pd.read_json() is claimed to accept a json string, but it fails for any json string containing ://- which is a lot of json strings. For now we pinned the version pandas==1.0.5. I'm busy shipping products, but I'll see if I get to a fix at some point. The underlying issue though is that is_fsspec_url() should not be returning True when it is not a URL, removing json is just a hack that probably does not fix other problems that likely exist.

@martindurant
Copy link
Contributor

I agree that is_fsspec_url() should be better, and clearly is failing for this JSON case.
Does the regex I suggested earlier ( "^\s*[\[{]" ) not suffice?

@jbrockmendel
Copy link
Member

mypy complaint

@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Oct 24, 2020
@twoertwein
Copy link
Member

twoertwein commented Nov 6, 2020

are there opinions about moving read_json(json_string) into a function like parse_json(json_string) for an API-breaking 1.2+ release (are there more read_ functions that would need read_ vs parse_)? Something like:

def parse_json(json:str, **kwargs) -> DataFrame:
    buffer = StringIO(json)
    data = read_json(buffer, **kwargs)
    buffer.close()
    return data

and a deprecation warning for 1.2 when passing a json string to read_json.

edit:
read_excel has the same 'issue', it accepts bytes.

@jbrockmendel
Copy link
Member

are there opinions about moving read_json(json_string) into a function like parse_json(json_string)

IIRC there was a discussion at some point of making a read_jsons mirroring the stdlib's json.load vs json.loads (or maybe the discussion was about pickle?)

@jreback
Copy link
Contributor

jreback commented Nov 6, 2020

it wouldn't be unreasonable to deprecate passing a str (that is not a path) to read_josh and instead only accept a StringIO

i don't think this is much of any issue for pickle as it needs bytes and so str always means a file

excel i think is the same

only read_josh accepts str as an actual transmission format (read_csv as well of course is similar but i think we are restrictive there already)

if json_pattern.match(url):
return True
else:
return False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can be simplified return json_pattern.match(url) is not None

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, I committed your suggestion

Returns true if the given string looks like
json
"""
json_pattern = re.compile(r"^\s*[\[{]")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably need something like the following for typing

if not isinstance(url, str):
    return False

The pre-commit is complaining about too many new lines.

@jreback
Copy link
Contributor

jreback commented Dec 29, 2020

closing as stale. if you would like to continue, pls ping.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO JSON read_json, to_json, json_normalize Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: since pandas==1.1.0 pd.read_json() fails for strings that look similar to fsspec_url
7 participants