Skip to content

Added code to check if file exists for read_json. #29104

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

mohitanand001
Copy link
Contributor

@mohitanand001 mohitanand001 commented Oct 19, 2019

@jbrockmendel
Copy link
Member

Can you add a test for the bug this fixes

@gfyoung
Copy link
Member

gfyoung commented Oct 19, 2019

We should try to make the error messages consistent across the board i.e. align with what you would get if you did read_csv on a non-existent file.

@mohitanand001
Copy link
Contributor Author

mohitanand001 commented Oct 20, 2019

The code snippet I took is from read_csv itself. Related to this shall we also have same error message for read_excel. I think that has something to do with the fact that read excel uses xlrd.

import pandas as pd
pd.read_csv('file_1.csv')
FileNotFoundError: [Errno 2] File b'file_1.csv' does not exist: b'file_1.csv'

pd.read_excel('file_2.xlsx')
FileNotFoundError: [Errno 2] No such file or directory: 'file_2.xlsx'

@gfyoung
Copy link
Member

gfyoung commented Oct 20, 2019

The code snippet I took is from read_csv itself. Related to this shall we also have same error message for read_excel. I think that has something to do with the fact that read excel uses xlrd.

Great! We can look at updating read_excel error message in a separate PR.

@gfyoung gfyoung added Error Reporting Incorrect or improved errors from pandas IO JSON read_json, to_json, json_normalize labels Oct 21, 2019
@krishnakatyal
Copy link

@gfyoung which file is to be changed for read_csv?

@mohitanand001
Copy link
Contributor Author

Can you add a test for the bug this fixes

Sure

@mohitanand001
Copy link
Contributor Author

@datapythonista @gfyoung can we name _stringify_path something else, because the name gives an impression that the function would change the path to a string, while it might not always be the case. In case of string buffers it is returned as it is.

@datapythonista
Copy link
Member

That sounds good if you find a better name, But in a separate PR, let's keep this focussed to its goal.

You need to add the required test, may be add a whatsnew section (not sure if we do for something like this, you can check previous whatsnew files), and fix the broken tests.

Thanks for working on this @farziengineer

@mohitanand001
Copy link
Contributor Author

mohitanand001 commented Oct 24, 2019

The challenge here is, we need to differentiate between a json string passed as input and a file_path. We cannot just say that since a string is not a valid path, raise a FileNotFoundException.

import pandas as pd
#This should work just fine.
df = pd.read_json('{"a": [9, 10, 32], "b": [54, 57, 66]}') 

#This should throw an error.
df = pd.read_json(<invalid_path> )

I cannot find any utility function used in pandas which differentiates json string and a file_path.
Or any function which tells if the string is a valid json, or a valid file_path.

@gfyoung
Copy link
Member

gfyoung commented Oct 24, 2019

The challenge here is, we need to differentiate between a json string passed as input and a file_path.

@farziengineer : How would you actually know what the intent is with absolute certainty if the string is neither valid JSON nor a valid path?

@mohitanand001
Copy link
Contributor Author

mohitanand001 commented Oct 24, 2019

@gfyoung
One approach could be to have a utility function that tests string provided is a json or file_path.
Something like the following.

def check_json_or_filepath(filepath_or_buf):
     try:
         json_loads(filepath_or_buf)
         return 'json'
     except ValueError:
         if os.path.exists(filepath_or_buf):
            return 'filepath'
         else:
           return 'strinio'

Looks a bit hackish though.

@gfyoung
Copy link
Member

gfyoung commented Oct 24, 2019

One approach could be to have a utility function that tests string provided is a json or file_path.
Something like the following.

Hackish + it also doesn't answer my question (notice how you're returning strinio if both checks fail, which assumes the intent of the user was always to pass in JSON).

I would leave this part alone as @datapythonista suggested and just add the test.

@mohitanand001
Copy link
Contributor Author

mohitanand001 commented Oct 24, 2019

@gfyoung @datapythonista Hi, The issue is in the method _get_data_from_filepath(self, filepath_or_buffer). It checks if the filepath exists, it reads the content and returns it. If the filepath does not exist, it treats as if the intent of user was to pass a JSON and returns that value itself as the data. There is no check on when the filepath does not exist and the filepath_or_buffer is not a json, what should be our action.

def _get_data_from_filepath(self, filepath_or_buffer):
"""
The function read_json accepts three input types:
1. filepath (string-like)
2. file-like object (e.g. open file object, StringIO)
3. JSON string
This method turns (1) into (2) to simplify the rest of the processing.
It returns input types (2) and (3) unchanged.
"""
data = filepath_or_buffer

With the current changes I have done in my code, https://github.com/pandas-dev/pandas/pull/29104/files , it would be a problem when a json string is passed, since it just checks if filepath_or_buffer is exists, if it doesn't it raises FileNotFoundError

@gfyoung
Copy link
Member

gfyoung commented Oct 24, 2019

It checks if the filepath exists, it reads the content and returns it. If the filepath does not exist, it treats as if the intent of user was to pass a JSON and returns that value itself as the data

I understand, but what I'm trying to make clear is that you can never know for certain what the user intent was in the case when it fails to be both a valid filepath OR valid JSON. I think we're getting way too much in the weeds here on something that, in the interests of 100% correctness, is really going to be a dead end.

change error type in read_json tests
@WillAyd
Copy link
Member

WillAyd commented Nov 8, 2019

I agree with @gfyoung here I don't think it's really possible (or rather worthwhile) to infer intent, unless there is something pathlib may offer

@jbrockmendel
Copy link
Member

@WillAyd is there a way forward on this?

@WillAyd
Copy link
Member

WillAyd commented Nov 29, 2019

I'm not strongly opposed but don't think it's worth a lot of effort here to try and infer intent. @datapythonista had the original issue though and @gfyoung has some input so let's see what they think

@datapythonista
Copy link
Member

Didn't realize we couldn't know whether the user provided a json string or a path when I opened the issue. I guess it makes sense to leave the code as it is. Otherwise we need to make assumptions on the user intent.

Unless it's easy to capture the current error, and get something like:

ValueError: No such file 'no_file.json'. Tried to parse as a JSON string, but got: Unexpected character found when decoding 'null'

ValueError: No such file '{"foo": 1, "bar": ...'. Tried to parse as a JSON string, but got: Unexpected character found when decoding 'null'

@WillAyd
Copy link
Member

WillAyd commented Dec 10, 2019

I guess based on feedback worth closing for now, but can reopen at a later date if new ideas come up

In any case thanks for the PR @farziengineer !

@WillAyd WillAyd closed this Dec 10, 2019
@vampypandya
Copy link
Contributor

@farziengineer How about using regex to check intent?

@janosh
Copy link
Contributor

janosh commented Apr 8, 2022

What's wrong with assuming file path intent

if isinstance(filepath_or_buffer, str) and filepath_or_buffer.lower().endswith(('.json', '.json.gz', '.json.bz2', '.json.whatever')):

I must be missing something but it seems safe to me to issue FileNotFoundError if not isfile(filepath_or_buffer) in case the above if passes. Would definitely be better behavior than current v1.4.2 behavior:

pd.read_json('missing.json')
>>> ValueError: Unexpected character found when decoding 'true'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Error Reporting Incorrect or improved errors from pandas IO JSON read_json, to_json, json_normalize
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Misleading error messages when opening inexistent json file
9 participants