read_json different behaviour when file is os path or file url #27135

Closed
gtmaskall opened this issue Jun 29, 2019 · 2 comments · Fixed by #34811
Labels
Bug; IO JSON (read_json, to_json, json_normalize)

@gtmaskall

Code Sample, a copy-pastable example if possible

pwd
/home/guy/tmp
echo -e '{"a": 1, "b": 2}\n{"a": 3, "b": 4}' > test.json
cat test.json 
{"a": 1, "b": 2}
{"a": 3, "b": 4}
[ins] In [74]: ospath = '/home/guy/tmp/test.json'                                                                       

[ins] In [75]: fileurl = 'file://localhost/home/guy/tmp/test.json'                                                      

[nav] In [76]: import pandas as pd                                                                                      

[ins] In [77]: pd.read_json(ospath, lines=True)                                                                         
Out[77]: 
   a  b
0  1  2
1  3  4

[ins] In [78]: pd.read_json(fileurl, lines=True)                                                                        
Out[78]: 
   a  b
0  1  2
1  3  4

[ins] In [79]: reader = pd.read_json(ospath, lines=True, chunksize=1)                                                   

[ins] In [80]: for chunk in reader: 
          ...:     print(chunk) 
          ...:                                                                                                          
   a  b
0  1  2
   a  b
1  3  4

[ins] In [81]: reader = pd.read_json(fileurl, lines=True, chunksize=1)

Problem description

Create a very simple two-line JSON file. Specify the location of the file in two ways: using an OS path and using a file URL. Both allow read_json() to read the JSON when the file is read in one go. When chunksize is used to create a reader, only the OS path works. Iterating over the reader created from the file URL raises TypeError: sequence item 0: expected str instance, bytes found

The read_json doc says:
path_or_buf : a valid JSON string or file-like, default: None
The string could be a URL. Valid URL schemes include http, ftp, s3,
gcs, and file. For file URLs, a host is expected. For instance, a local
file could be file://localhost/path/to/table.json

This leads to the expectation that a file URL should behave the same as a simple OS path under all use cases.
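
Until the two forms behave the same, one possible workaround is to convert the file URL to an OS path before calling read_json. The sketch below uses only the standard library; file_url_to_path is a hypothetical helper name, not part of pandas, and the result shown in the comment assumes a POSIX system.

```python
from urllib.parse import urlparse
from urllib.request import url2pathname


def file_url_to_path(url):
    """Convert a file:// URL to a local OS path (hypothetical helper)."""
    parsed = urlparse(url)
    if parsed.scheme != "file":
        raise ValueError(f"not a file URL: {url!r}")
    # The host part ("localhost" or empty) is dropped; only the path is kept.
    return url2pathname(parsed.path)


ospath_from_url = file_url_to_path("file://localhost/home/guy/tmp/test.json")
# On POSIX this gives '/home/guy/tmp/test.json', which could then be passed
# to read_json, e.g. pd.read_json(ospath_from_url, lines=True, chunksize=1)
```

This only sidesteps the problem for local files; it does nothing for http, ftp, s3 or gcs URLs.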

I think this is the root cause of the problem described under issue #27022

# output when using fileurl:
TypeError                                 Traceback (most recent call last)
<ipython-input-82-605ab8a466fd> in <module>
----> 1 for chunk in reader:
      2     print(chunk)
      3 

~/anaconda3/envs/test_latest_pandas_json/lib/python3.7/site-packages/pandas/io/json/json.py in __next__(self)
    579         lines = list(islice(self.data, self.chunksize))
    580         if lines:
--> 581             lines_json = self._combine_lines(lines)
    582             obj = self._get_object_parser(lines_json)
    583 

~/anaconda3/envs/test_latest_pandas_json/lib/python3.7/site-packages/pandas/io/json/json.py in _combine_lines(self, lines)
    520         """
    521         lines = filter(None, map(lambda x: x.strip(), lines))
--> 522         return '[' + ','.join(lines) + ']'
    523 
    524     def read(self):

TypeError: sequence item 0: expected str instance, bytes found

[ins] In [83]: pd.__version__
Out[83]: '0.24.2'
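
The failing line in the traceback can be reproduced without pandas. The sketch below copies the two lines of _combine_lines shown in the traceback into a standalone function (combine_lines is a stand-in name for illustration) and feeds it first str lines, then bytes lines.

```python
def combine_lines(lines):
    """Mimic the _combine_lines logic shown in the traceback above."""
    lines = filter(None, map(lambda x: x.strip(), lines))
    return "[" + ",".join(lines) + "]"


text_lines = ['{"a": 1, "b": 2}\n', '{"a": 3, "b": 4}\n']
byte_lines = [line.encode("utf-8") for line in text_lines]

# str lines join cleanly into a JSON array string.
print(combine_lines(text_lines))

# bytes lines cannot be joined with a str separator.
try:
    combine_lines(byte_lines)
except TypeError as exc:
    error_message = str(exc)
    print(error_message)
```

The second call fails with the same "expected str instance, bytes found" message, which suggests the reader built from the file URL is yielding bytes rather than str.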

Expected Output

read_json should behave the same whether the file is specified as an OS path or as a file URL.

Output of pd.show_versions()


INSTALLED VERSIONS

commit: None
python: 3.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 5.0.0-20-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8

pandas: 0.24.2
pytest: None
pip: 19.1.1
setuptools: 41.0.1
Cython: None
numpy: 1.16.4
scipy: None
pyarrow: None
xarray: None
IPython: 7.5.0
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.1.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10.1
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@WillAyd
Member

WillAyd commented Jun 30, 2019

URLs get read in as binary data, which conflicts with readlines. I'm not sure of the best solution off the top of my head, but you can see the read_json definition in pandas.io.json.json, so if you'd like to take a look and submit a PR, that would certainly be welcome.
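
To illustrate the difference between the two code paths, here is a small standard-library sketch. It assumes that fetching a file URL behaves like urllib.request.urlopen, which returns bytes; the exact mechanism inside pandas may differ. It writes the two-line file to a temporary directory and reads it both ways.

```python
import tempfile
from pathlib import Path
from urllib.request import urlopen

with tempfile.TemporaryDirectory() as tmp:
    json_path = Path(tmp) / "test.json"
    json_path.write_text('{"a": 1, "b": 2}\n{"a": 3, "b": 4}\n', encoding="utf-8")
    file_url = json_path.as_uri()  # a file:// URL for the same file

    # Opening the OS path in text mode yields str lines.
    with open(json_path, encoding="utf-8") as handle:
        path_lines = handle.readlines()

    # Opening the file URL yields bytes lines.
    with urlopen(file_url) as response:
        url_lines = response.readlines()

print(type(path_lines[0]).__name__)
print(type(url_lines[0]).__name__)
```

The content is identical in both cases; only the type of each line differs, and that is what the line-joining code trips over.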

@WillAyd WillAyd added the IO Data (IO issues that don't fit into a more specific label) label Jun 30, 2019
@WillAyd WillAyd added this to the Contributions Welcome milestone Jun 30, 2019
@gtmaskall
Author

I'd be honoured and happy to contribute, but it'd be my first contribution so I'm not sure where to start. For example, is it an invariant that URLs get read in as binary data, so I should concentrate on trying to get read_json to cope with binary data?
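
One direction this question points to is making the line-combining step tolerate bytes. The sketch below is only an illustration under that assumption: combine_lines_tolerant and its encoding parameter are hypothetical names, and this is not necessarily how the issue was eventually fixed in pandas.

```python
import json


def combine_lines_tolerant(lines, encoding="utf-8"):
    """Join JSON lines into one JSON array string, accepting str or bytes."""
    # Decode any bytes lines so that everything downstream is str.
    decoded = (
        line.decode(encoding) if isinstance(line, bytes) else line
        for line in lines
    )
    # Drop blank lines, as the original filter(None, ...) call does.
    stripped = filter(None, (line.strip() for line in decoded))
    return "[" + ",".join(stripped) + "]"


mixed_lines = [b'{"a": 1, "b": 2}\n', b"\n", '{"a": 3, "b": 4}\n']
records = json.loads(combine_lines_tolerant(mixed_lines))
print(records)
```

An alternative would be to decode once, where the URL is opened, so that every later step only ever sees str.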

@jbrockmendel jbrockmendel added the IO JSON (read_json, to_json, json_normalize) label and removed the IO Data (IO issues that don't fit into a more specific label) label Dec 1, 2019
@mroeschke mroeschke added the Bug label May 8, 2020
fangchenli added a commit to fangchenli/pandas that referenced this issue Jun 15, 2020
fangchenli added a commit to fangchenli/pandas that referenced this issue Jun 16, 2020
fangchenli added a commit to fangchenli/pandas that referenced this issue Jun 17, 2020
@jreback jreback modified the milestones: Contributions Welcome, 1.1 Jun 25, 2020
fangchenli added a commit to fangchenli/pandas that referenced this issue Jun 27, 2020