read_json different behaviour when file is os path or file url #27135
Comments
URLs get read in as binary data, which conflicts with readlines. I'm not sure of the best solution off the top of my head, but you can see the read_json definition in pandas.io.json.json. If you'd like to take a look and submit a PR, that would certainly be welcome.
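The conflict can be illustrated without pandas. The sketch below is not the pandas code path; it only assumes that lines read from a URL arrive as bytes while lines read from an OS path in text mode arrive as str, and shows that joining the bytes lines with a str separator raises the error quoted in this issue. The file name and contents are made up for the illustration.

```python
import tempfile
from pathlib import Path
from urllib.request import urlopen

# Hypothetical two-line JSON file, written as bytes so newlines are not translated.
tmp = Path(tempfile.mkdtemp()) / "two_lines.json"
tmp.write_bytes(b'{"a": 1}\n{"a": 2}\n')

# Text-mode open on an OS path yields str lines.
with open(tmp, encoding="utf-8", newline="") as fh:
    text_lines = fh.readlines()

# urlopen on the equivalent file URL yields bytes lines.
with urlopen(tmp.as_uri()) as resp:
    byte_lines = resp.readlines()

joined_text = "".join(text_lines)

# Joining bytes with a str separator fails.
try:
    "".join(byte_lines)
    error = None
except TypeError as exc:
    error = str(exc)

# One possible way out: decode each line before joining.
decoded = "".join(line.decode("utf-8") for line in byte_lines)

print(error)
print(decoded == joined_text)
```

Decoding per line is only one option; wrapping the binary handle in a text wrapper before reading lines would be another.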
I'd be honoured and happy to contribute, but it'd be my first contribution, so I'm not sure where to start. For example, is it an invariant that URLs get read in as binary data, so that I should concentrate on getting read_json to cope with binary data?
Code Sample, a copy-pastable example if possible
Problem description
Create a very simple two-line JSON file. Specify the location of the file in two ways: using an OS path, and using a file URL. Both allow read_json() to read the JSON when it is read in one go. When chunksize is used to create a reader, only the OS path specifier works. Using the file URL specifier produces a TypeError: sequence item 0: expected str instance, bytes found
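The steps above can be sketched roughly as follows. This is not the reporter's original sample; the file name, contents, and chunksize value are assumptions, and the URL is built with Path.as_uri(), which produces a file URL with an empty host rather than localhost. On pandas 0.24.2 the chunked read via the file URL is the one reported to fail; later versions may behave differently, so the sketch records the outcome instead of assuming it.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical two-line JSON file, one object per line.
path = Path(tempfile.mkdtemp()) / "table.json"
path.write_bytes(b'{"a": 1, "b": 2}\n{"a": 3, "b": 4}\n')

os_path = str(path)
file_url = path.as_uri()  # file URL form of the same location

# Reading in one go: reported to work for both specifiers.
from_path = pd.read_json(os_path, lines=True)
from_url = pd.read_json(file_url, lines=True)

# Chunked read via the OS path: reported to work.
chunks_from_path = pd.concat(
    [chunk for chunk in pd.read_json(os_path, lines=True, chunksize=1)]
)

# Chunked read via the file URL: reported to raise TypeError on 0.24.2.
try:
    chunks_from_url = pd.concat(
        [chunk for chunk in pd.read_json(file_url, lines=True, chunksize=1)]
    )
    url_error = None
except TypeError as exc:
    chunks_from_url = None
    url_error = str(exc)

print(pd.__version__, "chunked file URL read error:", url_error)
```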
The read_json doc says:
path_or_buf : a valid JSON string or file-like, default: None
The string could be a URL. Valid URL schemes include http, ftp, s3,
gcs, and file. For file URLs, a host is expected. For instance, a local
file could be
file://localhost/path/to/table.json
This leads to the expectation that a file URL should behave the same as a simple OS path in all use cases.
I think this is the root cause of the problem described under issue #27022
[ins] In [83]: pd.__version__
Out[83]: '0.24.2'
Expected Output
Behaviour of read_json to be the same regardless of the type of file-like.
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 5.0.0-20-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8
pandas: 0.24.2
pytest: None
pip: 19.1.1
setuptools: 41.0.1
Cython: None
numpy: 1.16.4
scipy: None
pyarrow: None
xarray: None
IPython: 7.5.0
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.1.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10.1
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None