BUG: `pd.read_json` throwing error on `bytes` input #46935

galipremsagar · 2022-05-03T22:57:58Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

In [1]: buffer = (
   ...:         b'{"amount": 100, "name": "Alice"}\n'
   ...:         b'{"amount": 200, "name": "Bob"}\n'
   ...:         b'{"amount": 300, "name": "Charlie"}\n'
   ...:         b'{"amount": 400, "name": "Dennis"}\n'
   ...:     )

In [2]: import pandas as pd

In [3]: pd.read_json(buffer)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [3], in <cell line: 1>()
----> 1 pd.read_json(buffer)

File /nvme/0/pgali/envs/cudfdev/lib/python3.8/site-packages/pandas/util/_decorators.py:207, in deprecate_kwarg.<locals>._deprecate_kwarg.<locals>.wrapper(*args, **kwargs)
    205     else:
    206         kwargs[new_arg_name] = new_arg_value
--> 207 return func(*args, **kwargs)

File /nvme/0/pgali/envs/cudfdev/lib/python3.8/site-packages/pandas/util/_decorators.py:311, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    305 if len(args) > num_allow_args:
    306     warnings.warn(
    307         msg.format(arguments=arguments),
    308         FutureWarning,
    309         stacklevel=stacklevel,
    310     )
--> 311 return func(*args, **kwargs)

File /nvme/0/pgali/envs/cudfdev/lib/python3.8/site-packages/pandas/io/json/_json.py:588, in read_json(path_or_buf, orient, typ, dtype, convert_axes, convert_dates, keep_default_dates, numpy, precise_float, date_unit, encoding, encoding_errors, lines, chunksize, compression, nrows, storage_options)
    585 if convert_axes is None and orient != "table":
    586     convert_axes = True
--> 588 json_reader = JsonReader(
    589     path_or_buf,
    590     orient=orient,
    591     typ=typ,
    592     dtype=dtype,
    593     convert_axes=convert_axes,
    594     convert_dates=convert_dates,
    595     keep_default_dates=keep_default_dates,
    596     numpy=numpy,
    597     precise_float=precise_float,
    598     date_unit=date_unit,
    599     encoding=encoding,
    600     lines=lines,
    601     chunksize=chunksize,
    602     compression=compression,
    603     nrows=nrows,
    604     storage_options=storage_options,
    605     encoding_errors=encoding_errors,
    606 )
    608 if chunksize:
    609     return json_reader

File /nvme/0/pgali/envs/cudfdev/lib/python3.8/site-packages/pandas/io/json/_json.py:673, in JsonReader.__init__(self, filepath_or_buffer, orient, typ, dtype, convert_axes, convert_dates, keep_default_dates, numpy, precise_float, date_unit, encoding, lines, chunksize, compression, nrows, storage_options, encoding_errors)
    670     if not self.lines:
    671         raise ValueError("nrows can only be passed if lines=True")
--> 673 data = self._get_data_from_filepath(filepath_or_buffer)
    674 self.data = self._preprocess_data(data)

File /nvme/0/pgali/envs/cudfdev/lib/python3.8/site-packages/pandas/io/json/_json.py:710, in JsonReader._get_data_from_filepath(self, filepath_or_buffer)
    703 filepath_or_buffer = stringify_path(filepath_or_buffer)
    704 if (
    705     not isinstance(filepath_or_buffer, str)
    706     or is_url(filepath_or_buffer)
    707     or is_fsspec_url(filepath_or_buffer)
    708     or file_exists(filepath_or_buffer)
    709 ):
--> 710     self.handles = get_handle(
    711         filepath_or_buffer,
    712         "r",
    713         encoding=self.encoding,
    714         compression=self.compression,
    715         storage_options=self.storage_options,
    716         errors=self.encoding_errors,
    717     )
    718     filepath_or_buffer = self.handles.handle
    720 return filepath_or_buffer

File /nvme/0/pgali/envs/cudfdev/lib/python3.8/site-packages/pandas/io/common.py:826, in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
    821     is_wrapped = not (
    822         isinstance(ioargs.filepath_or_buffer, str) or ioargs.should_close
    823     )
    825 if "r" in ioargs.mode and not hasattr(handle, "read"):
--> 826     raise TypeError(
    827         "Expected file path name or file-like object, "
    828         f"got {type(ioargs.filepath_or_buffer)} type"
    829     )
    831 handles.reverse()  # close the most recently added buffer first
    832 if ioargs.should_close:

TypeError: Expected file path name or file-like object, got <class 'bytes'> type

Issue Description

When `bytes` input was passed to `pd.read_json`, versions up to 1.3.x parsed it and returned a DataFrame. Version 1.4.2 raises a `TypeError` instead.
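
The last two frames of the traceback suggest why: any input that is not a `str` is handed to `get_handle`, which in read mode requires an object with a `read` method. The sketch below is a simplified, hypothetical stand-in for that check (`resolve_input` is not a pandas function, and the real code does much more), shown only to illustrate why raw `bytes` are rejected while a file-like wrapper is not.

```python
import io


def resolve_input(path_or_buf):
    """Hypothetical, simplified stand-in for the check seen in the traceback."""
    if isinstance(path_or_buf, str):
        # A str is treated as a path, URL, or literal JSON; returned unchanged here.
        return path_or_buf
    if not hasattr(path_or_buf, "read"):
        # Raw bytes have no .read(), so they end up on this branch.
        raise TypeError(
            "Expected file path name or file-like object, "
            f"got {type(path_or_buf)} type"
        )
    return path_or_buf


raw = b'{"amount": 100, "name": "Alice"}\n'

message = None
try:
    resolve_input(raw)
except TypeError as exc:
    message = str(exc)

# Wrapping the same bytes in a file-like object passes the check.
handle = resolve_input(io.BytesIO(raw))
```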

Expected Behavior

The same as in previous versions? Or is this a breaking change that was not explicitly called out in the changelog?

Out[11]: 
   amount     name
0     100    Alice
1     200      Bob
2     300  Charlie
3     400   Dennis

Installed Versions

In [4]: pd.show_versions()

AssertionError Traceback (most recent call last)
Input In [4], in <cell line: 1>()
----> 1 pd.show_versions()

File /nvme/0/pgali/envs/cudfdev/lib/python3.8/site-packages/pandas/util/_print_versions.py:109, in show_versions(as_json)
94 """
95 Provide useful information, important for bug reports.
96
(...)
106 * If True, outputs info in JSON format to the console.
107 """
108 sys_info = _get_sys_info()
--> 109 deps = _get_dependency_info()
111 if as_json:
112 j = {"system": sys_info, "dependencies": deps}

File /nvme/0/pgali/envs/cudfdev/lib/python3.8/site-packages/pandas/util/_print_versions.py:88, in _get_dependency_info()
86 result: dict[str, JSONSerializable] = {}
87 for modname in deps:
---> 88 mod = import_optional_dependency(modname, errors="ignore")
89 result[modname] = get_version(mod) if mod else None
90 return result

File /nvme/0/pgali/envs/cudfdev/lib/python3.8/site-packages/pandas/compat/_optional.py:138, in import_optional_dependency(name, extra, errors, min_version)
133 msg = (
134 f"Missing optional dependency '{install_name}'. {extra} "
135 f"Use pip or conda to install {install_name}."
136 )
137 try:
--> 138 module = importlib.import_module(name)
139 except ImportError:
140 if errors == "raise":

File /nvme/0/pgali/envs/cudfdev/lib/python3.8/importlib/__init__.py:127, in import_module(name, package)
125 break
126 level += 1
--> 127 return _bootstrap._gcd_import(name[level:], package, level)

File :1014, in _gcd_import(name, package, level)

File :991, in _find_and_load(name, import_)

File :975, in _find_and_load_unlocked(name, import_)

File :671, in _load_unlocked(spec)

File :843, in exec_module(self, module)

File :219, in _call_with_frames_removed(f, *args, **kwds)

File /nvme/0/pgali/envs/cudfdev/lib/python3.8/site-packages/setuptools/__init__.py:8, in <module>
5 import re
6 import warnings
----> 8 import _distutils_hack.override # noqa: F401
10 import distutils.core
11 from distutils.errors import DistutilsOptionError

File /nvme/0/pgali/envs/cudfdev/lib/python3.8/site-packages/_distutils_hack/override.py:1, in <module>
----> 1 __import__('_distutils_hack').do_override()

File /nvme/0/pgali/envs/cudfdev/lib/python3.8/site-packages/_distutils_hack/__init__.py:72, in do_override()
70 if enabled():
71 warn_distutils_present()
---> 72 ensure_local_distutils()

File /nvme/0/pgali/envs/cudfdev/lib/python3.8/site-packages/_distutils_hack/__init__.py:59, in ensure_local_distutils()
57 # check that submodules load as expected
58 core = importlib.import_module('distutils.core')
---> 59 assert '_distutils' in core.__file__, core.__file__
60 assert 'setuptools._distutils.log' not in sys.modules

AssertionError: /nvme/0/pgali/envs/cudfdev/lib/python3.8/distutils/core.py

simonjayhawkins · 2022-05-04T09:06:58Z

Thanks @galipremsagar for the report.

This may have happened to work previously, but off the top of my head I'm not sure how much support pandas still has, now that we have dropped Python 2, for accepting bytes where str is the documented accepted type.

Will label as a regression for now, pending further investigation.

first bad commit: [5f36af3] MAINT: rename IOError -> OSError (#43366)

cc @mwtoews

mwtoews · 2022-05-04T09:42:12Z

The documentation for pandas.read_json describes the first argument:

path_or_buf : a valid JSON str, path object or file-like object

so raising TypeError for a <class 'bytes'> type should be the appropriate outcome.

(One exception is pandas.read_excel, which accepts bytes as the binary content of an Excel file)

Nevertheless, the contents of `buffer` would need to be converted to valid JSON somehow. With any recent version, pd.read_json(buffer.decode()) seems to only raise ValueError: Trailing data. Valid JSON would start with [, end with ], and join the lines with a comma:

[{"amount": 100, "name": "Alice"},
 {"amount": 200, "name": "Bob"},
 {"amount": 300, "name": "Charlie"},
 {"amount": 400, "name": "Dennis"}]
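
One way to get from the original bytes to an array like the one above is sketched here with the standard-library `json` module. It assumes the payload is UTF-8 and holds exactly one JSON document per line; `records` and `json_array` are illustrative names, not anything from pandas.

```python
import json

buffer = (
    b'{"amount": 100, "name": "Alice"}\n'
    b'{"amount": 200, "name": "Bob"}\n'
    b'{"amount": 300, "name": "Charlie"}\n'
    b'{"amount": 400, "name": "Dennis"}\n'
)

# Parse each non-empty line as its own JSON document (UTF-8 is an assumption).
records = [
    json.loads(line)
    for line in buffer.decode("utf-8").splitlines()
    if line.strip()
]

# Serialize the list back out as a single JSON array: "[{...}, {...}, ...]".
json_array = json.dumps(records)
print(json_array)
```

The resulting string is a single valid JSON document, so it no longer has trailing data after the first object.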

mwtoews · 2022-05-04T09:53:20Z

An alternative solution using the original "buffer" input, splitting on newlines:

parts = [pd.read_json(itm, orient="index") for itm in buffer.decode().split("\n") if itm]
df = pd.concat(parts, axis=1).T.reset_index(drop=True)
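
Another option, not raised in the thread, may be the `lines=True` parameter of `pd.read_json`, which reads the input as one JSON object per line. The sketch below assumes the payload is UTF-8 and wraps the decoded text in `io.StringIO` so that `read_json` receives a file-like object, which is one of the documented input types.

```python
import io

import pandas as pd

buffer = (
    b'{"amount": 100, "name": "Alice"}\n'
    b'{"amount": 200, "name": "Bob"}\n'
    b'{"amount": 300, "name": "Charlie"}\n'
    b'{"amount": 400, "name": "Dennis"}\n'
)

# Decode explicitly (UTF-8 is an assumption about the payload), then wrap the
# text in StringIO so read_json gets a file-like object rather than raw bytes.
text = buffer.decode("utf-8")
df = pd.read_json(io.StringIO(text), lines=True)
print(df)
```

This avoids both the per-line `read_json` calls and the manual concatenation, at the cost of assuming every line is a complete JSON object.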

twoertwein · 2022-05-22T16:24:18Z

Potential duplicate of #45935

simonjayhawkins · 2022-05-23T15:18:23Z

> Potential duplicate of #45935

Agreed. The comments in #45935 do not account for this being a regression (an undocumented behavior change), and the behavior should not have been changed without a deprecation. One could equally argue that this was not a breaking API change, since the documentation does not list bytes as an accepted type.

Will close as no action for the same reasons given in #45935, but feel free to comment here if you strongly disagree.

galipremsagar added the Bug and Needs Triage labels May 3, 2022

simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue May 4, 2022

code sample for pandas-dev#46935

0e793a8

simonjayhawkins added the Regression and IO JSON labels and removed the Needs Triage label May 4, 2022

simonjayhawkins added this to the 1.4.3 milestone May 4, 2022

simonjayhawkins closed this as completed May 23, 2022

simonjayhawkins modified the milestones: 1.4.3, No action May 23, 2022

joshzarrabi added a commit to joshzarrabi/health-equity-tracker that referenced this issue Jul 6, 2022

upgrade gcs_to_bq_util to work with pandas 1.4.x

535ac6a

- pandas.read_json needs to take in a decoded string - see pandas-dev/pandas#46935

joshzarrabi mentioned this issue Jul 6, 2022

upgrade gcs_to_bq_util to work with pandas 1.4.x SatcherInstitute/health-equity-tracker#1652

Merged

joshzarrabi added a commit to SatcherInstitute/health-equity-tracker that referenced this issue Jul 6, 2022

upgrade gcs_to_bq_util to work with pandas 1.4.x (#1652)

5d0b2fb

- pandas.read_json needs to take in a decoded string - see pandas-dev/pandas#46935

ebuddenberg added a commit to ebuddenberg/health-equity-tracker that referenced this issue Aug 21, 2023

upgrade gcs_to_bq_util to work with pandas 1.4.x (#1652)

1802305

- pandas.read_json needs to take in a decoded string - see pandas-dev/pandas#46935
