
BUG: pd.read_json throwing error on bytes input #46935


Closed
2 of 3 tasks
galipremsagar opened this issue May 3, 2022 · 5 comments
Labels
Bug IO JSON read_json, to_json, json_normalize Regression Functionality that used to work in a prior pandas version

Comments

@galipremsagar

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

In [1]: buffer = (
   ...:         b'{"amount": 100, "name": "Alice"}\n'
   ...:         b'{"amount": 200, "name": "Bob"}\n'
   ...:         b'{"amount": 300, "name": "Charlie"}\n'
   ...:         b'{"amount": 400, "name": "Dennis"}\n'
   ...:     )

In [2]: import pandas as pd

In [3]: pd.read_json(buffer)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [3], in <cell line: 1>()
----> 1 pd.read_json(buffer)

File /nvme/0/pgali/envs/cudfdev/lib/python3.8/site-packages/pandas/util/_decorators.py:207, in deprecate_kwarg.<locals>._deprecate_kwarg.<locals>.wrapper(*args, **kwargs)
    205     else:
    206         kwargs[new_arg_name] = new_arg_value
--> 207 return func(*args, **kwargs)

File /nvme/0/pgali/envs/cudfdev/lib/python3.8/site-packages/pandas/util/_decorators.py:311, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    305 if len(args) > num_allow_args:
    306     warnings.warn(
    307         msg.format(arguments=arguments),
    308         FutureWarning,
    309         stacklevel=stacklevel,
    310     )
--> 311 return func(*args, **kwargs)

File /nvme/0/pgali/envs/cudfdev/lib/python3.8/site-packages/pandas/io/json/_json.py:588, in read_json(path_or_buf, orient, typ, dtype, convert_axes, convert_dates, keep_default_dates, numpy, precise_float, date_unit, encoding, encoding_errors, lines, chunksize, compression, nrows, storage_options)
    585 if convert_axes is None and orient != "table":
    586     convert_axes = True
--> 588 json_reader = JsonReader(
    589     path_or_buf,
    590     orient=orient,
    591     typ=typ,
    592     dtype=dtype,
    593     convert_axes=convert_axes,
    594     convert_dates=convert_dates,
    595     keep_default_dates=keep_default_dates,
    596     numpy=numpy,
    597     precise_float=precise_float,
    598     date_unit=date_unit,
    599     encoding=encoding,
    600     lines=lines,
    601     chunksize=chunksize,
    602     compression=compression,
    603     nrows=nrows,
    604     storage_options=storage_options,
    605     encoding_errors=encoding_errors,
    606 )
    608 if chunksize:
    609     return json_reader

File /nvme/0/pgali/envs/cudfdev/lib/python3.8/site-packages/pandas/io/json/_json.py:673, in JsonReader.__init__(self, filepath_or_buffer, orient, typ, dtype, convert_axes, convert_dates, keep_default_dates, numpy, precise_float, date_unit, encoding, lines, chunksize, compression, nrows, storage_options, encoding_errors)
    670     if not self.lines:
    671         raise ValueError("nrows can only be passed if lines=True")
--> 673 data = self._get_data_from_filepath(filepath_or_buffer)
    674 self.data = self._preprocess_data(data)

File /nvme/0/pgali/envs/cudfdev/lib/python3.8/site-packages/pandas/io/json/_json.py:710, in JsonReader._get_data_from_filepath(self, filepath_or_buffer)
    703 filepath_or_buffer = stringify_path(filepath_or_buffer)
    704 if (
    705     not isinstance(filepath_or_buffer, str)
    706     or is_url(filepath_or_buffer)
    707     or is_fsspec_url(filepath_or_buffer)
    708     or file_exists(filepath_or_buffer)
    709 ):
--> 710     self.handles = get_handle(
    711         filepath_or_buffer,
    712         "r",
    713         encoding=self.encoding,
    714         compression=self.compression,
    715         storage_options=self.storage_options,
    716         errors=self.encoding_errors,
    717     )
    718     filepath_or_buffer = self.handles.handle
    720 return filepath_or_buffer

File /nvme/0/pgali/envs/cudfdev/lib/python3.8/site-packages/pandas/io/common.py:826, in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
    821     is_wrapped = not (
    822         isinstance(ioargs.filepath_or_buffer, str) or ioargs.should_close
    823     )
    825 if "r" in ioargs.mode and not hasattr(handle, "read"):
--> 826     raise TypeError(
    827         "Expected file path name or file-like object, "
    828         f"got {type(ioargs.filepath_or_buffer)} type"
    829     )
    831 handles.reverse()  # close the most recently added buffer first
    832 if ioargs.should_close:

TypeError: Expected file path name or file-like object, got <class 'bytes'> type

Issue Description

Through the 1.3.x releases, pd.read_json parsed a bytes input and returned a DataFrame. In version 1.4.2 it raises a TypeError instead.

Expected Behavior

The same as in previous versions? Or is this a breaking change that was not explicitly called out in the changelog?

Out[11]: 
   amount     name
0     100    Alice
1     200      Bob
2     300  Charlie
3     400   Dennis

Installed Versions

In [4]: pd.show_versions()

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Input In [4], in <cell line: 1>()
----> 1 pd.show_versions()

File /nvme/0/pgali/envs/cudfdev/lib/python3.8/site-packages/pandas/util/_print_versions.py:109, in show_versions(as_json)
     94 """
     95 Provide useful information, important for bug reports.
     96 
   (...)
    106     * If True, outputs info in JSON format to the console.
    107 """
    108 sys_info = _get_sys_info()
--> 109 deps = _get_dependency_info()
    111 if as_json:
    112     j = {"system": sys_info, "dependencies": deps}

File /nvme/0/pgali/envs/cudfdev/lib/python3.8/site-packages/pandas/util/_print_versions.py:88, in _get_dependency_info()
     86 result: dict[str, JSONSerializable] = {}
     87 for modname in deps:
---> 88     mod = import_optional_dependency(modname, errors="ignore")
     89     result[modname] = get_version(mod) if mod else None
     90 return result

File /nvme/0/pgali/envs/cudfdev/lib/python3.8/site-packages/pandas/compat/_optional.py:138, in import_optional_dependency(name, extra, errors, min_version)
    133 msg = (
    134     f"Missing optional dependency '{install_name}'. {extra} "
    135     f"Use pip or conda to install {install_name}."
    136 )
    137 try:
--> 138     module = importlib.import_module(name)
    139 except ImportError:
    140     if errors == "raise":

File /nvme/0/pgali/envs/cudfdev/lib/python3.8/importlib/__init__.py:127, in import_module(name, package)
    125             break
    126         level += 1
--> 127 return _bootstrap._gcd_import(name[level:], package, level)

File <frozen importlib._bootstrap>:1014, in _gcd_import(name, package, level)

File <frozen importlib._bootstrap>:991, in _find_and_load(name, import_)

File <frozen importlib._bootstrap>:975, in _find_and_load_unlocked(name, import_)

File <frozen importlib._bootstrap>:671, in _load_unlocked(spec)

File <frozen importlib._bootstrap_external>:843, in exec_module(self, module)

File <frozen importlib._bootstrap>:219, in _call_with_frames_removed(f, *args, **kwds)

File /nvme/0/pgali/envs/cudfdev/lib/python3.8/site-packages/setuptools/__init__.py:8, in <module>
      5 import re
      6 import warnings
----> 8 import _distutils_hack.override  # noqa: F401
     10 import distutils.core
     11 from distutils.errors import DistutilsOptionError

File /nvme/0/pgali/envs/cudfdev/lib/python3.8/site-packages/_distutils_hack/override.py:1, in <module>
----> 1 __import__('_distutils_hack').do_override()

File /nvme/0/pgali/envs/cudfdev/lib/python3.8/site-packages/_distutils_hack/__init__.py:72, in do_override()
     70 if enabled():
     71     warn_distutils_present()
---> 72     ensure_local_distutils()

File /nvme/0/pgali/envs/cudfdev/lib/python3.8/site-packages/_distutils_hack/__init__.py:59, in ensure_local_distutils()
     57 # check that submodules load as expected
     58 core = importlib.import_module('distutils.core')
---> 59 assert '_distutils' in core.__file__, core.__file__
     60 assert 'setuptools._distutils.log' not in sys.modules

AssertionError: /nvme/0/pgali/envs/cudfdev/lib/python3.8/distutils/core.py

@galipremsagar galipremsagar added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels May 3, 2022
simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue May 4, 2022
@simonjayhawkins
Member

Thanks @galipremsagar for the report.

This may have happened to work previously, but off the top of my head I'm not sure how much support pandas still has, now that we have dropped Python 2, for accepting bytes where str is the documented accepted type.

Will label as a regression for now, pending further investigation.

first bad commit: [5f36af3] MAINT: rename IOError -> OSError (#43366)

cc @mwtoews

@simonjayhawkins simonjayhawkins added Regression Functionality that used to work in a prior pandas version IO JSON read_json, to_json, json_normalize and removed Needs Triage Issue that has not been reviewed by a pandas team member labels May 4, 2022
@simonjayhawkins simonjayhawkins added this to the 1.4.3 milestone May 4, 2022
@mwtoews
Contributor

mwtoews commented May 4, 2022

The documentation for pandas.read_json describes the first argument:

path_or_buf : a valid JSON str, path object or file-like object

so raising TypeError for a <class 'bytes'> type should be the appropriate outcome.

(One exception is pandas.read_excel, which accepts bytes as the binary content of an Excel file)


Nevertheless, the contents of "buffer" would need to be processed somehow into valid JSON. pd.read_json(buffer.decode()) with any recent version seems only to raise ValueError: Trailing data. Valid JSON would start with [, end with ], and join the lines with a comma (,):

[{"amount": 100, "name": "Alice"},
 {"amount": 200, "name": "Bob"},
 {"amount": 300, "name": "Charlie"},
 {"amount": 400, "name": "Dennis"}]
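
As a rough illustration of that conversion, the sketch below turns the newline-delimited bytes from the report into a single JSON array using only the standard library. The helper name json_lines_to_array is hypothetical (it is not part of pandas), and UTF-8 encoding is an assumption.

```python
import json


def json_lines_to_array(raw: bytes, encoding: str = "utf-8") -> str:
    """Convert newline-delimited JSON bytes into one JSON array string.

    Hypothetical helper for illustration only; not a pandas function.
    """
    text = raw.decode(encoding)
    # Parse every non-empty line as its own JSON document.
    records = [json.loads(line) for line in text.splitlines() if line.strip()]
    # Serializing the list produces "[{...}, {...}, ...]".
    return json.dumps(records)


buffer = (
    b'{"amount": 100, "name": "Alice"}\n'
    b'{"amount": 200, "name": "Bob"}\n'
    b'{"amount": 300, "name": "Charlie"}\n'
    b'{"amount": 400, "name": "Dennis"}\n'
)

array_text = json_lines_to_array(buffer)
print(array_text)
```

The result is a str holding a JSON array, which matches the documented path_or_buf type quoted above rather than relying on bytes being accepted.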

@mwtoews
Contributor

mwtoews commented May 4, 2022

An alternative solution using the original "buffer" input, splitting on newlines:

parts = [pd.read_json(itm, orient="index") for itm in buffer.decode().split("\n") if itm]
df = pd.concat(parts, axis=1).T.reset_index(drop=True)
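
Another option, sketched here on the assumption that the input is UTF-8 encoded with one JSON record per line, is to decode the bytes and rely on the lines parameter that appears in the read_json signature in the traceback above. Wrapping the decoded text in io.StringIO hands it to pandas as a file-like object.

```python
import io

import pandas as pd

buffer = (
    b'{"amount": 100, "name": "Alice"}\n'
    b'{"amount": 200, "name": "Bob"}\n'
    b'{"amount": 300, "name": "Charlie"}\n'
    b'{"amount": 400, "name": "Dennis"}\n'
)

# Decode the bytes (UTF-8 is an assumption) and expose the text as a
# file-like object.
text = io.StringIO(buffer.decode("utf-8"))

# lines=True asks read_json to treat each line as a separate JSON record.
df = pd.read_json(text, lines=True)
print(df)
```

This avoids the per-line concat and transpose, at the cost of assuming the encoding up front.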

@twoertwein
Member

Potential duplicate of #45935

@simonjayhawkins
Member

Potential duplicate of #45935

agreed. The comments in #45935 do not account for the issue being a regression (an undocumented behavior change) and should not have been changed without deprecation, but one could equally argue that this was not a breaking api change since the documentation does not include bytes as an accepted type.

will close as no action for same reasons as commented in #45935, but feel free to comment here if strongly disagree.

@simonjayhawkins simonjayhawkins modified the milestones: 1.4.3, No action May 23, 2022
joshzarrabi added a commit to joshzarrabi/health-equity-tracker that referenced this issue Jul 6, 2022
- pandas.read_json needs to take in a decoded string
- see pandas-dev/pandas#46935
joshzarrabi added a commit to SatcherInstitute/health-equity-tracker that referenced this issue Jul 6, 2022
- pandas.read_json needs to take in a decoded string
- see pandas-dev/pandas#46935
ebuddenberg added a commit to ebuddenberg/health-equity-tracker that referenced this issue Aug 21, 2023
- pandas.read_json needs to take in a decoded string
- see pandas-dev/pandas#46935