BUG: Cannot read pickle generate_legacy_storage_files version 1.4.2 under environment pandas version 1.5.0 (incompatibilites issue) #47090

weikhor · 2022-05-22T07:26:52Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

In a Python environment with pandas 1.4.2, run the current main version of generate_legacy_storage_files (https://github.com/pandas-dev/pandas/blob/main/pandas/tests/io/generate_legacy_storage_files.py) to generate a pickle file:

python -m tests.io.generate_legacy_storage_files tests/io/data/legacy_pickle/1.4.2/ pickle

Then load the pickle file with pandas 1.5.0.dev0+794.g4f92db3b5d:

import pandas as pd

path = '/home/open_source/pandas_2/pandas/pandas/tests/io/data/legacy_pickle/1.4.2/1.4.2_x86_64_linux_3.9.7.pickle'
data = pd.read_pickle(path)

Issue Description

Reading a pickle written by pandas 1.4.2 with pandas 1.5.0.dev0+794.g4f92db3b5d raises the following error:

Traceback (most recent call last):
  File "/home/open_source/pandas_2/pandas/pandas/io/pickle.py", line 205, in read_pickle
    return pickle.load(handles.handle)
  File "pandas/_libs/tslibs/timestamps.pyx", line 151, in pandas._libs.tslibs.timestamps._unpickle_timestamp
TypeError: _unpickle_timestamp() takes exactly 4 positional arguments (3 given)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/open_source/pandas_2/pandas/pandas/compat/pickle_compat.py", line 39, in load_reduce
    stack[-1] = func(*args)
  File "pandas/_libs/tslibs/timestamps.pyx", line 151, in pandas._libs.tslibs.timestamps._unpickle_timestamp
TypeError: _unpickle_timestamp() takes exactly 4 positional arguments (3 given)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/open_source/pandas_2/development/development_5.py", line 36, in <module>
    data = pd.read_pickle(path)
  File "/home/open_source/pandas_2/pandas/pandas/io/pickle.py", line 210, in read_pickle
    return pc.load(handles.handle, encoding=None)
  File "/home/open_source/pandas_2/pandas/pandas/compat/pickle_compat.py", line 272, in load
    return up.load()
  File "/root/anaconda3/lib/python3.9/pickle.py", line 1212, in load
    dispatch[key[0]](self)
  File "/home/open_source/pandas_2/pandas/pandas/compat/pickle_compat.py", line 60, in load_reduce
    elif args and issubclass(args[0], PeriodArray):
TypeError: issubclass() arg 1 must be a class

This is an incompatibility between the two versions: the unpickling function needs to be made backwards compatible.

(It looks like the default value of reso in _unpickle_timestamp needs to be set to NPY_FR_ns.)
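
A minimal pure-Python sketch of that idea, not the actual Cython code. Pickle's REDUCE step effectively calls the stored function with the stored argument tuple, so giving a newly added parameter a default keeps three-argument payloads loadable. The names `unpickle_stamp_strict`, `unpickle_stamp_compat`, and `NS_RESO` are hypothetical stand-ins for `_unpickle_timestamp` and `NPY_FR_ns`.

```python
# Hypothetical stand-in for the NPY_FR_ns resolution code.
NS_RESO = "ns"


def unpickle_stamp_strict(value, freq, tz, reso):
    """New-style reconstructor: 'reso' is required, so old payloads fail."""
    return (value, freq, tz, reso)


def unpickle_stamp_compat(value, freq, tz, reso=NS_RESO):
    """Backwards-compatible reconstructor: 'reso' defaults to nanoseconds."""
    return (value, freq, tz, reso)


# Argument tuple as an older writer would have stored it (no 'reso').
old_args = (1_653_204_412_000_000_000, None, None)

# pickle's REDUCE opcode effectively performs func(*args).
try:
    unpickle_stamp_strict(*old_args)
    strict_ok = True
except TypeError:
    strict_ok = False

# With the default in place, the same three-argument tuple is accepted.
restored = unpickle_stamp_compat(*old_args)
```

The strict variant fails on the old tuple in the same way as the traceback above (four arguments expected, three given), while the variant with a default accepts both old and new payloads.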

Expected Behavior

A pickle written by pandas 1.4.2 can be read with pandas 1.5.0.dev0+794.g4f92db3b5d.

Installed Versions

python : 3.9.7.final.0
python-bits : 64
OS : Linux
OS-release : 5.10.16.3-microsoft-standard-WSL2
Version : #1 SMP Fri Apr 2 22:23:49 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.5.0.dev0+806.g7389a95737.dirty
numpy : 1.22.3
pytz : 2021.3
dateutil : 2.8.2
pip : 21.2.4
setuptools : 58.0.4
Cython : 0.29.24
pytest : 6.2.4
hypothesis : 6.42.0
sphinx : 4.2.0
blosc : None
feather : None
xlsxwriter : 3.0.1
lxml.etree : 4.6.3
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.3
IPython : 7.29.0
pandas_datareader: None
bs4 : 4.10.0
bottleneck : 1.3.2
brotli :
fastparquet : None
fsspec : 2021.08.1
gcsfs : None
matplotlib : 3.4.3
numba : None
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : 7.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.7.1
snappy : None
sqlalchemy : 1.4.22
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 2.0.1
xlwt : 1.3.0
zstandard : None


mroeschke · 2022-05-23T02:34:39Z

cc @jbrockmendel possibly related to Timestamp non-nano pickle changes

jbrockmendel · 2022-05-23T04:05:25Z

This came up on another thread. I think the solution was to add a default value reso=NPY_FR_ns to _unpickle_timestamp

paulmadejong · 2023-09-08T08:26:17Z

It seems that this issue has come up again with pandas 2.1.0. Using 2.0.3 still works fine, so I have reverted my pipeline to 2.0.3 while awaiting a fix for 2.1.0. I couldn't find any changes to pickle mentioned in the 2.1.0 release notes, though.

The error from my pipeline, where a pytest test now fails when comparing a reference set stored in a pickle file with a new result:

filepath_or_buffer = '/home/test/reference.pkl'
compression = 'infer', storage_options = None
    @doc(
        storage_options=_shared_docs["storage_options"],
        decompression_options=_shared_docs["decompression_options"] % "filepath_or_buffer",
    )
    def read_pickle(
        filepath_or_buffer: FilePath | ReadPickleBuffer,
        compression: CompressionOptions = "infer",
        storage_options: StorageOptions | None = None,
    ) -> DataFrame | Series:
        """
        Load pickled pandas object (or any object) from file.
    
        .. warning::
    
           Loading pickled data received from untrusted sources can be
           unsafe. See `here <https://docs.python.org/3/library/pickle.html>`__.
    
        Parameters
        ----------
        filepath_or_buffer : str, path object, or file-like object
            String, path object (implementing ``os.PathLike[str]``), or file-like
            object implementing a binary ``readlines()`` function.
            Also accepts URL. URL is not limited to S3 and GCS.
    
        {decompression_options}
    
            .. versionchanged:: 1.4.0 Zstandard support.
    
        {storage_options}
    
            .. versionadded:: 1.2.0
    
        Returns
        -------
        same type as object stored in file
    
        See Also
        --------
        DataFrame.to_pickle : Pickle (serialize) DataFrame object to file.
        Series.to_pickle : Pickle (serialize) Series object to file.
        read_hdf : Read HDF5 file into a DataFrame.
        read_sql : Read SQL query or database table into a DataFrame.
        read_parquet : Load a parquet object, returning a DataFrame.
    
        Notes
        -----
        read_pickle is only guaranteed to be backwards compatible to pandas 0.20.3
        provided the object was serialized with to_pickle.
    
        Examples
        --------
        >>> original_df = pd.DataFrame(
        ...     {{"foo": range(5), "bar": range(5, 10)}}
        ...    )  # doctest: +SKIP
        >>> original_df  # doctest: +SKIP
           foo  bar
        0    0    5
        1    1    6
        2    2    7
        3    3    8
        4    4    9
        >>> pd.to_pickle(original_df, "./dummy.pkl")  # doctest: +SKIP
    
        >>> unpickled_df = pd.read_pickle("./dummy.pkl")  # doctest: +SKIP
        >>> unpickled_df  # doctest: +SKIP
           foo  bar
        0    0    5
        1    1    6
        2    2    7
        3    3    8
        4    4    9
        """
        excs_to_catch = (AttributeError, ImportError, ModuleNotFoundError, TypeError)
        with get_handle(
            filepath_or_buffer,
            "rb",
            compression=compression,
            is_text=False,
            storage_options=storage_options,
        ) as handles:
            # 1) try standard library Pickle
            # 2) try pickle_compat (older pandas version) to handle subclass changes
            # 3) try pickle_compat with latin-1 encoding upon a UnicodeDecodeError
    
            try:
                # TypeError for Cython complaints about object.__new__ vs Tick.__new__
                try:
                    with warnings.catch_warnings(record=True):
                        # We want to silence any warnings about, e.g. moved modules.
                        warnings.simplefilter("ignore", Warning)
>                       return pickle.load(handles.handle)
.local/lib/python3.11/site-packages/pandas/io/pickle.py:206: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
values = <DatetimeArray>
[
['2020-06-15 20:19:59.400000',        '2020-06-15 20:20:00',
        '2020-06-15 20:20:00',        '...0:24:59.700000',
 '2020-06-15 20:24:59.700000', '2020-06-15 20:24:59.800000']
]
Shape: (1, 4291), dtype: datetime64[ns]
placement = slice(1, 2, 1)
    def new_block(
        values,
        placement: BlockPlacement,
        *,
        ndim: int,
        refs: BlockValuesRefs | None = None,
    ) -> Block:
        # caller is responsible for ensuring:
        # - values is NOT a NumpyExtensionArray
        # - check_ndim/ensure_block_shape already checked
        # - maybe_coerce_values already called/unnecessary
        klass = get_block_type(values.dtype)
>       return klass(values, ndim=ndim, placement=placement, refs=refs)
E       TypeError: Argument 'placement' has incorrect type (expected pandas._libs.internals.BlockPlacement, got slice)
.local/lib/python3.11/site-packages/pandas/core/internals/blocks.py:2400: TypeError
During handling of the above exception, another exception occurred:
self = <pandas.compat.pickle_compat.Unpickler object at 0x7f3b117d4890>
    def load_reduce(self):
        stack = self.stack
        args = stack.pop()
        func = stack[-1]
    
        try:
>           stack[-1] = func(*args)
.local/lib/python3.11/site-packages/pandas/compat/pickle_compat.py:35: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
values = <DatetimeArray>
[
['2020-06-15 20:19:59.400000',        '2020-06-15 20:20:00',
        '2020-06-15 20:20:00',        '...0:24:59.700000',
 '2020-06-15 20:24:59.700000', '2020-06-15 20:24:59.800000']
]
Shape: (1, 4291), dtype: datetime64[ns]
placement = slice(1, 2, 1)
    def new_block(
        values,
        placement: BlockPlacement,
        *,
        ndim: int,
        refs: BlockValuesRefs | None = None,
    ) -> Block:
        # caller is responsible for ensuring:
        # - values is NOT a NumpyExtensionArray
        # - check_ndim/ensure_block_shape already checked
        # - maybe_coerce_values already called/unnecessary
        klass = get_block_type(values.dtype)
>       return klass(values, ndim=ndim, placement=placement, refs=refs)
E       TypeError: Argument 'placement' has incorrect type (expected pandas._libs.internals.BlockPlacement, got slice)
.local/lib/python3.11/site-packages/pandas/core/internals/blocks.py:2400: TypeError
During handling of the above exception, another exception occurred:
    def test_result_METdata_verify():
        # Check equality by runnign create_ADD_observations with verify option and compare with reference
>       pd.testing.assert_frame_equal(new_result_with_verify(), reference_result_with_verify("pkl"))
tests/test_result_METdata.py:121: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
tests/test_result_METdata.py:37: in reference_result_with_verify
    return pd.read_pickle(os.path.join(AS_HOME, "tests", "ADSB_AIRSUP_MET_20200615_2020.pkl"))
.local/lib/python3.11/site-packages/pandas/io/pickle.py:211: in read_pickle
    return pc.load(handles.handle, encoding=None)
.local/lib/python3.11/site-packages/pandas/compat/pickle_compat.py:225: in load
    return up.load()
/usr/local/lib/python3.11/pickle.py:1213: in load
    dispatch[key[0]](self)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
self = <pandas.compat.pickle_compat.Unpickler object at 0x7f3b117d4890>
    def load_reduce(self):
        stack = self.stack
        args = stack.pop()
        func = stack[-1]
    
        try:
            stack[-1] = func(*args)
            return
        except TypeError as err:
            # If we have a deprecated function,
            # try to replace and try again.
    
            msg = "_reconstruct: First argument must be a sub-type of ndarray"
    
            if msg in str(err):
                try:
                    cls = args[0]
                    stack[-1] = object.__new__(cls)
                    return
                except TypeError:
                    pass
            elif args and isinstance(args[0], type) and issubclass(args[0], BaseOffset):
                # TypeError: object.__new__(Day) is not safe, use Day.__new__()
                cls = args[0]
                stack[-1] = cls.__new__(*args)
                return
>           elif args and issubclass(args[0], PeriodArray):
E           TypeError: issubclass() arg 1 must be a class
.local/lib/python3.11/site-packages/pandas/compat/pickle_compat.py:55: TypeError
----------------------------- Captured stderr call -----------------------------
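
The final TypeError above hides the real failure: `issubclass()` raises when its first argument is not a class, and here `args[0]` appears to be the `DatetimeArray` instance rather than a class. In the quoted source, the `BaseOffset` branch guards with `isinstance(args[0], type)`, while the `PeriodArray` branch does not. A minimal sketch of the difference, using a hypothetical helper `is_subclass_safe` and built-in types instead of pandas classes:

```python
def is_subclass_safe(candidate, parent):
    """Return True only if 'candidate' is a class deriving from 'parent'."""
    return isinstance(candidate, type) and issubclass(candidate, parent)


not_a_class = [1, 2, 3]  # an instance, like the array passed to new_block

# Unguarded call: issubclass() itself raises and masks the original error.
try:
    issubclass(not_a_class, list)
    unguarded_raised = False
except TypeError:
    unguarded_raised = True

# Guarded call: evaluates to False without raising, so a caller could
# fall through and re-raise the original TypeError instead.
guarded_result = is_subclass_safe(not_a_class, list)
```

With a guard like this the compat loader would not raise its own TypeError, which would make the underlying `placement` error easier to spot.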

jbrockmendel · 2023-09-08T16:20:03Z

@paulmadejong IIUC the fix here will be in BlockManager.__setstate__ where we need to add a check for slice objects and convert those to BlockPlacement objects.
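
A rough sketch of the kind of check described here, assuming nothing about the real `BlockManager.__setstate__` code. `Placement` and `ensure_placement` are hypothetical stand-ins for `BlockPlacement` and the conversion step; only the slice case from the traceback is modelled.

```python
class Placement:
    """Hypothetical stand-in for pandas' BlockPlacement."""

    def __init__(self, indexer):
        if not isinstance(indexer, slice):
            raise TypeError("this sketch only wraps slice objects")
        self.indexer = indexer

    def __eq__(self, other):
        return isinstance(other, Placement) and self.indexer == other.indexer


def ensure_placement(obj):
    """Wrap a raw slice from an old pickle; pass Placement objects through."""
    if isinstance(obj, Placement):
        return obj
    if isinstance(obj, slice):
        return Placement(obj)
    raise TypeError(f"unsupported placement type: {type(obj).__name__}")


# State as an older pickle might carry it: a bare slice.
from_old_pickle = ensure_placement(slice(1, 2, 1))

# State from a current writer: already wrapped, returned unchanged.
current = Placement(slice(0, 1, 1))
from_new_pickle = ensure_placement(current)
```

Normalizing the state once while restoring it means the block constructor only ever sees the wrapped type, whichever pandas version wrote the pickle.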

Do note that pickle is not intended for long-term storage. We make an effort at cross-version compatibility, but the guy who was most enthusiastic about that is less active than he once was, so it is less of a priority than it once was.

paulmadejong · 2023-09-08T17:35:34Z

A fix would be appreciated, but also: which format is recommended for long-term storage such as tests? HDF5? Pickle is fast and relatively small, I’d say 😉

jbrockmendel · 2023-09-12T22:59:50Z

"A fix would be appreciated"

I'm taking some time off so am unlikely to do it myself anytime soon. But if you'd like to make a PR go for it!

"which format is recommended for long term storage such as tests?"

I'd suggest to_parquet/read_parquet

weikhor added the Bug and Needs Triage labels May 22, 2022

weikhor changed the title ~~BUG:~~ BUG: Cannot read pickle generate_legacy_storage_files version 1.4.2 under environment pandas version 1.5.0 (incompatibilites issue) May 22, 2022

mroeschke added the Frequency (DateOffsets) and IO Pickle (read_pickle, to_pickle) labels and removed the Needs Triage label May 23, 2022

jorisvandenbossche added this to the 1.5 milestone May 23, 2022

weikhor mentioned this issue May 24, 2022

BUG Fix: Cannot read pickle generate_legacy_storage_files version 1.4.2 under environment pandas version 1.5.0 (incompatibilites issue) #47104

Merged

4 tasks

mroeschke closed this as completed in #47104 May 24, 2022
