BUG: Cannot read pickle generate_legacy_storage_files version 1.4.2 under environment pandas version 1.5.0 (incompatibilites issue) #47090

Closed
3 tasks done
weikhor opened this issue May 22, 2022 · 6 comments · Fixed by #47104
Labels
Bug Frequency DateOffsets IO Pickle read_pickle, to_pickle

@weikhor
Contributor

weikhor commented May 22, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

In a Python environment with pandas version==1.4.2, run the current main version of generate_legacy_storage_files (https://github.com/pandas-dev/pandas/blob/main/pandas/tests/io/generate_legacy_storage_files.py) to generate a pickle file:

python -m tests.io.generate_legacy_storage_files tests/io/data/legacy_pickle/1.4.2/ pickle

Then load the pickle file with pandas version==1.5.0.dev0+794.g4f92db3b5d:

import pandas as pd
path = '/home/open_source/pandas_2/pandas/pandas/tests/io/data/legacy_pickle/1.4.2/1.4.2_x86_64_linux_3.9.7.pickle'
data = pd.read_pickle(path)

Issue Description

Reading a pickle file generated with pandas version==1.4.2 raises an error under pandas version==1.5.0.dev0+794.g4f92db3b5d:

Traceback (most recent call last):
  File "/home/open_source/pandas_2/pandas/pandas/io/pickle.py", line 205, in read_pickle
    return pickle.load(handles.handle)
  File "pandas/_libs/tslibs/timestamps.pyx", line 151, in pandas._libs.tslibs.timestamps._unpickle_timestamp
TypeError: _unpickle_timestamp() takes exactly 4 positional arguments (3 given)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/open_source/pandas_2/pandas/pandas/compat/pickle_compat.py", line 39, in load_reduce
    stack[-1] = func(*args)
  File "pandas/_libs/tslibs/timestamps.pyx", line 151, in pandas._libs.tslibs.timestamps._unpickle_timestamp
TypeError: _unpickle_timestamp() takes exactly 4 positional arguments (3 given)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/open_source/pandas_2/development/development_5.py", line 36, in <module>
    data = pd.read_pickle(path)
  File "/home/open_source/pandas_2/pandas/pandas/io/pickle.py", line 210, in read_pickle
    return pc.load(handles.handle, encoding=None)
  File "/home/open_source/pandas_2/pandas/pandas/compat/pickle_compat.py", line 272, in load
    return up.load()
  File "/root/anaconda3/lib/python3.9/pickle.py", line 1212, in load
    dispatch[key[0]](self)
  File "/home/open_source/pandas_2/pandas/pandas/compat/pickle_compat.py", line 60, in load_reduce
    elif args and issubclass(args[0], PeriodArray):
TypeError: issubclass() arg 1 must be a class

This is picking up an incompatibility issue: the function needs to be made backwards compatible.

  • (It looks like the default value of reso in _unpickle_timestamp needs to be set to NPY_FR_ns.)

Expected Behavior

pandas version==1.5.0.dev0+794.g4f92db3b5d should be able to read a pickle file generated with pandas version==1.4.2.

Installed Versions

python : 3.9.7.final.0
python-bits : 64
OS : Linux
OS-release : 5.10.16.3-microsoft-standard-WSL2
Version : #1 SMP Fri Apr 2 22:23:49 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.5.0.dev0+806.g7389a95737.dirty
numpy : 1.22.3
pytz : 2021.3
dateutil : 2.8.2
pip : 21.2.4
setuptools : 58.0.4
Cython : 0.29.24
pytest : 6.2.4
hypothesis : 6.42.0
sphinx : 4.2.0
blosc : None
feather : None
xlsxwriter : 3.0.1
lxml.etree : 4.6.3
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.3
IPython : 7.29.0
pandas_datareader: None
bs4 : 4.10.0
bottleneck : 1.3.2
brotli :
fastparquet : None
fsspec : 2021.08.1
gcsfs : None
matplotlib : 3.4.3
numba : None
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : 7.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.7.1
snappy : None
sqlalchemy : 1.4.22
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 2.0.1
xlwt : 1.3.0
zstandard : None

@weikhor added the Bug and Needs Triage labels May 22, 2022
@weikhor changed the title from "BUG:" to "BUG: Cannot read pickle generate_legacy_storage_files version 1.4.2 under environment pandas version 1.5.0 (incompatibilites issue)" May 22, 2022
@mroeschke added the Frequency (DateOffsets) and IO Pickle (read_pickle, to_pickle) labels and removed the Needs Triage label May 23, 2022
@mroeschke
Member

cc @jbrockmendel possibly related to Timestamp non-nano pickle changes

@jbrockmendel
Member

This came up on another thread. I think the solution was to add a default value reso=NPY_FR_ns to _unpickle_timestamp
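A minimal pure-Python sketch of why a default value would help, assuming an older pickle supplies only three positional arguments. The function names, the argument names other than reso, and the NS_RESO constant are hypothetical stand-ins, not the actual Cython code in timestamps.pyx:

```python
# Hypothetical stand-in for the nanosecond resolution constant (NPY_FR_ns).
NS_RESO = "ns"


def unpickle_strict(value, freq, tz, reso):
    """Newer signature without a default: older 3-argument payloads fail."""
    return {"value": value, "freq": freq, "tz": tz, "reso": reso}


def unpickle_compat(value, freq, tz, reso=NS_RESO):
    """Same signature, but reso falls back to nanoseconds when absent."""
    return {"value": value, "freq": freq, "tz": tz, "reso": reso}


# Arguments as an older pickle would supply them: there is no reso entry.
old_args = (1653177600000000000, None, None)

try:
    unpickle_strict(*old_args)
    strict_error = None
except TypeError as err:
    # Same class of failure as the "takes exactly 4 positional arguments" error.
    strict_error = err

# With the default in place, the older 3-argument payload is accepted.
restored = unpickle_compat(*old_args)
print(type(strict_error).__name__, restored["reso"])
```

Because the default only applies when the fourth argument is missing, newer pickles that do carry a resolution would be unaffected under this approach.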

@paulmadejong

paulmadejong commented Sep 8, 2023

It seems that this issue has come up again with pandas 2.1.0. Using 2.0.3 still works fine, so I have reverted my pipeline to 2.0.3 while awaiting a fix for 2.1.0. I couldn't find any changes to pickle mentioned in the 2.1.0 release notes, though.

The error in my pipeline, where my pytest now fails when comparing a reference set stored in a pickle file with a new result:

filepath_or_buffer = '/home/test/reference.pkl'
compression = 'infer', storage_options = None
    @doc(
        storage_options=_shared_docs["storage_options"],
        decompression_options=_shared_docs["decompression_options"] % "filepath_or_buffer",
    )
    def read_pickle(
        filepath_or_buffer: FilePath | ReadPickleBuffer,
        compression: CompressionOptions = "infer",
        storage_options: StorageOptions | None = None,
    ) -> DataFrame | Series:
        """
        Load pickled pandas object (or any object) from file.
    
        .. warning::
    
           Loading pickled data received from untrusted sources can be
           unsafe. See `here <https://docs.python.org/3/library/pickle.html>`__.
    
        Parameters
        ----------
        filepath_or_buffer : str, path object, or file-like object
            String, path object (implementing ``os.PathLike[str]``), or file-like
            object implementing a binary ``readlines()`` function.
            Also accepts URL. URL is not limited to S3 and GCS.
    
        {decompression_options}
    
            .. versionchanged:: 1.4.0 Zstandard support.
    
        {storage_options}
    
            .. versionadded:: 1.2.0
    
        Returns
        -------
        same type as object stored in file
    
        See Also
        --------
        DataFrame.to_pickle : Pickle (serialize) DataFrame object to file.
        Series.to_pickle : Pickle (serialize) Series object to file.
        read_hdf : Read HDF5 file into a DataFrame.
        read_sql : Read SQL query or database table into a DataFrame.
        read_parquet : Load a parquet object, returning a DataFrame.
    
        Notes
        -----
        read_pickle is only guaranteed to be backwards compatible to pandas 0.20.3
        provided the object was serialized with to_pickle.
    
        Examples
        --------
        >>> original_df = pd.DataFrame(
        ...     {{"foo": range(5), "bar": range(5, 10)}}
        ...    )  # doctest: +SKIP
        >>> original_df  # doctest: +SKIP
           foo  bar
        0    0    5
        1    1    6
        2    2    7
        3    3    8
        4    4    9
        >>> pd.to_pickle(original_df, "./dummy.pkl")  # doctest: +SKIP
    
        >>> unpickled_df = pd.read_pickle("./dummy.pkl")  # doctest: +SKIP
        >>> unpickled_df  # doctest: +SKIP
           foo  bar
        0    0    5
        1    1    6
        2    2    7
        3    3    8
        4    4    9
        """
        excs_to_catch = (AttributeError, ImportError, ModuleNotFoundError, TypeError)
        with get_handle(
            filepath_or_buffer,
            "rb",
            compression=compression,
            is_text=False,
            storage_options=storage_options,
        ) as handles:
            # 1) try standard library Pickle
            # 2) try pickle_compat (older pandas version) to handle subclass changes
            # 3) try pickle_compat with latin-1 encoding upon a UnicodeDecodeError
    
            try:
                # TypeError for Cython complaints about object.__new__ vs Tick.__new__
                try:
                    with warnings.catch_warnings(record=True):
                        # We want to silence any warnings about, e.g. moved modules.
                        warnings.simplefilter("ignore", Warning)
>                       return pickle.load(handles.handle)
.local/lib/python3.11/site-packages/pandas/io/pickle.py:206: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
values = <DatetimeArray>
[
['2020-06-15 20:19:59.400000',        '2020-06-15 20:20:00',
        '2020-06-15 20:20:00',        '...0:24:59.700000',
 '2020-06-15 20:24:59.700000', '2020-06-15 20:24:59.800000']
]
Shape: (1, 4291), dtype: datetime64[ns]
placement = slice(1, 2, 1)
    def new_block(
        values,
        placement: BlockPlacement,
        *,
        ndim: int,
        refs: BlockValuesRefs | None = None,
    ) -> Block:
        # caller is responsible for ensuring:
        # - values is NOT a NumpyExtensionArray
        # - check_ndim/ensure_block_shape already checked
        # - maybe_coerce_values already called/unnecessary
        klass = get_block_type(values.dtype)
>       return klass(values, ndim=ndim, placement=placement, refs=refs)
E       TypeError: Argument 'placement' has incorrect type (expected pandas._libs.internals.BlockPlacement, got slice)
.local/lib/python3.11/site-packages/pandas/core/internals/blocks.py:2400: TypeError
During handling of the above exception, another exception occurred:
self = <pandas.compat.pickle_compat.Unpickler object at 0x7f3b117d4890>
    def load_reduce(self):
        stack = self.stack
        args = stack.pop()
        func = stack[-1]
    
        try:
>           stack[-1] = func(*args)
.local/lib/python3.11/site-packages/pandas/compat/pickle_compat.py:35: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
values = <DatetimeArray>
[
['2020-06-15 20:19:59.400000',        '2020-06-15 20:20:00',
        '2020-06-15 20:20:00',        '...0:24:59.700000',
 '2020-06-15 20:24:59.700000', '2020-06-15 20:24:59.800000']
]
Shape: (1, 4291), dtype: datetime64[ns]
placement = slice(1, 2, 1)
    def new_block(
        values,
        placement: BlockPlacement,
        *,
        ndim: int,
        refs: BlockValuesRefs | None = None,
    ) -> Block:
        # caller is responsible for ensuring:
        # - values is NOT a NumpyExtensionArray
        # - check_ndim/ensure_block_shape already checked
        # - maybe_coerce_values already called/unnecessary
        klass = get_block_type(values.dtype)
>       return klass(values, ndim=ndim, placement=placement, refs=refs)
E       TypeError: Argument 'placement' has incorrect type (expected pandas._libs.internals.BlockPlacement, got slice)
.local/lib/python3.11/site-packages/pandas/core/internals/blocks.py:2400: TypeError
During handling of the above exception, another exception occurred:
    def test_result_METdata_verify():
        # Check equality by runnign create_ADD_observations with verify option and compare with reference
>       pd.testing.assert_frame_equal(new_result_with_verify(), reference_result_with_verify("pkl"))
tests/test_result_METdata.py:121: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
tests/test_result_METdata.py:37: in reference_result_with_verify
    return pd.read_pickle(os.path.join(AS_HOME, "tests", "ADSB_AIRSUP_MET_20200615_2020.pkl"))
.local/lib/python3.11/site-packages/pandas/io/pickle.py:211: in read_pickle
    return pc.load(handles.handle, encoding=None)
.local/lib/python3.11/site-packages/pandas/compat/pickle_compat.py:225: in load
    return up.load()
/usr/local/lib/python3.11/pickle.py:1213: in load
    dispatch[key[0]](self)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
self = <pandas.compat.pickle_compat.Unpickler object at 0x7f3b117d4890>
    def load_reduce(self):
        stack = self.stack
        args = stack.pop()
        func = stack[-1]
    
        try:
            stack[-1] = func(*args)
            return
        except TypeError as err:
            # If we have a deprecated function,
            # try to replace and try again.
    
            msg = "_reconstruct: First argument must be a sub-type of ndarray"
    
            if msg in str(err):
                try:
                    cls = args[0]
                    stack[-1] = object.__new__(cls)
                    return
                except TypeError:
                    pass
            elif args and isinstance(args[0], type) and issubclass(args[0], BaseOffset):
                # TypeError: object.__new__(Day) is not safe, use Day.__new__()
                cls = args[0]
                stack[-1] = cls.__new__(*args)
                return
>           elif args and issubclass(args[0], PeriodArray):
E           TypeError: issubclass() arg 1 must be a class
.local/lib/python3.11/site-packages/pandas/compat/pickle_compat.py:55: TypeError
----------------------------- Captured stderr call -----------------------------

@jbrockmendel
Member

@paulmadejong IIUC the fix here will be in BlockManager.__setstate__ where we need to add a check for slice objects and convert those to BlockPlacement objects.
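A rough sketch of that idea in plain Python, not the actual pandas internals. Placement, Block, and Manager are hypothetical stand-ins for BlockPlacement, the block classes, and BlockManager, and the state layout is assumed purely for illustration:

```python
class Placement:
    """Hypothetical stand-in for a placement wrapper such as BlockPlacement."""

    def __init__(self, indexer):
        self.indexer = indexer


class Block:
    """Hypothetical block that rejects raw slices, as in the traceback above."""

    def __init__(self, values, placement):
        if not isinstance(placement, Placement):
            raise TypeError(
                f"expected Placement, got {type(placement).__name__}"
            )
        self.values = values
        self.placement = placement


class Manager:
    """Hypothetical container whose __setstate__ tolerates the older layout."""

    def __setstate__(self, state):
        blocks = []
        for values, placement in state:
            # Older state stored a plain slice; wrap it before building the block.
            if isinstance(placement, slice):
                placement = Placement(placement)
            blocks.append(Block(values, placement))
        self.blocks = blocks


# State as an older version might have pickled it: placement is a raw slice.
old_state = [([1.0, 2.0, 3.0], slice(1, 2, 1))]

# Unpickling creates the object without __init__ and then calls __setstate__.
manager = Manager.__new__(Manager)
manager.__setstate__(old_state)
print(type(manager.blocks[0].placement).__name__)
```

Doing the conversion at the point where state is restored keeps the stricter type check in the constructor intact for all other callers.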

Do note that pickle is not intended for long-term storage. We make an effort at cross-version compatibility, but the guy who was most enthusiastic about that is less active than he once was, so it is less of a priority than it once was.

@paulmadejong

A fix would be appreciated, but also: which format is recommended for long-term storage such as tests? HDF5? Pickle is fast and relatively small, I’d say 😉

@jbrockmendel
Member

A fix would be appreciated

I'm taking some time off so am unlikely to do it myself anytime soon. But if you'd like to make a PR go for it!

which format is recommended for long term storage such as tests?

I'd suggest to_parquet/read_parquet

5 participants