DOC/DEPR: to_hdf produces a different file each time #51456


Open
2 of 3 tasks
cschell opened this issue Feb 17, 2023 · 14 comments
Labels
Deprecate (Functionality to remove in pandas) · Docs · IO HDF5 (read_hdf, HDFStore)

Comments

@cschell

cschell commented Feb 17, 2023

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

# %% setup
import numpy as np
import hashlib

test_data = np.arange(9).reshape(3,3)

# %% hash of pandas/pytables output changes every time


import pandas as pd
pd.DataFrame(test_data).to_hdf("file_a.h5", key="test")
print("pandas hash", hashlib.md5(open("file_a.h5", "rb").read()).hexdigest())


# %% h5py hash stays the same
import h5py

with h5py.File("file_b.h5", "w") as f:
    dset = f.create_dataset("demo", (3,3), dtype='f')
    dset[...] = test_data

print("h5py hash", hashlib.md5(open("file_b.h5", "rb").read()).hexdigest())

Issue Description

Currently, to_hdf produces a different file on each run, even though the data itself does not change. Perhaps timestamps or similar metadata are added at a lower level, which ultimately produces different hashes for practically identical contents.

Expected Behavior

I would expect to_hdf to be deterministic, i.e. to write the same bits to disk if the input stays the same. In my case this would help with versioning HDF5 files produced by pandas, since I could use an md5/sha/whatever hash to reference a specific version and verify that my pipeline is producing the same (or different) data files.
As a comparison, h5py produces the same file each time.

Installed Versions

INSTALLED VERSIONS

commit : 8dab54d
python : 3.10.6.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.0-52-generic
Version : #58-Ubuntu SMP Thu Oct 13 08:03:55 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_GB.UTF-8
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 1.5.2
numpy : 1.23.5
pytz : 2022.6
dateutil : 2.8.2
setuptools : 65.6.3
pip : 23.0
Cython : 0.29.32
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 3.0.3
lxml.etree : 4.9.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.8.0
pandas_datareader: None
bs4 : 4.8.2
bottleneck : None
brotli : None
fastparquet : None
fsspec : 2022.11.0
gcsfs : None
matplotlib : 3.6.2
numba : None
numexpr : 2.8.4
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.10.0
snappy : None
sqlalchemy : None
tables : 3.8.0
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : None
zstandard : None
tzdata : 2022.6

@cschell cschell added the Bug and Needs Triage (Issue that has not been reviewed by a pandas team member) labels Feb 17, 2023
@cschell cschell changed the title from "BUG:" to "BUG: to_hdf produces a different file each time" Feb 17, 2023
@bashtage
Contributor

While I agree that the files differ, h5diff reports them as identical, and I also found them identical when I hand-compared the contents in HDFView. I suspect this issue is upstream, as I can't see how pandas could be producing this difference.

@bashtage
Contributor

One observation from

cmp -l file_a.h5 fileba.h5 | gawk '{printf "%08X %02X %02X\n", $1, strtonum(0$2), strtonum(0$3)}'
00000A75 CB 28
00000A76 B5 B8
00001605 CB 28
00001606 B5 B8
00001845 CB 28
00001846 B5 B8
00001A1D CB 28
00001A1E B5 B8

is that the places where the files differ repeat.
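
For reference, here is a rough stdlib Python equivalent of that `cmp -l` pipeline (the helper name `diff_offsets` is my own, not from the thread). Note that it reports 0-based offsets, whereas `cmp -l` prints 1-based positions:

```python
# Sketch: list (offset, byte_a, byte_b) for every position where two files differ.
def diff_offsets(path_a: str, path_b: str) -> list[tuple[int, int, int]]:
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        a, b = fa.read(), fb.read()
    # zip() stops at the shorter file, so this only compares the common prefix;
    # a separate length check would catch trailing differences.
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]
```

On the two HDF5 files above this should report the same four repeating pairs of differing offsets that the cmp -l output shows.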

@rbenes
Contributor

rbenes commented Feb 17, 2023

We had the same issue a few years ago, opened an issue (#32682), and later an MR. You have to use HDFStore and its put method (https://pandas.pydata.org/docs/reference/api/pandas.HDFStore.put.html) with track_times set to False; the checksums will then be consistent.
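
Following that advice, a minimal sketch (my own, not taken from the linked MR; note that later comments in this thread report that format="table" and index=None are also needed for hashes to be stable across separate runs):

```python
import hashlib

import pandas as pd


def write_and_hash(df: pd.DataFrame, path: str, key: str = "data") -> str:
    """Write df via HDFStore.put with deterministic options and hash the file."""
    with pd.HDFStore(path, mode="w") as store:
        # track_times=False suppresses the per-run ctime; format="table" and
        # index=None were reported later in this thread as also required.
        store.put(key, df, format="table", index=None, track_times=False)
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()


df = pd.DataFrame({"a": [1, 2, 3]})
h1 = write_and_hash(df, "run1.h5")
h2 = write_and_hash(df, "run2.h5")
print(h1 == h2)  # identical input should now give identical bytes
```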

@bashtage
Contributor

It would be great if someone added an explanation of this to the Notes section.

@bashtage
Contributor

Should track_times=False become the default? AFAICT the ctime being set is invisible to pandas.

@rbenes
Contributor

rbenes commented Feb 20, 2023

Should track_times=False become the default? AFAICT the ctime being set is invisible to pandas.

See comment in original MR

IMHO a comment in the Notes section mentioning the consistency of hashes would be fine.

@phofl phofl added the IO HDF5 (read_hdf, HDFStore) label Feb 28, 2023
@topper-123
Contributor

topper-123 commented May 15, 2023

I think we can agree this is not a bug; the question is whether we've set a bad default here, i.e. whether we want to change track_times to False internally in to_hdf.

I can see the utility of storing the creation time, but also of getting the same checksum each time, so I guess a decision just has to be made. I'm OK with documenting this on the to_hdf method.

@topper-123 topper-123 added the Docs label and removed the Bug and Needs Triage (Issue that has not been reviewed by a pandas team member) labels May 15, 2023
@topper-123 topper-123 changed the title from "BUG: to_hdf produces a different file each time" to "DOC/DEPR: to_hdf produces a different file each time" May 15, 2023
@topper-123 topper-123 added the Deprecate (Functionality to remove in pandas) label May 15, 2023
@bashtage
Contributor

I also think this is caused by a bad default. Perhaps switch the default to False and add a note saying that to track file creation times, set track_times=True.

@topper-123
Contributor

I also think this is caused by a bad default. Perhaps switch the default to False and add a note saying that to track file creation times, set track_times=True.

I think that's reasonable. If people want to save the time, they can use HDFStore.

@puer-robustus

I feel this is more involved than just setting the track_times default to False internally in to_hdf, assuming to_hdf shares the same functionality as HDFStore.put() for writing HDF5 files under the hood.

More specifically, running the following script several times shows that the md5 hash is only stable across runs if HDFStore.put() is passed track_times=False in conjunction with index=None and format="table":

import hashlib

import numpy as np
import pandas as pd

test_data = np.arange(9).reshape(3,3)

df = pd.DataFrame(test_data)
key = "test"

df.to_hdf('to_hdf.h5', key=key)
print(hashlib.md5(open("to_hdf.h5", "rb").read()).hexdigest(), '   to_hdf')

for i, kws in enumerate((
    {'track_times': False},
    {'track_times': False, 'index': None},
    {'track_times': False, 'format': "table"},
    {'track_times': False, 'format': "table", 'index': None},
)):

    with pd.HDFStore(f'store_{i}.h5') as store:
        store.put(key=key, value=df, **kws)
    print(hashlib.md5(open(f"store_{i}.h5", "rb").read()).hexdigest(), f'   hdfstore.put({kws})')

Is this a known limitation?

@bashtage
Contributor

bashtage commented Jan 9, 2025

Can you dig into the HDF5 files and find the difference across the runs with track_times=False?

@puer-robustus

Can you dig into the HDF5 files and find the difference across the runs with track_times=False?

Not sure how much this helps, but quickly hexdumping the store_1.h5 files from two different runs and diffing them gives me this:

--- hex11	2025-01-09 22:49:24.485460481 +0100
+++ hex21	2025-01-09 22:49:37.513569191 +0100
@@ -93,7 +93,7 @@
 0000a40 0202 0102 0000 0000 0008 0018 0000 0000
 0000a50 0103 0c48 0000 0000 0000 0018 0000 0000
 0000a60 0000 0000 0000 0000 0012 0008 0000 0000
-0000a70 0001 0000 43e5 6780 000c 0028 0000 0000
+0000a70 0001 0000 4417 6780 000c 0028 0000 0000
 0000a80 0001 0006 0008 0008 4c43 5341 0053 0000
 0000a90 1013 0000 0005 0000 0001 0000 0000 0000
 0000aa0 5241 4152 0059 0000 000c 0028 0000 0000
@@ -152,7 +152,7 @@
 00015d0 0202 0102 0000 0000 0008 0018 0000 0000
 00015e0 0103 0c60 0000 0000 0000 0018 0000 0000
 00015f0 0000 0000 0000 0000 0012 0008 0000 0000
-0001600 0001 0000 43e5 6780 000c 0028 0000 0000
+0001600 0001 0000 4417 6780 000c 0028 0000 0000
 0001610 0001 0006 0008 0008 4c43 5341 0053 0000
 0001620 1013 0000 0005 0000 0001 0000 0000 0000
 0001630 5241 4152 0059 0000 000c 0028 0000 0000
@@ -188,7 +188,7 @@
 0001810 0202 0102 0000 0000 0008 0018 0000 0000
 0001820 0103 0c78 0000 0000 0000 0048 0000 0000
 0001830 0000 0000 0000 0000 0012 0008 0000 0000
-0001840 0001 0000 43e5 6780 000c 0028 0000 0000
+0001840 0001 0000 4417 6780 000c 0028 0000 0000
 0001850 0001 0006 0008 0008 4c43 5341 0053 0000
 0001860 1013 0000 0005 0000 0001 0000 0000 0000
 0001870 5241 4152 0059 0000 000c 0028 0000 0000
@@ -217,7 +217,7 @@
 00019e0 0005 0008 0001 0000 0202 0102 0000 0000
 00019f0 0008 0018 0000 0000 0103 0cc0 0000 0000
 0001a00 0000 0018 0000 0000 0000 0000 0000 0000
-0001a10 0012 0008 0000 0000 0001 0000 43e5 6780
+0001a10 0012 0008 0000 0000 0001 0000 4417 6780
 0001a20 000c 0028 0000 0000 0001 0006 0008 0008
 0001a30 4c43 5341 0053 0000 1013 0000 0005 0000
 0001a40 0001 0000 0000 0000 5241 4152 0059 0000

@bashtage
Contributor

I can't reproduce this. I modified your code to write the data multiple times per configuration.

import hashlib
from collections import defaultdict
import numpy as np
import pandas as pd

test_data = np.arange(9).reshape(3,3)

df = pd.DataFrame(test_data)
key = "test"

df.to_hdf('to_hdf.h5', key=key)
print(hashlib.md5(open("to_hdf.h5", "rb").read()).hexdigest(), '   to_hdf')

hashes = defaultdict(set)
configs = (
    {'track_times': False},
    {'track_times': False, 'index': None},
    {'track_times': False, 'format': "table"},
    {'track_times': False, 'format': "table", 'index': None},
)
for kws in configs:
    for _ in range(10):
        base = str(kws).replace(" ","").replace("'","").replace(":","-").replace("}","").replace("{","").replace(",","-")
        file_name = f"store {base}.h5"
        with pd.HDFStore(file_name) as store:
            print(key, kws)
            store.put(key=key, value=df, **kws)
        digest = hashlib.md5(open(file_name, "rb").read()).hexdigest()
        hashes_key = tuple([(k,v) for k,v in kws.items()])
        hashes[hashes_key].update([digest])
        print(digest, f'   hdfstore.put({kws})')

The key output is hashes, a dictionary of hash values per config. Within a particular set of keyword arguments the hashes are the same, which can be seen from the single hash per config.

In [39]: hashes
Out[39]:
defaultdict(set,
            {(('track_times', False),): {'310256407fa6993982e0d6e7222301c0'},
             (('track_times', False),
              ('index', None)): {'310256407fa6993982e0d6e7222301c0'},
             (('track_times', False),
              ('format', 'table')): {'4733726fb8707bcb43a0124652b1c961'},
             (('track_times', False),
              ('format', 'table'),
              ('index', None)): {'7ea9df78420924c89dde093f0e88b709'}})

@puer-robustus

Hmm, looks like a "double" Heisenbug to me:

  1. Running your snippet (with an additional print(hashes) on the last line) as a script multiple times gives me different hashes for the same config across runs for 3 of the 4 combinations, even though the underlying data in the dataframe hasn't changed. Compare e.g.:

    defaultdict(<class 'set'>, {(('track_times', False),):{'cd81dd1b5d3fd327b5c4684953b67d5f'}, (('track_times', False), ('index', None)): {'cd81dd1b5d3fd327b5c4684953b67d5f'}, (('track_times', False), ('format', 'table')): {'9685b9e2369ce3b099d1d1e9799ba0b0'}, (('track_times', False), ('format', 'table'), ('index', None)): {'7ea9df78420924c89dde093f0e88b709'}})
    

    with the hashes you posted: the only one that is identical is 7ea9df78420924c89dde093f0e88b709 for (('track_times', False), ('format', 'table'), ('index', None)), which supports my hypothesis that the hashes are only reproducible (in the sense of depending only on the content of the dataframe) with this particular combination of keyword args.

    If the timestamp is not necessarily the problem, maybe we're dealing with some seed value that changes across invocations of python3 and is only reliably handled in the special case of (('track_times', False), ('format', 'table'), ('index', None))?

  2. The sets usually have only one value, but in some runs I've seen sets with more than one element (albeit only for the combination (('track_times', False), ('format', 'table'))):

    defaultdict(<class 'set'>, {(('track_times', False),): {'ee18548fec005905d736df1343ed50e0'}, (('track_times', False), ('index', None)): {'ee18548fec005905d736df1343ed50e0'}, (('track_times', False), ('format', 'table')): {'a1a0649bdff88258cd7a1323df3f180e', 'c70aa56e53a307577684c27f2fc0e73d'}, (('track_times', False), ('format', 'table'), ('index', None)): {'7ea9df78420924c89dde093f0e88b709'}})
    
