PERF: Saving many datasets in a single group slows down with each new addition #58248
Comments
Thanks for the report! Between the first and the last run I'm seeing 0.2 ms and 34 ms spent in Line 1916 in 6e09e97.
But I have no familiarity with this code or PyTables, so I'm not sure if this is a pandas issue.
@rhshadrach Unfortunately this doesn't work because of what apparently happens at Lines 295 to 302 in 6e09e97.
From my point of view this is unexpected, and I would be curious to understand if it is really necessary.
@rhshadrach I've tested this with HDFStore and it doesn't seem that the slow write time is due to opening the file for every write:

import random
import string
import time

import matplotlib.pyplot as plt
import pandas as pd
import tqdm

# df is the DataFrame defined earlier in the example
size = 50000
timings = []
for i in tqdm.tqdm(range(0, size), total=size):
    key = ''.join(random.choices(string.ascii_uppercase, k=20))
    start = time.time()
    with pd.HDFStore("test2.h5", 'w') as store:
        store.put(f'entry/{key}', df)
    timings.append(time.time() - start)
plt.plot(timings[10:])
@rhshadrach in your example you are creating a new HDF5 file at each iteration, so the file always has at most one group (see Lines 295 to 302 in 6e09e97).
In this case you can observe the linear slowdown.
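A minimal sketch of the scenario described above (the DataFrame contents, file name, and iteration count are placeholders, not from the original report): reopening the same file in append mode means every put starts from an ever-larger group, which is where the slowdown shows up.

```python
import os
import random
import string
import time

import pandas as pd

df = pd.DataFrame({"a": range(10)})  # hypothetical small frame, stand-in for the real data

path = "test_append.h5"
if os.path.exists(path):
    os.remove(path)

timings = []
for i in range(100):  # even a modest count shows the trend
    key = "".join(random.choices(string.ascii_uppercase, k=20))
    start = time.time()
    # mode='a' keeps every previously written node, so each reopen makes
    # PyTables rescan an ever-larger group before performing the new put
    with pd.HDFStore(path, "a") as store:
        store.put(f"entry/{key}", df)
    timings.append(time.time() - start)

print(f"first 10 writes: {sum(timings[:10]):.3f}s, last 10 writes: {sum(timings[-10:]):.3f}s")
```

Plotting `timings` the way the earlier snippet does should show the same roughly linear growth.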
@avalentino right, my mistake. Opening with
Sorry, I have just realised that I have not tagged the correct person in my previous post.
@avalentino Just out of curiosity, how is the cache computed? I've noticed that if I split the datasets into multiple groups (still increasingly slow), it makes writing much faster even with the file-reopening issue.
Basically PyTables loads the list of all names in an HDF5 group before making any change to it. This is of course done only once and cached, mostly to support the "natural naming" feature. The time for the operation apparently depends on the number of nodes in the group. In principle "natural naming" for interactive shells is not a fundamental feature, but all the machinery has been there since the very beginning, so touching it could require some work. In any case, IMHO the way to go for your use case is to keep adding nodes without closing the file, so I do not see a compelling reason to start thinking about deep changes to PyTables. Do you agree?
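A sketch of the suggested workaround (the DataFrame and file name are assumptions): open the store once, so PyTables builds its group-name cache a single time and each subsequent put stays cheap.

```python
import random
import string

import pandas as pd

df = pd.DataFrame({"a": range(10)})  # hypothetical frame, stand-in for the real data

# Open once, write many: the per-group name listing happens a single
# time instead of once per dataset.
with pd.HDFStore("test_open_once.h5", "w") as store:
    for _ in range(100):
        key = "".join(random.choices(string.ascii_uppercase, k=20))
        store.put(f"entry/{key}", df)
```

With this pattern the per-write time should stay roughly flat instead of growing with the number of nodes already in the group.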
@avalentino thanks for the input! Yes, that seems a reasonable solution. Hopefully pandas will update their function to accept an HDFStore too.
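The group-splitting observation from earlier in the thread can be sketched like this (the bucket size and key names are hypothetical): spreading datasets across subgroups keeps each individual group small, so the per-group name scan stays cheap even when the file is reopened between writes.

```python
import pandas as pd

df = pd.DataFrame({"a": range(10)})  # hypothetical frame, stand-in for the real data

# 10 datasets per subgroup: no single group ever grows large, so the
# cost of PyTables' per-group name listing stays roughly constant.
with pd.HDFStore("test_buckets.h5", "w") as store:
    for i in range(100):
        bucket = i // 10
        store.put(f"entry/g{bucket:03d}/d{i:04d}", df)
```

This only mitigates the symptom; keeping the store open, as suggested above, avoids the repeated scan altogether.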
From a quick look, it does appear straightforward to expand this to accept an HDFStore. PRs are welcome!
@rhshadrach as far as I can tell, both the documentation and the implementation seem to suggest that the
…pandas-dev#58275)

* Avoid unnecessary re-opening of HDF5 files
* Update the whatsnew file
* Move the changelog entry for pandas-dev#58248 to the correct section
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this issue exists on the latest version of pandas.
I have confirmed this issue exists on the main branch of pandas.
Reproducible Example
I found a strange behaviour that seems to appear only with PyTables (pandas).
Saving many datasets within a single group becomes progressively slower.
This is not the case for h5py, which can easily write up to 100x more datasets without slowing down.

Installed Versions
Replace this line with the output of pd.show_versions()
Prior Performance
I have raised this issue in the PyTables repo, and it seems it is actually an issue with pandas: https://github.com/PyTables/PyTables/issues/1155