PERF: Saving many datasets in a single group slows down with each new addition #58248
Comments
Thanks for the report! Between the first and the last run I'm seeing 0.2 ms and 34 ms spent in Line 1916 in 6e09e97.
But I have no familiarity with this code or PyTables, so I'm not sure if this is a pandas issue.
@rhshadrach Unfortunately this doesn't work because of what apparently happens at Lines 295 to 302 in 6e09e97.
From my point of view this is unexpected, and I would be curious to understand if it is really necessary.
@rhshadrach I've tested this with HDFStore and it doesn't seem that the slow write time is due to opening the file for every write:

import random
import string
import time

import matplotlib.pyplot as plt
import pandas as pd
import tqdm

# df is the DataFrame defined earlier in the example
size = 50000
timings = []
for i in tqdm.tqdm(range(0, size), total=size):
    key = ''.join(random.choices(string.ascii_uppercase, k=20))
    start = time.time()
    with pd.HDFStore("test2.h5", 'w') as store:
        store.put(f'entry/{key}', df)
    timings.append(time.time() - start)
plt.plot(timings[10:])
@rhshadrach in your example you are creating a new HDF5 file at each iteration, so the file always has at most one group (see Lines 295 to 302 in 6e09e97).
In this case you can observe the linear slowdown.
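A minimal sketch of the scenario described above (the DataFrame contents, file name, and iteration count are placeholders, not from the original report): reopening the same file in append mode means every put starts from an ever-larger group, which is where the slowdown shows up.

```python
import os
import random
import string
import time

import pandas as pd

df = pd.DataFrame({"a": range(10)})  # hypothetical small frame, stand-in for the real data

path = "test_append.h5"
if os.path.exists(path):
    os.remove(path)

timings = []
for i in range(100):  # even a modest count shows the trend
    key = "".join(random.choices(string.ascii_uppercase, k=20))
    start = time.time()
    # mode='a' keeps every previously written node, so each reopen makes
    # PyTables rescan an ever-larger group before performing the new put
    with pd.HDFStore(path, "a") as store:
        store.put(f"entry/{key}", df)
    timings.append(time.time() - start)

print(f"first 10 writes: {sum(timings[:10]):.3f}s, last 10 writes: {sum(timings[-10:]):.3f}s")
```

Plotting `timings` the way the earlier snippet does should show the same roughly linear growth.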
@avalentino right, my mistake. Opening with
Sorry, I have just realised that I have not tagged the correct person in my previous post.
@avalentino Just out of curiosity, how is the cache computed? I've noticed that if I split the datasets into multiple groups (still increasingly slow), it makes writing much faster even with the file-reopening issue.
Basically PyTables loads the list of all names in an HDF5 group before making any change to it. This is of course done only once and cached, mostly to support the "natural naming" feature. The time for the operation apparently depends on the number of nodes in the group. In principle "natural naming" for interactive shells is not a fundamental feature, but all the machinery has been there since the very beginning, so touching it could require some work. In any case, IMHO the way to go for your use case is to keep adding nodes without closing the file, so I do not see a compelling reason to start thinking about deep changes to PyTables. Do you agree?
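A sketch of the suggested workaround (the DataFrame and file name are assumptions): open the store once, so PyTables builds its group-name cache a single time and each subsequent put stays cheap.

```python
import random
import string

import pandas as pd

df = pd.DataFrame({"a": range(10)})  # hypothetical frame, stand-in for the real data

# Open once, write many: the per-group name listing happens a single
# time instead of once per dataset.
with pd.HDFStore("test_open_once.h5", "w") as store:
    for _ in range(100):
        key = "".join(random.choices(string.ascii_uppercase, k=20))
        store.put(f"entry/{key}", df)
```

With this pattern the per-write time should stay roughly flat instead of growing with the number of nodes already in the group.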
@avalentino thanks for the input! Yes, that seems a reasonable solution. Hopefully pandas will update their function to accept an HDFStore too.
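The group-splitting observation from earlier in the thread can be sketched like this (the bucket size and key names are hypothetical): spreading datasets across subgroups keeps each individual group small, so the per-group name scan stays cheap even when the file is reopened between writes.

```python
import pandas as pd

df = pd.DataFrame({"a": range(10)})  # hypothetical frame, stand-in for the real data

# 10 datasets per subgroup: no single group ever grows large, so the
# cost of PyTables' per-group name listing stays roughly constant.
with pd.HDFStore("test_buckets.h5", "w") as store:
    for i in range(100):
        bucket = i // 10
        store.put(f"entry/g{bucket:03d}/d{i:04d}", df)
```

This only mitigates the symptom; keeping the store open, as suggested above, avoids the repeated scan altogether.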
From a quick look, it does appear straightforward to expand this to accept an HDFStore. PRs are welcome!
@rhshadrach as far as I can tell, both the documentation and the implementation seem to suggest that the
…pandas-dev#58275)

* Avoid unnecessary re-opening of HDF5 files
* Update the whatsnew file
* Move the changelog entry for pandas-dev#58248 to the correct section
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this issue exists on the latest version of pandas.
I have confirmed this issue exists on the main branch of pandas.
Reproducible Example
I found a strange behaviour that seems to appear only with PyTables (pandas).
Saving many datasets within a single group becomes progressively slower.
This is not the case for h5py, which can easily write up to 100x more datasets without slowing down.

Installed Versions
Replace this line with the output of pd.show_versions()
Prior Performance
I have raised this issue in the PyTables repo, and it seems it is actually an issue with pandas: https://github.com/PyTables/PyTables/issues/1155