Skip to content

PERF: Fix reference leak in read_hdf #50714

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jan 16, 2023

Conversation

lithomas1
Copy link
Member

@lithomas1 lithomas1 commented Jan 12, 2023

With my really hacky benchmark, I get,


       before           after         ratio
     [bb18847e]       [b5ddc810]
     <main>           <fix-hdf-memusage>
-            6.6M                0     0.00  io.hdf.HDF.mem_read_hdf_index('table')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

@lithomas1 lithomas1 added Performance Memory or execution speed performance IO HDF5 read_hdf, HDFStore labels Jan 12, 2023
# TODO: Don't abuse internals, need to fix asv
# to detect memory of ndarray views properly
df1 = read_hdf(self.fname, "df1")
return df1.index._data.base # Will be None(0 bytes) if not a view
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a fan myself of how this is written.

asv's memory benchmarks use pympler.asizeof internally, but with a view of a numpy array, it'll only give us the size of the view.

peakmem avoids this problem by tracking RSS, but that doesn't catch this improvement, since this only improves the memory usage at the end.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we assert this is None in a unit test instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's probably better.

@mroeschke mroeschke added this to the 2.0 milestone Jan 16, 2023
@mroeschke mroeschke merged commit 9e5a62f into pandas-dev:main Jan 16, 2023
@mroeschke
Copy link
Member

Thanks @lithomas1

@lithomas1 lithomas1 deleted the fix-hdf-memusage branch March 14, 2023 12:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO HDF5 read_hdf, HDFStore Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Excessive memory usage loading Dataframe with mixed data types from HDF5 file saved in "table" format
2 participants