BUG: Excessive memory usage loading Dataframe with mixed data types from HDF5 file saved in "table" format #37441

Closed
capslockwizard opened this issue Oct 27, 2020 · 3 comments · Fixed by #50714
Labels: IO HDF5 (read_hdf, HDFStore) · Performance (Memory or execution speed)

Comments

@capslockwizard

Code Sample

Environment setup:

conda create -n bug_test python=3.8 pandas pytables numpy psutil
conda activate bug_test

Test code:

import psutil
import numpy as np
import pandas as pd
import os
import gc

# Build a ~1 GB DataFrame: three int64 columns plus one float64 column,
# indexed and sorted by a fourth int64 column, then saved in "table" format.
random_np = np.random.randint(0, 1e16, size=(25000000, 4))
random_df = pd.DataFrame(random_np)
random_df['Test'] = np.random.rand(25000000, 1)
random_df.set_index(0, inplace=True)
random_df.sort_index(inplace=True)
random_df.to_hdf('test.h5', key='random_df', mode='w', format='table')

# Drop the originals so they don't inflate the baseline measurement.
del random_np
del random_df
gc.collect()

# Baseline RSS before loading the file back.
initial_memory_usage = psutil.Process(os.getpid()).memory_info().rss

# Load the file and compare pandas' own accounting with the actual RSS growth.
random_df = pd.read_hdf('test.h5')
print(f'Memory Usage According to Pandas: {random_df.__sizeof__()/1000000000:.2f}GB')
print(f'Real Memory Usage: {(psutil.Process(os.getpid()).memory_info().rss - initial_memory_usage)/1000000000:.2f}GB')

# Temporary fix: deep-copy the index so it no longer references PyTables' buffer.
random_df.index = random_df.index.copy(deep=True)
print(f'Memory Usage After Temp Fix: {(psutil.Process(os.getpid()).memory_info().rss - initial_memory_usage)/1000000000:.2f}GB')

del random_df
gc.collect()
print(f'Memory Usage After Deleting Table: {(psutil.Process(os.getpid()).memory_info().rss - initial_memory_usage)/1000000000:.2f}GB')

Problem description

The above code generates a DataFrame of roughly 1 GB with mixed data types (int64 columns plus a float64 column) and saves it to an HDF5 file in "table" format.

Loading the HDF5 file back, we expect it to use about 1 GB of memory, but it uses 1.8 GB instead. I have found that the issue is with the index of the DataFrame: if I make a deep copy of the index and replace the original with it, the excess memory usage goes away and usage is 1 GB as expected.
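For reference, the workaround in isolation (a minimal sketch, assuming test.h5 was written by the sample above):

import pandas as pd

df = pd.read_hdf('test.h5')
# Deep-copying the index drops the last reference to the buffer
# PyTables read the file into, so GC can reclaim the extra ~0.8 GB.
df.index = df.index.copy(deep=True)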

I initially encountered this issue on pandas 1.0.5, but I have tested it on pandas 1.1.3 and the issue still exists.

When I was investigating the bug by going through the code of pandas 1.0.5, I noticed that PyTables is used to read the HDF5 file and returns a NumPy structured array. DataFrames are created from this NumPy array, and pd.concat is used to combine them into a single DataFrame. pd.concat makes a copy of the original data instead of just pointing to the NumPy array; however, the index of the combined table still points to the original NumPy array. I think this explains the excessive memory usage, because the GC cannot collect the NumPy array while there is still a reference to it.
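The mechanism can be demonstrated standalone (a sketch, not pandas internals): selecting one field of a NumPy structured array returns a view, and that view keeps the entire record buffer alive.

import numpy as np

arr = np.zeros(1_000_000, dtype=[('a', 'i8'), ('b', 'i8'), ('c', 'i8')])
col = arr['a']           # field access returns a view, not a copy
print(col.base is arr)   # True: col pins the whole 24 MB buffer, not just 8 MB
col = col.copy()         # an explicit copy severs the link
print(col.base is None)  # True: the original buffer can now be freed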

Due to significant code changes to read_hdf in pandas 1.1.3, I did not have time to find out whether this is still the same problem or a different one.

Expected Output

A 1 GB DataFrame should use about 1 GB of memory after loading, not 1.8 GB.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : db08276
python : 3.8.5.final.0
python-bits : 64
OS : Linux
OS-release : 4.19.104-microsoft-standard
Version : #1 SMP Wed Feb 19 06:37:35 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.1.3
numpy : 1.19.2
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.4
setuptools : 50.3.0.post20201006
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : 2.7.1
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : 3.6.1
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

@capslockwizard added the Bug and Needs Triage labels on Oct 27, 2020, and retitled the issue on Oct 27 and Nov 5, 2020.
@jbrockmendel added the IO HDF5 and Performance labels and removed Bug and Needs Triage on Jun 6, 2021.
@trevorkask (Contributor)

take

@lithomas1 (Member)

Looks like the index is a view of a structured array (which is what PyTables returns to us).

It looks like that view is created here, since adding .copy() to values at this spot seems to fix it:

pandas/pandas/io/pytables.py, lines 2059 to 2060 at 32b3308:

if values.dtype.fields is not None:
    values = values[self.cname]

I'll need to dig deeper to figure out why the view is persisting through the concat.
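A hypothetical sketch of that fix (simplified, not the actual pandas source):

if values.dtype.fields is not None:
    # Copy the selected field so the returned column no longer aliases
    # the full structured array that PyTables read the file into.
    values = values[self.cname].copy()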

@lithomas1 (Member)

#50673
