
Memory leak in Dataframe.memory_usage #29411


Closed
hyfjjjj opened this issue Nov 5, 2019 · 3 comments · Fixed by #56614
Labels
DataFrame (DataFrame data structure), Performance (Memory or execution speed performance)

Comments


hyfjjjj commented Nov 5, 2019

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd
import gc
import os
import psutil

def get_process_memory():
  # resident set size (RSS) of the current process, in MB
  return round(psutil.Process(os.getpid()).memory_info().rss / float(2 ** 20), 2)

# a wide table: 50 columns of 10 floats each
test_dict = {}
for i in range(0, 50):
  test_dict[i] = np.empty(10)

# 1000 DataFrames built from the same dict
dfs = []
for i in range(0, 1000):
  df = pd.DataFrame(test_dict)
  dfs.append(df)

gc.collect()
# before
print('memory usage (before "memory_usage"):\t{} MB'.format(get_process_memory()))

for df in dfs:
  df.memory_usage(index=True, deep=True)

gc.collect()
# after
print('memory usage (after "memory_usage"):\t{} MB'.format(get_process_memory()))

Problem description

DataFrame's memory_usage function has a memory leak: memory usage after executing memory_usage should be the same as before, but it is not (see the screenshot below).

[Screenshot, 2019-11-05: memory usage printed before and after calling memory_usage]

Expected Output

None

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.16.final.0
python-bits: 64
OS: Darwin
OS-release: 19.0.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: zh_CN.UTF-8
LOCALE: None.None

pandas: 0.24.2
pytest: None
pip: 19.3.1
setuptools: 19.6.1
Cython: 0.29.13
numpy: 1.16.5
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.8.1
pytz: 2019.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

gfyoung added the DataFrame label Nov 6, 2019

gfyoung (Member) commented Nov 6, 2019

I can confirm that there is indeed increased memory use after running this script. Does this happen for you if you pass different values to memory_usage?


hyfjjjj (Author) commented Nov 7, 2019

> I can confirm that there is indeed increased memory use after running this script. Does this happen for you if you pass different values to memory_usage?

Yes. I've tested several cases, and all show a significant increase in memory use.

jorisvandenbossche (Member) commented

I don't think what we are seeing is a "memory leak", but just the overhead of creating Series objects, which are kept alive because of the _item_cache (#50547).
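
One way to check this with the reproduction script above (a hedged sketch for diagnosis only: _item_cache is a private pandas attribute, and its removal is exactly what #50547 tracks):

# Continuing the script from the issue description (reuses dfs, gc and
# get_process_memory). Per the explanation above, the per-column Series built
# while computing memory_usage(deep=True) are held in each DataFrame's private
# _item_cache dict; emptying it and collecting garbage should bring the
# process memory back down, consistent with "cached Series, not a leak".
for df in dfs:
    df._item_cache.clear()

gc.collect()
print('memory usage (after clearing _item_cache):\t{} MB'.format(get_process_memory()))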

To illustrate this: first, if you adapt the script to use columns of 10,000 elements instead of 10 (and create 100 dataframes of 20 columns instead of 1000 of 50, to keep memory somewhat limited), you still see the memory usage increase after calling memory_usage(), but by much less. In my case it went from 263.89 MB to 267.54 MB.

Second, a Series object has quite some overhead, and relative to the data this overhead is large for a small Series. For example, for a Series of 10 int64/float64 elements the actual data is 80 bytes, but the Series object itself is almost a kilobyte:

In [1]: df = pd.DataFrame({'a': range(10)})

In [2]: s = df['a']

In [3]: s.nbytes
Out[3]: 80

In [5]: import sys

In [6]: total_size = sys.getsizeof(s)

In [7]: total_size
Out[7]: 224

In [8]: for attr in [s._is_copy, s._mgr, s._mgr.axes, s._mgr.blocks, s._mgr.refs, s._item_cache, s._attrs, s._flags, s._flags._allows_duplicate_labels, s._flags._obj, s._name]:
   ...:     total_size += sys.getsizeof(attr)
   ...: 

In [9]: total_size
Out[9]: 790

In [10]: (total_size * 50 * 1000) / 1024**2
Out[10]: 37.670135498046875

I don't know how accurate this estimate of the size of the Series object is, but it is at least much bigger than the actual data in the case of 10 rows (and, based on this estimate, it amounts to many MBs for the example script).

For this example of many wide dataframes (n_cols > n_rows), this Series-creation overhead starts to count. But in the (more typical) case of more rows than columns, you will often not notice it.

That said, it should still be relatively easy to fix this for memory_usage, as there is no need for it to call self.items(): it could directly iterate over the underlying arrays instead of creating a Series for each column (apart from removing the item cache altogether, xref #50547).
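
For illustration only, here is a rough sketch of what iterating over the underlying arrays could look like. It is not the actual pandas change (the issue header says the fix landed via #56614); the function name is made up, it relies on private attributes (df._mgr, Block.values, Block.mgr_locs) that can change between pandas versions, and it uses a simplified deep estimate for object columns:

import sys

import numpy as np
import pandas as pd

def memory_usage_without_series(df, index=True, deep=False):
    # Rough sketch: walk the private block manager so that no per-column
    # Series is created and nothing is added to the item cache.
    sizes = np.zeros(len(df.columns), dtype=np.int64)
    for blk in df._mgr.blocks:
        values = blk.values
        locs = blk.mgr_locs.as_array           # column positions covered by this block
        per_col = values.nbytes // len(locs)   # same dtype and length per column, so an even split is exact
        if deep and values.dtype == object:
            # crude approximation of the Python-object payload, split evenly across the block's columns
            per_col += sum(sys.getsizeof(v) for v in np.asarray(values).ravel()) // len(locs)
        sizes[locs] += per_col
    result = pd.Series(sizes, index=df.columns)
    if index:
        result = pd.concat([pd.Series({"Index": df.index.memory_usage(deep=deep)}), result])
    return result

On the reproduction script above, calling memory_usage_without_series(df, index=True, deep=True) instead of df.memory_usage(index=True, deep=True) should leave the item cache untouched, so the before/after numbers stay close.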
