PERF: memory usage discrepancy pandas vs psutil #52329

Open

j-bennet opened this issue Mar 31, 2023 · 5 comments
Labels
Arrow (pyarrow functionality) · Performance (Memory or execution speed performance) · Strings (String extension data type and string data)

Comments


j-bennet commented Mar 31, 2023

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

I'm trying to measure the memory usage of a pandas Series holding string data as string[pyarrow] vs. object. While .memory_usage(deep=True) shows string[pyarrow] occupying less memory than object, the numbers look very different when measuring process memory with psutil. Repro script:

# proc_mem.py
import psutil
import os
import random
import string
import pandas as pd
import gc
import sys

def format_mb(n):
    return f"{n / 1024 ** 2:,.2f} MiB"

def random_string():
    # A random string of 10-100 letters, digits, and spaces.
    return "".join(random.choices(string.ascii_letters + string.digits + " ", k=random.randint(10, 100)))

def random_strings(n, n_unique=None):
    # Generate n strings drawn from a pool of n_unique distinct values.
    if n_unique is None:
        n_unique = n
    if n == n_unique:
        return (random_string() for _ in range(n_unique))
    choices = [random_string() for _ in range(n_unique)]
    return (random.choice(choices) for _ in range(n))

if __name__ == "__main__":
    if len(sys.argv) == 4:
        string_dtype = sys.argv[1]
        N = int(sys.argv[2])
        N_UNIQUE = int(sys.argv[3])
    else:
        print(f"Usage: {sys.argv[0]} STRING_DTYPE N N_UNIQUE")
        sys.exit(1)
    process = psutil.Process(os.getpid())
    mem1 = process.memory_info().rss  # RSS before building the Series
    print(f"{string_dtype = }, {N = :,}, {N_UNIQUE = :,}")
    print(f"before: {format_mb(mem1)}")
    s = pd.Series(random_strings(N, N_UNIQUE), dtype=string_dtype, copy=True)
    mem_s = s.memory_usage(deep=True)  # pandas' own estimate
    print(f"pandas reported: {format_mb(mem_s)}")
    mem2 = process.memory_info().rss  # RSS after building the Series
    print(f"psutil reported: {format_mb(mem2 - mem1)}")
    del s
    gc.collect()
    mem3 = process.memory_info().rss  # RSS after dropping the Series
    print(f"released: {format_mb(mem2 - mem3)}")

This produces the following output:

> python proc_mem.py 'object' 1_000_000 1_000_000
string_dtype = 'object', N = 1,000,000, N_UNIQUE = 1,000,000
before: 82.80 MiB
pandas reported: 106.84 MiB
psutil reported: 129.78 MiB
released: 104.98 MiB

> python proc_mem.py 'string[pyarrow]' 1_000_000 1_000_000
string_dtype = 'string[pyarrow]', N = 1,000,000, N_UNIQUE = 1,000,000
before: 82.39 MiB
pandas reported: 56.25 MiB
psutil reported: 117.33 MiB
released: 52.39 MiB

pandas tells me that the string[pyarrow] Series consumes 56 MiB of memory, while psutil reports an increase of 117 MiB.
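
As a cross-check, Arrow's own allocator can be queried directly. A minimal sketch (assuming the default memory pool is in use; pa.total_allocated_bytes() and MemoryPool.max_memory() report current and peak pool allocation, which need not match RSS):

# Sketch: compare pandas' estimate against Arrow's own allocator accounting.
import pandas as pd
import pyarrow as pa

before = pa.total_allocated_bytes()
s = pd.Series(["x" * 50] * 1_000_000, dtype="string[pyarrow]")
after = pa.total_allocated_bytes()

print(f"pandas deep : {s.memory_usage(deep=True) / 1024 ** 2:,.2f} MiB")
print(f"arrow pool  : {(after - before) / 1024 ** 2:,.2f} MiB")
print(f"pool peak   : {pa.default_memory_pool().max_memory() / 1024 ** 2:,.2f} MiB")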

With the object dtype, the pandas and psutil numbers are very close. But if not all strings in my dataset are unique, the picture changes:

> python proc_mem.py 'object' 1_000_000 100_000
string_dtype = 'object', N = 1,000,000, N_UNIQUE = 100,000
before: 82.81 MiB
pandas reported: 106.81 MiB
psutil reported: 34.14 MiB
released: 7.98 MiB

> python proc_mem.py 'string[pyarrow]' 1_000_000 100_000
string_dtype = 'string[pyarrow]', N = 1,000,000, N_UNIQUE = 100,000
before: 83.05 MiB
pandas reported: 56.38 MiB
psutil reported: 119.30 MiB
released: 52.53 MiB

Nothing changed with string[pyarrow]: pandas still reports about half the memory usage that psutil does.

With object, pandas now tells me that memory usage is 106 MiB, while psutil sees only a 34 MiB increase, and only about 8 MiB is released after gc.collect().

What is going on here, and how can I explain these results?

I saw a few other issues that are possibly related as well.

Installed Versions

INSTALLED VERSIONS

commit : c2a7f1a
python : 3.11.0.final.0
python-bits : 64
OS : Darwin
OS-release : 22.4.0
Version : Darwin Kernel Version 22.4.0: Mon Mar 6 20:59:58 PST 2023; root:xnu-8796.101.5~3/RELEASE_ARM64_T6020
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.0.0rc1
numpy : 1.24.2
pytz : 2023.2
dateutil : 2.8.2
setuptools : 67.6.0
pip : 23.0.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.11.0
pandas_datareader: None
bs4 : 4.12.0
bottleneck : None
brotli :
fastparquet : None
fsspec : 2023.3.0
gcsfs : None
matplotlib : 3.7.1
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 11.0.0
pyreadstat : None
pyxlsb : None
s3fs : 2023.3.0
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : None
qtpy : None
pyqt5 : None

Prior Performance

No response

j-bennet added Needs Triage and Performance labels Mar 31, 2023
@j-bennet (Author)

cc @jrbourbeau

@rhshadrach (Member)

Is part of this due to Python's string caching? https://stackoverflow.com/a/16757434

@j-bennet (Author)

> Is part of this due to Python's string caching? https://stackoverflow.com/a/16757434

Pretty sure this is the case for the not-all-unique dataset, yes. But then, shouldn't pandas know that the data doesn't really occupy 106 MiB?
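
To illustrate (a minimal sketch; as far as I understand, deep=True sums sys.getsizeof over every element, so repeated references to the same string object are counted once per occurrence):

# Sketch: one shared string object, a million references to it.
import sys
import pandas as pd

shared = "x" * 50                          # a single 50-char str object
s = pd.Series([shared] * 1_000_000, dtype=object)

# deep=True counts the string at full size once per element...
print(f"deep estimate: {s.memory_usage(deep=True) / 1024 ** 2:,.2f} MiB")
# ...but the process holds only one copy of the payload.
print(f"shared object: {sys.getsizeof(shared)} bytes")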

@j-bennet (Author)

I see more than one weird thing in this experiment:

  1. If Arrow strings should occupy only 56 MiB according to pandas, why do I see a 117 MiB bump in process memory (~2x)?
  2. With only 1 in 10 strings being unique, and assuming CPython string caching is a factor, why does pandas still estimate memory usage as 106 MiB?

Basically, from this experiment, it looks like I don't save any memory by using Arrow strings, and if my data is not completely unique, object strings would actually fare better because of internal Python optimizations. Is that the conclusion I should draw?

DeaMariaLeon added Strings and Arrow labels and removed the Needs Triage label Apr 4, 2023
@rhshadrach (Member)

> But then, shouldn't pandas know that the data doesn't really occupy 106 MiB?

I don't see why/how pandas would know this. Currently we just take the size for each value.
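
For comparison, an identity-aware estimate would have to deduplicate by object, along these lines (a rough sketch, not what pandas does; deduped_object_bytes is a hypothetical helper):

# Sketch: count each distinct Python object once (by id), unlike
# memory_usage(deep=True), which counts every occurrence at full size.
import sys

def deduped_object_bytes(series):
    values = series.to_numpy()
    seen = set()
    total = values.nbytes                  # the array of object pointers itself
    for obj in values:
        if id(obj) not in seen:
            seen.add(id(obj))
            total += sys.getsizeof(obj)    # each payload counted once
    return total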
