PERF: memory usage discrepancy pandas vs psutil #52329

Open

j-bennet opened this issue Mar 31, 2023 · 5 comments
Labels
Arrow (pyarrow functionality) · Performance (Memory or execution speed performance) · Strings (String extension data type and string data)

Comments


j-bennet commented Mar 31, 2023

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

I'm trying to measure the memory usage of a pandas Series holding string data as string[pyarrow] vs. object. While .memory_usage(deep=True) shows string[pyarrow] occupying less memory than object, the numbers look very different when measuring process memory with psutil. Repro script:

# proc_mem.py
import psutil
import os
import random
import string
import pandas as pd
import gc
import sys

def format_mb(n):
    return f"{n / 1024 ** 2:,.2f} MiB"

def random_string():
    # A random string of 10-100 letters, digits, and spaces.
    return "".join(random.choices(string.ascii_letters + string.digits + " ", k=random.randint(10, 100)))

def random_strings(n, n_unique=None):
    # Generate n strings drawn from a pool of n_unique distinct values.
    if n_unique is None:
        n_unique = n
    if n == n_unique:
        return (random_string() for _ in range(n_unique))
    choices = [random_string() for _ in range(n_unique)]
    return (random.choice(choices) for _ in range(n))

if __name__ == "__main__":
    if len(sys.argv) == 4:
        string_dtype = sys.argv[1]
        N = int(sys.argv[2])
        N_UNIQUE = int(sys.argv[3])
    else:
        print(f"Usage: {sys.argv[0]} STRING_DTYPE N N_UNIQUE")
        sys.exit(1)
    process = psutil.Process(os.getpid())
    mem1 = process.memory_info().rss  # RSS before building the Series
    print(f"{string_dtype = }, {N = :,}, {N_UNIQUE = :,}")
    print(f"before: {format_mb(mem1)}")
    s = pd.Series(random_strings(N, N_UNIQUE), dtype=string_dtype, copy=True)
    mem_s = s.memory_usage(deep=True)  # pandas' own estimate
    print(f"pandas reported: {format_mb(mem_s)}")
    mem2 = process.memory_info().rss  # RSS after building the Series
    print(f"psutil reported: {format_mb(mem2 - mem1)}")
    del s
    gc.collect()
    mem3 = process.memory_info().rss  # RSS after dropping the Series
    print(f"released: {format_mb(mem2 - mem3)}")

This produces the following output:

> python proc_mem.py 'object' 1_000_000 1_000_000
string_dtype = 'object', N = 1,000,000, N_UNIQUE = 1,000,000
before: 82.80 MiB
pandas reported: 106.84 MiB
psutil reported: 129.78 MiB
released: 104.98 MiB

> python proc_mem.py 'string[pyarrow]' 1_000_000 1_000_000
string_dtype = 'string[pyarrow]', N = 1,000,000, N_UNIQUE = 1,000,000
before: 82.39 MiB
pandas reported: 56.25 MiB
psutil reported: 117.33 MiB
released: 52.39 MiB

pandas tells me that the string[pyarrow] Series consumes 56 MiB of memory, while psutil reports an increase of 117 MiB.
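
As a cross-check, Arrow's own allocator can be queried directly. A minimal sketch (assuming the default memory pool is in use; pa.total_allocated_bytes() and MemoryPool.max_memory() report current and peak pool allocation, which need not match RSS):

# Sketch: compare pandas' estimate against Arrow's own allocator accounting.
import pandas as pd
import pyarrow as pa

before = pa.total_allocated_bytes()
s = pd.Series(["x" * 50] * 1_000_000, dtype="string[pyarrow]")
after = pa.total_allocated_bytes()

print(f"pandas deep : {s.memory_usage(deep=True) / 1024 ** 2:,.2f} MiB")
print(f"arrow pool  : {(after - before) / 1024 ** 2:,.2f} MiB")
print(f"pool peak   : {pa.default_memory_pool().max_memory() / 1024 ** 2:,.2f} MiB")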

With the object dtype, the pandas and psutil numbers are very close. But if not all strings in my dataset are unique, the picture changes:

> python proc_mem.py 'object' 1_000_000 100_000
string_dtype = 'object', N = 1,000,000, N_UNIQUE = 100,000
before: 82.81 MiB
pandas reported: 106.81 MiB
psutil reported: 34.14 MiB
released: 7.98 MiB

> python proc_mem.py 'string[pyarrow]' 1_000_000 100_000
string_dtype = 'string[pyarrow]', N = 1,000,000, N_UNIQUE = 100,000
before: 83.05 MiB
pandas reported: 56.38 MiB
psutil reported: 119.30 MiB
released: 52.53 MiB

Nothing changed with string[pyarrow]: pandas still reports about half the memory usage that psutil does.

With object, pandas now tells me that memory usage is 106 MiB, while psutil sees only a 34 MiB increase, and only about 8 MiB is released after gc.collect().

What is going on here, and how can I explain these results?

I saw a few other issues that are possibly related as well.

Installed Versions

INSTALLED VERSIONS

commit : c2a7f1a
python : 3.11.0.final.0
python-bits : 64
OS : Darwin
OS-release : 22.4.0
Version : Darwin Kernel Version 22.4.0: Mon Mar 6 20:59:58 PST 2023; root:xnu-8796.101.5~3/RELEASE_ARM64_T6020
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.0.0rc1
numpy : 1.24.2
pytz : 2023.2
dateutil : 2.8.2
setuptools : 67.6.0
pip : 23.0.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.11.0
pandas_datareader: None
bs4 : 4.12.0
bottleneck : None
brotli :
fastparquet : None
fsspec : 2023.3.0
gcsfs : None
matplotlib : 3.7.1
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 11.0.0
pyreadstat : None
pyxlsb : None
s3fs : 2023.3.0
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : None
qtpy : None
pyqt5 : None

Prior Performance

No response

j-bennet added Needs Triage and Performance labels Mar 31, 2023
@j-bennet (Author)

cc @jrbourbeau

@rhshadrach (Member)

Is part of this due to Python's string caching? https://stackoverflow.com/a/16757434

@j-bennet (Author)

> Is part of this due to Python's string caching? https://stackoverflow.com/a/16757434

Pretty sure this is the case for the not-all-unique dataset, yes. But then, shouldn't pandas know that the data doesn't really occupy 106 MiB?
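
To illustrate (a minimal sketch; as far as I understand, deep=True sums sys.getsizeof over every element, so repeated references to the same string object are counted once per occurrence):

# Sketch: one shared string object, a million references to it.
import sys
import pandas as pd

shared = "x" * 50                          # a single 50-char str object
s = pd.Series([shared] * 1_000_000, dtype=object)

# deep=True counts the string at full size once per element...
print(f"deep estimate: {s.memory_usage(deep=True) / 1024 ** 2:,.2f} MiB")
# ...but the process holds only one copy of the payload.
print(f"shared object: {sys.getsizeof(shared)} bytes")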

@j-bennet (Author)

I see more than one weird thing in this experiment:

  1. If Arrow strings should occupy only 56 MiB according to pandas, why do I see a 117 MiB bump in process memory (~2x)?
  2. With only 1 in 10 strings being unique, and assuming CPython string caching is a factor, why does pandas still estimate memory usage as 106 MiB?

Basically, from this experiment, it looks like I don't save any memory by using Arrow strings, and if my data is not completely unique, object strings would actually fare better because of internal Python optimizations. Is that the conclusion I should draw?

DeaMariaLeon added Strings and Arrow labels and removed the Needs Triage label Apr 4, 2023
@rhshadrach (Member)

> But then, shouldn't pandas know that the data doesn't really occupy 106 MiB?

I don't see why/how pandas would know this. Currently we just take the size for each value.
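
For comparison, an identity-aware estimate would have to deduplicate by object, along these lines (a rough sketch, not what pandas does; deduped_object_bytes is a hypothetical helper):

# Sketch: count each distinct Python object once (by id), unlike
# memory_usage(deep=True), which counts every occurrence at full size.
import sys

def deduped_object_bytes(series):
    values = series.to_numpy()
    seen = set()
    total = values.nbytes                  # the array of object pointers itself
    for obj in values:
        if id(obj) not in seen:
            seen.add(id(obj))
            total += sys.getsizeof(obj)    # each payload counted once
    return total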
