
PERF: high memory consumption for unstack #54373


Closed
3 tasks done
hkad98 opened this issue Aug 2, 2023 · 9 comments
Labels
Needs Info (Clarification about behavior needed to assess issue) · Performance (Memory or execution speed performance) · Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)

Comments


hkad98 commented Aug 2, 2023

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

I use memory_profiler. Run "mprof run file_name.py" to measure the memory consumption of each line of the @profile-decorated function.

import pandas as pd
import random
import string
from memory_profiler import profile


def random_string():
    return ''.join(random.choices(string.ascii_letters, k=7))


@profile
def main():
    records_count = 63531
    # Six string columns of widely varying cardinality (2 to 14580 distinct
    # values each) plus one integer measure column "M".
    df = pd.DataFrame(
        {
            "A": random.choices([random_string() for _ in range(24)], k=records_count),
            "B": random.choices([random_string() for _ in range(14580)], k=records_count),
            "C": random.choices([random_string() for _ in range(9)], k=records_count),
            "D": random.choices([random_string() for _ in range(2311)], k=records_count),
            "E": random.choices([random_string() for _ in range(2)], k=records_count),
            "F": random.choices([random_string() for _ in range(280)], k=records_count),
            "M": random.sample(range(0, records_count), records_count),
        }
    )

    # Group on all six string columns, then unstack the "F" level; the
    # memory spike shows up on the unstack line.
    grouped_df = df.groupby(["A", "B", "C", "D", "E", "F"], dropna=False)[["M"]].sum(min_count=1, numeric_only=False)
    grouped_df.unstack("F")


if __name__ == "__main__":
    main()

Memory usage for unstack:

    27    264.1 MiB    171.8 MiB           1       grouped_df.unstack("F")
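
For scale, a rough back-of-envelope estimate (my own reasoning, assuming almost every row forms its own group, which the cardinalities above make very likely): unstacking the ~280 distinct "F" values materializes a dense, mostly-NaN float64 frame whose payload alone is on the order of the observed increment.

rows = 63531  # roughly the number of distinct (A, B, C, D, E) combinations
cols = 280    # distinct "F" values
print(rows * cols * 8 / 2**20)  # ~135.7 MiB of float64 data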

I tried to reduce memory consumption with the following changes (sketched in the snippet after this list):

  • using category dtype instead of plain strings
  • using pivot_table instead of groupby + unstack
  • using reset_index + pivot instead of unstack

None of the above helped.
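
A sketch of those three alternatives, assuming df and grouped_df from the reproducer above (the variable names here are illustrative):

# 1) category dtype for the string columns
df_cat = df.astype({col: "category" for col in ["A", "B", "C", "D", "E", "F"]})

# 2) pivot_table instead of groupby + unstack
pt = df.pivot_table(index=["A", "B", "C", "D", "E"], columns="F", values="M", aggfunc="sum")

# 3) reset_index + pivot instead of unstack
pv = grouped_df.reset_index().pivot(index=["A", "B", "C", "D", "E"], columns="F", values="M")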

I also tried turning on Copy-on-Write (CoW), but it made things worse:

  28    390.6 MiB    306.1 MiB           1       grouped_df.unstack("F")

I am aware that CoW does not support unstack yet (#49473), but I would not expect enabling it to make unstack worse.
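
For reference, enabling CoW is a single option; this pandas 2.x spelling is my assumption of the exact line used:

import pandas as pd

pd.set_option("mode.copy_on_write", True)  # enable Copy-on-Write globally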

Installed Versions

INSTALLED VERSIONS

commit : 0f43794
python : 3.10.8.final.0
python-bits : 64
OS : Darwin
OS-release : 22.5.0
Version : Darwin Kernel Version 22.5.0: Thu Jun 8 22:22:20 PDT 2023; root:xnu-8796.121.3~7/RELEASE_ARM64_T6000
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : None.UTF-8

pandas : 2.0.3
numpy : 1.25.2
pytz : 2023.3
dateutil : 2.8.2
setuptools : 63.2.0
pip : 22.2.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

Prior Performance

No response

hkad98 added the Needs Triage and Performance labels on Aug 2, 2023
phofl added the Reshaping label on Aug 6, 2023
phofl (Member) commented Aug 6, 2023

Hm, I can't reproduce this, also on an ARM Mac:

 29    141.8 MiB     34.1 MiB           1       grouped_df.unstack("F")

Can you provide a simpler reproducer? Fewer columns and without groupby.
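
For illustration, a stripped-down reproducer of the requested shape might look like the sketch below (my construction, reusing the "D" and "F" cardinalities from the original example; it draws unique (D, F) pairs so unstack does not fail on duplicate index entries):

import numpy as np
import pandas as pd

n = 63531
# Sample unique (D, F) pairs from the 2311 x 280 grid.
pairs = np.random.choice(2311 * 280, size=n, replace=False)
idx = pd.MultiIndex.from_arrays([pairs // 280, pairs % 280], names=["D", "F"])
s = pd.Series(np.arange(n, dtype="int64"), index=idx)
s.unstack("F")  # the line to profile with memory_profiler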

phofl added the Needs Info label and removed the Needs Triage label on Aug 6, 2023
hkad98 (Author) commented Aug 6, 2023

@phofl I will try to find a simpler reproducer. May I ask how grouped_df.unstack("F") ended up on line 29 of your profile? Did you add any lines? Which pandas version did you use? And what is wrong with groupby?

phofl (Member) commented Aug 6, 2023

I added one line to activate/deactivate CoW; otherwise the script was copied as is. Examples should always be as simple as possible, i.e. contain no unnecessary operations. If your problem occurs in unstack, then the groupby is unnecessary.

hkad98 (Author) commented Aug 6, 2023

@phofl what about the following code?

import pandas as pd
from memory_profiler import profile

@profile
def main():
    df = pd.read_parquet("reproducer.parquet")
    df.unstack("F")


if __name__ == "__main__":
    main()

I put reproducer.parquet in archiv.zip.
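
The file was presumably written from the grouped frame of the original reproducer ("F" has to be an index level for df.unstack("F") to work), so its creation was likely something like this (my assumption):

grouped_df.to_parquet("reproducer.parquet")  # pyarrow preserves the MultiIndex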

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    11     77.0 MiB     77.0 MiB           1   @profile
    12                                         def main():
    13    115.3 MiB     38.3 MiB           1       df = pd.read_parquet("reproducer.parquet")
    14    290.0 MiB    174.8 MiB           1       df.unstack("F")

Still getting huge memory consumption for unstack.

phofl (Member) commented Aug 6, 2023

Nope, still can't reproduce:

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     4    146.7 MiB    146.7 MiB           1   @profile
     5                                         def main():
     6    178.8 MiB     32.1 MiB           1       df = pd.read_parquet("reproducer.parquet")
     7    215.9 MiB     37.0 MiB           1       df.unstack("F")

hkad98 (Author) commented Aug 6, 2023

That is strange. Which version of pandas did you use?

phofl (Member) commented Aug 6, 2023

Tried on main and on 2.0.3.

hkad98 (Author) commented Aug 7, 2023

Hi @phofl, I tried to reproduce the issue independently, and I think I succeeded. See the following runs in my public repo.

Ubuntu: https://github.com/hkad98/pandas-reproducer/actions/runs/5784679774/job/15675843719
MacOS: https://github.com/hkad98/pandas-reproducer/actions/runs/5784759695/job/15676084594

Note that both runners use x86.

Unstack increment:

  • Ubuntu: 5.2 MiB
  • macOS: 29.2 MiB (~6 times more)

Unfortunately, GitHub does not provide runners with ARM architecture. I tried running the same script locally on Ubuntu + ARM, and the results were the same as on Ubuntu with x86. I think the issue lies in the combination of macOS 13 and ARM.

mroeschke (Member) commented

Seems like the original issue wasn't directly reproducible, so closing.
