
PERF: high memory consumption for unstack #54373


Closed
3 tasks done
hkad98 opened this issue Aug 2, 2023 · 9 comments
Labels
Needs Info (Clarification about behavior needed to assess issue) · Performance (Memory or execution speed performance) · Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)

Comments


hkad98 commented Aug 2, 2023

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

I use memory_profiler. Run "mprof run file_name.py" to measure the memory consumption of each line of the @profile-decorated function.

import pandas as pd
import random
import string
from memory_profiler import profile


def random_string():
    return ''.join(random.choices(string.ascii_letters, k=7))


@profile
def main():
    records_count = 63531
    # Six string columns of widely varying cardinality (2 to 14580 distinct
    # values each) plus one integer measure column "M".
    df = pd.DataFrame(
        {
            "A": random.choices([random_string() for _ in range(24)], k=records_count),
            "B": random.choices([random_string() for _ in range(14580)], k=records_count),
            "C": random.choices([random_string() for _ in range(9)], k=records_count),
            "D": random.choices([random_string() for _ in range(2311)], k=records_count),
            "E": random.choices([random_string() for _ in range(2)], k=records_count),
            "F": random.choices([random_string() for _ in range(280)], k=records_count),
            "M": random.sample(range(0, records_count), records_count),
        }
    )

    # Group on all six string columns, then unstack the "F" level; the
    # memory spike shows up on the unstack line.
    grouped_df = df.groupby(["A", "B", "C", "D", "E", "F"], dropna=False)[["M"]].sum(min_count=1, numeric_only=False)
    grouped_df.unstack("F")


if __name__ == "__main__":
    main()

Memory usage for unstack:

    27    264.1 MiB    171.8 MiB           1       grouped_df.unstack("F")
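
For scale, a rough back-of-envelope estimate (my own reasoning, assuming almost every row forms its own group, which the cardinalities above make very likely): unstacking the ~280 distinct "F" values materializes a dense, mostly-NaN float64 frame whose payload alone is on the order of the observed increment.

rows = 63531  # roughly the number of distinct (A, B, C, D, E) combinations
cols = 280    # distinct "F" values
print(rows * cols * 8 / 2**20)  # ~135.7 MiB of float64 data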

I tried to reduce memory consumption with the following changes (sketched in the snippet after this list):

  • using category dtype instead of plain strings
  • using pivot_table instead of groupby + unstack
  • using reset_index + pivot instead of unstack

None of the above helped.
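
A sketch of those three alternatives, assuming df and grouped_df from the reproducer above (the variable names here are illustrative):

# 1) category dtype for the string columns
df_cat = df.astype({col: "category" for col in ["A", "B", "C", "D", "E", "F"]})

# 2) pivot_table instead of groupby + unstack
pt = df.pivot_table(index=["A", "B", "C", "D", "E"], columns="F", values="M", aggfunc="sum")

# 3) reset_index + pivot instead of unstack
pv = grouped_df.reset_index().pivot(index=["A", "B", "C", "D", "E"], columns="F", values="M")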

I also tried turning on Copy-on-Write (CoW), but it made things worse:

  28    390.6 MiB    306.1 MiB           1       grouped_df.unstack("F")

I am aware that CoW does not support unstack yet (#49473), but I would not expect enabling it to make unstack worse.
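
For reference, enabling CoW is a single option; this pandas 2.x spelling is my assumption of the exact line used:

import pandas as pd

pd.set_option("mode.copy_on_write", True)  # enable Copy-on-Write globally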

Installed Versions

INSTALLED VERSIONS

commit : 0f43794
python : 3.10.8.final.0
python-bits : 64
OS : Darwin
OS-release : 22.5.0
Version : Darwin Kernel Version 22.5.0: Thu Jun 8 22:22:20 PDT 2023; root:xnu-8796.121.3~7/RELEASE_ARM64_T6000
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : None.UTF-8

pandas : 2.0.3
numpy : 1.25.2
pytz : 2023.3
dateutil : 2.8.2
setuptools : 63.2.0
pip : 22.2.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

Prior Performance

No response

hkad98 added the Needs Triage and Performance labels on Aug 2, 2023
phofl added the Reshaping label on Aug 6, 2023
phofl (Member) commented Aug 6, 2023

Hm, I can't reproduce this, also on an ARM Mac:

 29    141.8 MiB     34.1 MiB           1       grouped_df.unstack("F")

Can you provide a simpler reproducer? Fewer columns and without groupby.
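
For illustration, a stripped-down reproducer of the requested shape might look like the sketch below (my construction, reusing the "D" and "F" cardinalities from the original example; it draws unique (D, F) pairs so unstack does not fail on duplicate index entries):

import numpy as np
import pandas as pd

n = 63531
# Sample unique (D, F) pairs from the 2311 x 280 grid.
pairs = np.random.choice(2311 * 280, size=n, replace=False)
idx = pd.MultiIndex.from_arrays([pairs // 280, pairs % 280], names=["D", "F"])
s = pd.Series(np.arange(n, dtype="int64"), index=idx)
s.unstack("F")  # the line to profile with memory_profiler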

phofl added the Needs Info label and removed the Needs Triage label on Aug 6, 2023
hkad98 (Author) commented Aug 6, 2023

@phofl I will try to find a simpler reproducer. May I ask how grouped_df.unstack("F") ended up on line 29 of your profile? Did you add any lines? Which pandas version did you use? And what is wrong with groupby?

phofl (Member) commented Aug 6, 2023

I added one line to activate/deactivate CoW; otherwise the script was copied as is. Examples should always be as simple as possible, i.e. contain no unnecessary operations. If your problem occurs in unstack, then the groupby is unnecessary.

hkad98 (Author) commented Aug 6, 2023

@phofl what about the following code?

import pandas as pd
from memory_profiler import profile

@profile
def main():
    df = pd.read_parquet("reproducer.parquet")
    df.unstack("F")


if __name__ == "__main__":
    main()

I put reproducer.parquet in archiv.zip.
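
The file was presumably written from the grouped frame of the original reproducer ("F" has to be an index level for df.unstack("F") to work), so its creation was likely something like this (my assumption):

grouped_df.to_parquet("reproducer.parquet")  # pyarrow preserves the MultiIndex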

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    11     77.0 MiB     77.0 MiB           1   @profile
    12                                         def main():
    13    115.3 MiB     38.3 MiB           1       df = pd.read_parquet("reproducer.parquet")
    14    290.0 MiB    174.8 MiB           1       df.unstack("F")

Still getting huge memory consumption for unstack.

phofl (Member) commented Aug 6, 2023

Nope, still can't reproduce:

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     4    146.7 MiB    146.7 MiB           1   @profile
     5                                         def main():
     6    178.8 MiB     32.1 MiB           1       df = pd.read_parquet("reproducer.parquet")
     7    215.9 MiB     37.0 MiB           1       df.unstack("F")

hkad98 (Author) commented Aug 6, 2023

That is strange. Which version of pandas did you use?

phofl (Member) commented Aug 6, 2023

Tried on main and on 2.0.3.

hkad98 (Author) commented Aug 7, 2023

Hi @phofl, I tried to reproduce the issue independently, and I think I succeeded. See the following runs in my public repo.

Ubuntu: https://github.com/hkad98/pandas-reproducer/actions/runs/5784679774/job/15675843719
MacOS: https://github.com/hkad98/pandas-reproducer/actions/runs/5784759695/job/15676084594

Note that both runners use x86.

Unstack increment:

  • Ubuntu: 5.2 MiB
  • macOS: 29.2 MiB (~6 times more)

Unfortunately, GitHub does not provide runners with ARM architecture. I tried running the same script locally on Ubuntu + ARM, and the results were the same as on Ubuntu with x86. I think the issue lies in the combination of macOS 13 and ARM.

mroeschke (Member) commented

Seems like the original issue wasn't directly reproducible, so closing.
