PERF: Melt 2x slower when future.infer_string option enabled #59657

maver1ck · 2024-08-30T07:27:56Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this issue exists on the latest version of pandas.
I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np

# This configuration option makes this code slow
pd.options.future.infer_string = True

# Define dimensions
n_rows = 10000
n_cols = 10000

# Generate sequential string IDs for the rows
ids = [f"string_id_{i}" for i in range(1, n_rows + 1)]

# Generate a dense float matrix in which roughly 10% of the values are non-NaN
data = np.random.choice([np.nan, 1], size=(n_rows, n_cols), p=[0.9, 0.1])

# Create a DataFrame from the matrix and add the 'Id' column
df = pd.DataFrame(data, columns=[f"column_name_{i}" for i in range(1, n_cols + 1)])
df.insert(0, 'Id', ids)

# Melt the DataFrame
df_melted = df.melt(id_vars=['Id'], var_name='Column', value_name='Value')

# Display the first few rows of the melted DataFrame
df_melted.head()

Installed Versions

INSTALLED VERSIONS
------------------
commit                : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140
python                : 3.12.5.final.0
python-bits           : 64
OS                    : Darwin
OS-release            : 23.6.0
Version               : Darwin Kernel Version 23.6.0: Mon Jul 29 21:14:30 PDT 2024; root:xnu-10063.141.2~1/RELEASE_ARM64_T6000
machine               : arm64
processor             : arm
byteorder             : little
LC_ALL                : None
LANG                  : pl_PL.UTF-8
LOCALE                : pl_PL.UTF-8

pandas                : 2.2.2
numpy                 : 2.1.0
pytz                  : 2024.1
dateutil              : 2.9.0.post0
setuptools            : 73.0.1
pip                   : 24.1.2
Cython                : None
pytest                : 8.3.2
hypothesis            : None
sphinx                : None
blosc                 : None
feather               : None
xlsxwriter            : None
lxml.etree            : None
html5lib              : None
pymysql               : None
psycopg2              : None
jinja2                : 3.1.4
IPython               : 8.26.0
pandas_datareader     : None
adbc-driver-postgresql: None
adbc-driver-sqlite    : None
bs4                   : 4.12.3
bottleneck            : None
dataframe-api-compat  : None
fastparquet           : None
fsspec                : 2024.6.1
gcsfs                 : None
matplotlib            : None
numba                 : None
numexpr               : None
odfpy                 : None
openpyxl              : None
pandas_gbq            : None
pyarrow               : 17.0.0
pyreadstat            : None
python-calamine       : None
pyxlsb                : None
s3fs                  : 2024.6.1
scipy                 : None
sqlalchemy            : None
tables                : None
tabulate              : None
xarray                : None
xlrd                  : None
zstandard             : None
tzdata                : 2024.1
qtpy                  : None
pyqt5                 : None

Prior Performance

This code with pd.options.future.infer_string = False runs in:
5.23 s ± 1.35 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
Memory consumption is around 14 GB.

Enabling pd.options.future.infer_string = True makes it about 2 times slower:
10.6 s ± 40.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Memory consumption is also higher, with a peak around 25 GB.


jorisvandenbossche · 2024-09-02T20:11:00Z

@maver1ck Thanks for the report!

On main (and on my laptop), I see:

In [20]: pd.options.future.infer_string = False

In [21]: df = pd.DataFrame(data, columns=[f"column_name_{i}" for i in range(1, n_cols + 1)])

In [22]: df.insert(0, 'Id', ids)

In [23]: %timeit df_melted = df.melt(id_vars=['Id'], var_name='Column', value_name='Value')
6.25 s ± 944 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [24]: pd.options.future.infer_string = True

In [25]: df = pd.DataFrame(data, columns=[f"column_name_{i}" for i in range(1, n_cols + 1)])

In [26]: df.insert(0, 'Id', ids)

In [27]: %timeit df.melt(id_vars=['Id'], var_name='Column', value_name='Value')
3.55 s ± 169 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So for me it is actually almost two times faster (I didn't check memory usage, though).

jorisvandenbossche · 2024-09-02T20:20:29Z

Testing with the released pandas 2.2.2, I do see that it is slower with pd.options.future.infer_string = True. So it seems we have fixed something on main in the meantime.

maver1ck · 2024-09-30T11:02:53Z

The same problem exists in pandas 2.2.3.
So my understanding is that this will be fixed in 3.0?
@jorisvandenbossche, is that correct?

jorisvandenbossche · 2024-12-01T12:57:28Z

The reason this is faster on main / 3.0 is #57205. That change is not backported, so I just created a smaller, more targeted fix that we can backport to 2.3.x and that also resolves this performance issue with infer_string: #60461.
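
The fix in #60461 is titled after `construct_1d_object_array_from_listlike`, but the thread does not show its code. As a rough illustration of the general problem such a helper addresses (building a 1-D object array whose elements are themselves list-likes), here is a minimal sketch. The function `object_array_from_listlike` is a hypothetical stand-in written for this example. It is not the pandas implementation and not the change made in #60461.

```python
import numpy as np


def object_array_from_listlike(values):
    """Build a 1-D object array holding each item of `values` as one element.

    Hypothetical illustration only, not pandas internals.
    """
    result = np.empty(len(values), dtype=object)
    # Assign element by element so NumPy never tries to unpack nested
    # sequences into extra dimensions.
    for i, item in enumerate(values):
        result[i] = item
    return result


nested = [[1, 2], [3, 4]]

# Letting NumPy infer the shape turns the nested lists into a 2-D array.
inferred = np.array(nested, dtype=object)

# The helper keeps each inner list as a single element of a 1-D array.
flat = object_array_from_listlike(nested)
```

Here `inferred.shape` is `(2, 2)` while `flat.shape` is `(2,)`, which is the distinction a 1-D constructor has to preserve.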

maver1ck added the Needs Triage and Performance labels Aug 30, 2024

maver1ck changed the title ~~PERF:~~ PERF: Melt 2x slower when future.infer_string option enabled Aug 30, 2024

jorisvandenbossche added the Reshaping and Strings labels and removed the Needs Triage label Sep 2, 2024

mike-jn mentioned this issue Nov 28, 2024

RLS: 2.3 #59664

Open

5 tasks

jorisvandenbossche mentioned this issue Dec 1, 2024

PERF: improve construct_1d_object_array_from_listlike #60461

Merged

5 tasks

mroeschke closed this as completed in #60461 Dec 3, 2024

TendouArisu mentioned this issue Feb 16, 2025

Potential performance issue: Unreliable performance of df.melt in pandas 2.2.2 GeoscienceAustralia/dea-intertidal#103

Closed
