PERF: Melt 2x slower when future.infer_string option enabled #59657

Closed
3 tasks done
maver1ck opened this issue Aug 30, 2024 · 4 comments · Fixed by #60461
Labels
Performance Memory or execution speed performance Reshaping Concat, Merge/Join, Stack/Unstack, Explode Strings String extension data type and string data

Comments

@maver1ck
Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np

# This configuration option makes this code slow
pd.options.future.infer_string = True

# Define dimensions
n_rows = 10000
n_cols = 10000

# Generate sequential string IDs for the rows
ids = [f"string_id_{i}" for i in range(1, n_rows + 1)]

# Generate a random sparse matrix with 10% non-NaN values
data = np.random.choice([np.nan, 1], size=(n_rows, n_cols), p=[0.9, 0.1])

# Create a DataFrame from the sparse matrix and add the 'Id' column
df = pd.DataFrame(data, columns=[f"column_name_{i}" for i in range(1, n_cols + 1)])
df.insert(0, 'Id', ids)

# Melt the DataFrame
df_melted = df.melt(id_vars=['Id'], var_name='Column', value_name='Value')

# Display the first few rows of the melted DataFrame
df_melted.head()

Installed Versions

INSTALLED VERSIONS
------------------
commit                : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140
python                : 3.12.5.final.0
python-bits           : 64
OS                    : Darwin
OS-release            : 23.6.0
Version               : Darwin Kernel Version 23.6.0: Mon Jul 29 21:14:30 PDT 2024; root:xnu-10063.141.2~1/RELEASE_ARM64_T6000
machine               : arm64
processor             : arm
byteorder             : little
LC_ALL                : None
LANG                  : pl_PL.UTF-8
LOCALE                : pl_PL.UTF-8

pandas                : 2.2.2
numpy                 : 2.1.0
pytz                  : 2024.1
dateutil              : 2.9.0.post0
setuptools            : 73.0.1
pip                   : 24.1.2
Cython                : None
pytest                : 8.3.2
hypothesis            : None
sphinx                : None
blosc                 : None
feather               : None
xlsxwriter            : None
lxml.etree            : None
html5lib              : None
pymysql               : None
psycopg2              : None
jinja2                : 3.1.4
IPython               : 8.26.0
pandas_datareader     : None
adbc-driver-postgresql: None
adbc-driver-sqlite    : None
bs4                   : 4.12.3
bottleneck            : None
dataframe-api-compat  : None
fastparquet           : None
fsspec                : 2024.6.1
gcsfs                 : None
matplotlib            : None
numba                 : None
numexpr               : None
odfpy                 : None
openpyxl              : None
pandas_gbq            : None
pyarrow               : 17.0.0
pyreadstat            : None
python-calamine       : None
pyxlsb                : None
s3fs                  : 2024.6.1
scipy                 : None
sqlalchemy            : None
tables                : None
tabulate              : None
xarray                : None
xlrd                  : None
zstandard             : None
tzdata                : 2024.1
qtpy                  : None
pyqt5                 : None

Prior Performance

This code with pd.options.future.infer_string = False runs in:
5.23 s ± 1.35 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
Memory consumption is around 14 GB.

Enabling pd.options.future.infer_string = True makes it 2 times slower:
10.6 s ± 40.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Memory consumption is also higher, peaking at around 25 GB.

@maver1ck maver1ck added Needs Triage Issue that has not been reviewed by a pandas team member Performance Memory or execution speed performance labels Aug 30, 2024
@maver1ck maver1ck changed the title from "PERF:" to "PERF: Melt 2x slower when future.infer_string option enabled" Aug 30, 2024
@jorisvandenbossche
Member

jorisvandenbossche commented Sep 2, 2024

@maver1ck Thanks for the report!

On main (and on my laptop), I see:

In [20]: pd.options.future.infer_string = False

In [21]: df = pd.DataFrame(data, columns=[f"column_name_{i}" for i in range(1, n_cols + 1)])

In [22]: df.insert(0, 'Id', ids)

In [23]: %timeit df_melted = df.melt(id_vars=['Id'], var_name='Column', value_name='Value')
6.25 s ± 944 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [24]: pd.options.future.infer_string = True

In [25]: df = pd.DataFrame(data, columns=[f"column_name_{i}" for i in range(1, n_cols + 1)])

In [26]: df.insert(0, 'Id', ids)

In [27]: %timeit df.melt(id_vars=['Id'], var_name='Column', value_name='Value')
3.55 s ± 169 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So for me it is actually two times faster (I didn't check memory usage, though).

@jorisvandenbossche
Member

Testing with the released pandas 2.2.2, I do see that it is slower with pd.options.future.infer_string = True. So it seems we have fixed something in the meantime.

@jorisvandenbossche jorisvandenbossche added Reshaping Concat, Merge/Join, Stack/Unstack, Explode Strings String extension data type and string data and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 2, 2024
@maver1ck
Author

The same problem exists in pandas 2.2.3.
So my understanding is that this will be fixed in 3.0?
@jorisvandenbossche, is that correct?

@jorisvandenbossche
Member

This is faster on main / 3.0 because of #57205. That change is not backported, so I just created a smaller, more targeted fix that we can backport to 2.3.x and that also resolves this performance issue with infer_string: #60461
