PERF: DataFrame.round() unnecessarily slow compared to np.round() #17254

Closed
aberres opened this issue Aug 15, 2017 · 2 comments · Fixed by #51498
Labels
Performance Memory or execution speed performance

Comments

@aberres
Contributor

aberres commented Aug 15, 2017

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np

np_df = np.random.randn(10000, 4000)
df = pd.DataFrame(np_df)

%timeit np.round(np_df, 2)
# 416 ms ± 10.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df.round(2)
# 1.69 s ± 27.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit np.round(df, 2)
# 1.74 s ± 112 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Problem description

Completely unexpectedly, DataFrame.round() showed up as a major hotspot during profiling.
Looking at the code, we see that even when the complete data frame is rounded to a single number of decimals, it is split into Series objects, which are then rounded one by one.

I am wondering if there is a reason not to pass the underlying data frame to numpy and do the rounding there in this case.

A quick test showed that something like this would give us the numpy performance:

def faster_round(df, decimals):
    rounded = np.round(df.values, decimals)
    return pd.DataFrame(rounded, columns=df.columns, index=df.index)

%timeit faster_round(df, 2)
# 417 ms ± 14.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
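
The timings above rely on IPython's %timeit. As a minimal sketch for a plain Python session, the comparison can be repeated with the standard-library timeit module, together with a check that both paths return the same frame. The smaller frame size and the variable names are assumptions made for illustration, and faster_round only makes sense when every column shares one numeric dtype (with mixed dtypes, df.values would upcast, possibly to object).

```python
import timeit

import numpy as np
import pandas as pd


def faster_round(df, decimals):
    # Round the whole underlying array in one NumPy call, then re-wrap it.
    # Assumption: all columns share a single numeric dtype.
    rounded = np.round(df.values, decimals)
    return pd.DataFrame(rounded, columns=df.columns, index=df.index)


# A smaller frame than in the report, so the check runs quickly.
rng = np.random.default_rng(0)
small = pd.DataFrame(rng.standard_normal((1000, 400)))

# Both paths should produce the same values, index, and columns.
pd.testing.assert_frame_equal(faster_round(small, 2), small.round(2))

t_pandas = timeit.timeit(lambda: small.round(2), number=5)
t_numpy = timeit.timeit(lambda: faster_round(small, 2), number=5)
print(f"df.round: {t_pandas:.4f}s  faster_round: {t_numpy:.4f}s")
```

The absolute numbers will depend on the machine and on the pandas version, so no expected timings are given here.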
@gfyoung gfyoung added the Performance Memory or execution speed performance label Aug 15, 2017
@gfyoung
Member

gfyoung commented Aug 15, 2017

@aberres : Good question! Ultimately, we do end up touching the get_values method for many pandas objects, which returns an ndarray. Perhaps we could re-implement this to avoid all of the indirection, though we should be careful to ensure nothing breaks.

@jreback @jorisvandenbossche

@jreback
Contributor

jreback commented Aug 15, 2017

This is done column by column; it could instead be done per dtype, as a block. That would require some amount of work. Pull requests are welcome.
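
To illustrate the per-dtype idea from user code, here is a rough sketch that groups columns by dtype and rounds each float group with a single NumPy call. It is not how pandas handles blocks internally: round_per_dtype is a hypothetical helper, it uses only public API, and it assumes unique column labels and at least one column.

```python
import numpy as np
import pandas as pd


def round_per_dtype(df, decimals):
    # Hypothetical helper, not part of pandas.
    # Collect the column labels that share each dtype.
    groups = {}
    for col, dtype in df.dtypes.items():
        groups.setdefault(dtype, []).append(col)

    pieces = []
    for dtype, cols in groups.items():
        block = df[cols]
        # Only NumPy float dtypes are rounded; everything else passes through.
        if isinstance(dtype, np.dtype) and dtype.kind == "f":
            block = pd.DataFrame(
                np.round(block.to_numpy(), decimals),
                index=df.index,
                columns=cols,
            )
        pieces.append(block)

    # Reassemble and restore the original column order.
    return pd.concat(pieces, axis=1)[df.columns]


mixed = pd.DataFrame(
    {"a": [1.234, 5.678], "b": [1, 2], "c": ["x", "y"], "d": [0.111, 0.999]}
)
result = round_per_dtype(mixed, 2)
pd.testing.assert_frame_equal(result, mixed.round(2))
```

Columns "a" and "d" share float64, so they are rounded together in one call, while the integer and string columns are left untouched.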

@jreback jreback added this to the Next Major Release milestone Aug 15, 2017
@jreback jreback changed the title DataFrame.round() unnecessarily slow copared to np.round() PERF: DataFrame.round() unnecessarily slow copared to np.round() Aug 15, 2017
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@lithomas1 lithomas1 self-assigned this Feb 18, 2023
6 participants