PERF: groupby aggregations on pyarrow timestamp and duration types #55131

Merged
mroeschke merged 3 commits into pandas-dev:main from the arrow-datelike-groupby branch on Sep 15, 2023

Conversation

lukemanley
Member

With this PR, the example from the original issue runs in roughly the same time with pyarrow dtypes as with numpy dtypes.

import pandas as pd
import numpy as np

N = 1_000_000

df = pd.DataFrame(
    {
        "group": np.arange(N//1000).repeat(1000),
        "timestamp": pd.array(np.arange(N), dtype="timestamp[ns][pyarrow]"),
        "duration": pd.array(np.arange(N), dtype="duration[s][pyarrow]"),
    }
)

gb = df.groupby("group")

%timeit gb["timestamp"].max()
# 215 ms ± 9.72 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)   <- main
# 13 ms ± 839 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  <- PR

%timeit gb["duration"].max()
# 272 ms ± 2.72 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)     <- main
# 12.8 ms ± 341 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  <- PR

@lukemanley lukemanley added the Datetime (Datetime data dtype), Performance (Memory or execution speed performance), and Arrow (pyarrow functionality) labels Sep 13, 2023
@lukemanley lukemanley added this to the 2.2 milestone Sep 13, 2023
Member

@rhshadrach rhshadrach left a comment

Wow - nice! lgtm

@mroeschke mroeschke merged commit cc58350 into pandas-dev:main Sep 15, 2023
@mroeschke
Member

Thanks @lukemanley

hedeershowk pushed a commit to hedeershowk/pandas that referenced this pull request Sep 20, 2023
…andas-dev#55131)

* PERF: groupby aggregations on pyarrow timestamp and duration types

* mypy

* update
@lukemanley lukemanley deleted the arrow-datelike-groupby branch November 16, 2023 12:56
Labels
Arrow (pyarrow functionality), Datetime (Datetime data dtype), Groupby, Performance (Memory or execution speed performance)
Development

Successfully merging this pull request may close these issues.

PERF: Aggregations on timestamp[ns][pyarrow] extremely slow compared to datetime64[ns]
3 participants