PERF: faster constructors from ea scalars #45854

Merged
merged 7 commits into from
Feb 9, 2022

lukemanley
Member

@lukemanley lukemanley commented Feb 7, 2022

Perf improvement when constructing a DataFrame/Series from a scalar EA value.

Examples:

import pandas as pd

N = 1_000_000


%timeit pd.DataFrame({"A": pd.NA, "B": 1.0}, index=range(N), dtype=pd.Float64Dtype())
2.18 s ± 214 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)       <- main
22.4 ms ± 2.81 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)   <- PR


%timeit pd.Series(pd.NA, index=range(N), dtype=pd.Float64Dtype())
1.62 s ± 50.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)      <- main
7.33 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)  <- PR


%timeit pd.Series(1, index=range(N), dtype='Int64')
242 ms ± 21.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)      <- main
11.6 ms ± 231 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)   <- PR
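
For anyone who wants to reproduce these measurements outside IPython, here is a rough sketch using the standard `timeit` module. The `bench` helper and its parameters are hypothetical names chosen for illustration, not part of this PR, and absolute timings will vary with machine and pandas version.

```python
import timeit

import pandas as pd


def bench(n, number=3):
    """Return the best wall time in seconds for each constructor call.

    Hypothetical helper for illustration; not part of pandas or this PR.
    """
    cases = {
        "DataFrame(NA/1.0, Float64)": lambda: pd.DataFrame(
            {"A": pd.NA, "B": 1.0}, index=range(n), dtype=pd.Float64Dtype()
        ),
        "Series(NA, Float64)": lambda: pd.Series(
            pd.NA, index=range(n), dtype=pd.Float64Dtype()
        ),
        "Series(1, Int64)": lambda: pd.Series(1, index=range(n), dtype="Int64"),
    }
    # Run each callable once per repetition and keep the minimum.
    return {
        name: min(timeit.repeat(fn, number=1, repeat=number))
        for name, fn in cases.items()
    }


if __name__ == "__main__":
    for name, seconds in bench(1_000_000).items():
        print(f"{name}: {seconds * 1e3:.1f} ms")
```

Taking the minimum over a few repetitions is a deliberate simplification; `%timeit` reports mean and standard deviation instead.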

@lukemanley lukemanley added the Constructors, ExtensionArray, and Performance labels Feb 7, 2022
@@ -1652,7 +1652,12 @@ def construct_1d_arraylike_from_scalar(

     if isinstance(dtype, ExtensionDtype):
         cls = dtype.construct_array_type()
-        subarr = cls._from_sequence([value] * length, dtype=dtype)
+        if isinstance(dtype, CategoricalDtype):
+            subarr = cls._from_sequence([value] * length, dtype=dtype)
Member

What happens if you don't special-case Categorical?

Member Author

Calling take on the empty categorical would error because the categories were not defined.

I just pushed another commit which first sets the categories. Categorical now has a similar improvement:

import pandas as pd

N = 1_000_000

%timeit pd.Series(1, index=range(N), dtype='category')
415 ms ± 91.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)     <- main
4.5 ms ± 1.57 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)  <- PR
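
A small sketch of the failure mode mentioned above, using only public API. It assumes the intermediate commit filled an empty array through `take` with a fill value; that is one reading of the comment, not something the diff shown here confirms.

```python
import pandas as pd

# An empty Categorical built without explicit categories has none defined.
empty = pd.Categorical([])
assert len(empty.categories) == 0

# Filling from it with a value that is not a known category is rejected.
try:
    empty.take([-1, -1, -1], allow_fill=True, fill_value=1)
    raised = False
except (TypeError, ValueError):
    raised = True

# Once the categories are set first, the same take succeeds.
seeded = pd.Categorical([], categories=[1])
filled = seeded.take([-1, -1, -1], allow_fill=True, fill_value=1)
print(raised, list(filled), list(filled.categories))
```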

Member

> Calling take on the empty categorical would error because the categories were not defined.

Hmm, I think this might be a problem for other not-fully-initialized dtype objects. IIRC, IntervalDtype without 'closed' set is another one; I'm not sure we have a way to detect the general case.
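
To make "not-fully-initialized" concrete, here is a short sketch of the two dtypes named in this thread. It only inspects public attributes and does not attempt the general detection asked about; the variable names are illustrative.

```python
import pandas as pd

# Dtype objects created without their defining parameters.
bare_cat = pd.CategoricalDtype()
bare_interval = pd.IntervalDtype("int64")

# Both carry a placeholder where the missing parameter would be.
print(bare_cat.categories)
print(bare_interval.closed)

# Constructing from an actual scalar resolves the categories.
ser = pd.Series(1, index=range(3), dtype="category")
print(list(ser.dtype.categories))
```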

Member Author

The latest update uses .repeat, per @jreback's suggestion below.

@@ -1652,7 +1652,9 @@ def construct_1d_arraylike_from_scalar(

     if isinstance(dtype, ExtensionDtype):
         cls = dtype.construct_array_type()
-        subarr = cls._from_sequence([value] * length, dtype=dtype)
+        subarr = cls._from_sequence([value], dtype=dtype)
+        taker = np.broadcast_to(np.intp(0), length)
Contributor

@jreback jreback Feb 7, 2022

can we just use .repeat()? (I think we define that generally on EA)

Member Author

Yes, that's even better - just updated. Thanks for pointing it out.
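
For readers following along, the `.repeat` idea can be sketched roughly as below. `from_scalar_via_repeat` is a hypothetical stand-in written for this illustration; the merged code in `construct_1d_arraylike_from_scalar` may differ in its details, and the sketch only handles extension dtypes.

```python
import pandas as pd
from pandas.api.types import pandas_dtype


def from_scalar_via_repeat(value, length, dtype):
    """Build a 1-D extension array holding `length` copies of `value`.

    Hypothetical helper for illustration; assumes `dtype` resolves to
    an ExtensionDtype (for example "Int64", "Float64", "category").
    """
    dtype = pandas_dtype(dtype)  # e.g. "Int64" -> Int64Dtype()
    cls = dtype.construct_array_type()
    # Box the scalar once instead of materializing [value] * length ...
    seed = cls._from_sequence([value], dtype=dtype)
    # ... then let the array type replicate its own backing data.
    return seed.repeat(length)


ints = from_scalar_via_repeat(1, 5, "Int64")
missing = from_scalar_via_repeat(pd.NA, 5, pd.Float64Dtype())
cats = from_scalar_via_repeat(1, 5, "category")
```

Because the length-1 array is built from the actual scalar, a bare "category" dtype picks up its categories from that value before the repeat happens.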

@jreback jreback added this to the 1.5 milestone Feb 7, 2022
@jreback jreback merged commit ab6901c into pandas-dev:main Feb 9, 2022
Contributor

jreback commented Feb 9, 2022

very nice @lukemanley keep em coming!

phofl pushed a commit to phofl/pandas that referenced this pull request Feb 14, 2022
@lukemanley lukemanley deleted the ea-from-scalar branch March 2, 2022 01:13
yehoshuadimarsky pushed a commit to yehoshuadimarsky/pandas that referenced this pull request Jul 13, 2022
3 participants