PERF: faster constructors from ea scalars #45854
pandas/core/dtypes/cast.py
Outdated
@@ -1652,7 +1652,12 @@ def construct_1d_arraylike_from_scalar(

     if isinstance(dtype, ExtensionDtype):
         cls = dtype.construct_array_type()
-        subarr = cls._from_sequence([value] * length, dtype=dtype)
+        if isinstance(dtype, CategoricalDtype):
+            subarr = cls._from_sequence([value] * length, dtype=dtype)
what happens if you don't special case Categorical?
Calling take on the empty categorical would error because the categories were not defined.
I just pushed another commit which first sets the categories. Categorical now has a similar improvement:
import pandas as pd
N = 1_000_000
%timeit pd.Series(1, index=range(N), dtype='category')
415 ms ± 91.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) <- main
4.5 ms ± 1.57 ms per loop (mean ± std. dev. of 7 runs, 100 loops each) <- PR
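For anyone without IPython, the comparison above can be approximated with the standard timeit module. This is only a sketch using public API: it contrasts building a Categorical from a full Python list with building a one-element Categorical and repeating it. That contrast is an assumption about where the speedup comes from, not the exact code path inside construct_1d_arraylike_from_scalar, and N is reduced to keep the run short.

```python
import timeit

import pandas as pd

N = 100_000
value = 1


def from_full_list():
    # Materialize N Python objects, then build the Categorical from all of them.
    return pd.Categorical([value] * N)


def from_single_then_repeat():
    # Build a one-element Categorical, then repeat it N times.
    return pd.Categorical([value]).repeat(N)


slow = from_full_list()
fast = from_single_then_repeat()

# Both routes should produce the same codes and categories.
assert len(fast) == N
assert (slow.codes == fast.codes).all()
assert list(slow.categories) == list(fast.categories)

# Timings vary by machine and pandas version, so none are quoted here.
print("list  :", timeit.timeit(from_full_list, number=5))
print("repeat:", timeit.timeit(from_single_then_repeat, number=5))
```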
Calling take on the empty categorical would error because the categories were not defined.
Hmm, I think this might be a problem for other not-fully-initialized dtype objects. IIRC IntervalDtype without 'closed' set is another one; not sure if we have a way to detect the general case.
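To make the concern concrete, here is a small sketch of what "not fully initialized" looks like for the two dtypes mentioned. It only inspects public attributes and assumes a pandas version in which IntervalDtype exposes closed. It does not propose a general detection mechanism.

```python
import pandas as pd

# A CategoricalDtype created without categories is only partially specified.
cat_dtype = pd.CategoricalDtype()
print(cat_dtype.categories)

# An IntervalDtype created from a subtype alone has no 'closed' side yet.
interval_dtype = pd.IntervalDtype("int64")
print(interval_dtype.closed)

# Once real data is supplied, the missing categories are inferred from it.
filled = pd.Categorical([1], dtype=cat_dtype)
print(filled.categories)
```

An empty array built from such a dtype has nothing to infer the missing pieces from, which is presumably why starting from a one-element array holding the scalar is safer.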
Latest update uses .repeat per @jreback's suggestion below.
pandas/core/dtypes/cast.py
Outdated
@@ -1652,7 +1652,9 @@ def construct_1d_arraylike_from_scalar(

     if isinstance(dtype, ExtensionDtype):
         cls = dtype.construct_array_type()
-        subarr = cls._from_sequence([value] * length, dtype=dtype)
+        subarr = cls._from_sequence([value], dtype=dtype)
+        taker = np.broadcast_to(np.intp(0), length)
can we just use .repeat()? (I think we define that generally on EA)
Yes, that's even better - just updated. Thanks for pointing it out.
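For reference, a minimal sketch of the two routes discussed in this thread, written against the public pd.array, take, and repeat API instead of the internal _from_sequence call. The Int64 dtype and the value 7 are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

length = 5
arr = pd.array([7], dtype="Int64")  # one-element extension array

# take-based route: a read-only, zero-stride index array that points
# at position 0 `length` times without allocating `length` integers.
taker = np.broadcast_to(np.intp(0), length)
via_take = arr.take(taker)

# repeat-based route: same result, no index array needed.
via_repeat = arr.repeat(length)

assert taker.strides == (0,)
assert list(via_take) == [7] * length
assert list(via_repeat) == [7] * length
```

Both produce the same array. repeat states the intent directly and avoids building the taker at all.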
very nice @lukemanley keep em coming!
Perf improvement when constructing a DataFrame/Series from a scalar EA (extension array) value.
Examples: