PERF: faster constructors from ea scalars #45854

lukemanley · 2022-02-07T04:03:08Z

Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Performance improvement when constructing a DataFrame or Series from a scalar with an extension array (EA) dtype.

Examples:

import pandas as pd

N = 1_000_000


%timeit pd.DataFrame({"A": pd.NA, "B": 1.0}, index=range(N), dtype=pd.Float64Dtype())
2.18 s ± 214 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)       <- main
22.4 ms ± 2.81 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)   <- PR


%timeit pd.Series(pd.NA, index=range(N), dtype=pd.Float64Dtype())
1.62 s ± 50.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)      <- main
7.33 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)  <- PR


%timeit pd.Series(1, index=range(N), dtype='Int64')
242 ms ± 21.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)      <- main
11.6 ms ± 231 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)   <- PR
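
For anyone wanting to reproduce a timing like the ones above outside IPython, here is a minimal sketch using the standard-library `timeit` module. The helper name `build_series`, the smaller `N`, and the run count are assumptions made for illustration; the absolute numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

# Assumption: smaller than the 1_000_000 used above so the script finishes quickly.
N = 100_000


def build_series():
    # Same call shape as the second benchmark above, only with a smaller N.
    return pd.Series(pd.NA, index=range(N), dtype=pd.Float64Dtype())


runs = 5
# timeit.timeit returns the total time for `number` calls, so divide for a per-call mean.
seconds_per_call = timeit.timeit(build_series, number=runs) / runs
print(f"pd.Series(pd.NA, ..., dtype=Float64): {seconds_per_call * 1e3:.2f} ms per call")
```

Running the same script on both branches gives a rough main-versus-PR comparison without the `%timeit` magic.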

jbrockmendel · 2022-02-07T21:42:04Z

pandas/core/dtypes/cast.py

@@ -1652,7 +1652,12 @@ def construct_1d_arraylike_from_scalar(

    if isinstance(dtype, ExtensionDtype):
        cls = dtype.construct_array_type()
-        subarr = cls._from_sequence([value] * length, dtype=dtype)
+        if isinstance(dtype, CategoricalDtype):
+            subarr = cls._from_sequence([value] * length, dtype=dtype)


what happens if you don't special case Categorical?

Calling take on the empty categorical would error because the categories were not defined.

I just pushed another commit which first sets the categories. Categorical now has a similar improvement:

import pandas as pd

N = 1_000_000


%timeit pd.Series(1, index=range(N), dtype='category')
415 ms ± 91.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)      <- main
4.5 ms ± 1.57 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)   <- PR

> Calling take on the empty categorical would error because the categories were not defined.

Hmm, I think this might be a problem for other not-fully-initialized dtype objects. IIRC IntervalDtype without 'closed' set is another one. Not sure if we have a way to detect the general case.
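
The failure mode discussed in this thread can be sketched roughly as follows. The intermediate commit is not shown here, so this is only one plausible reading: it assumes the scalar was filled in via `take(..., allow_fill=True, fill_value=value)` on an empty array. The names `empty`, `failed`, and `filled` are illustrative, and the exact exception type may vary across pandas versions.

```python
import pandas as pd

# A CategoricalDtype with no categories set: a "not fully initialized" dtype.
dtype = pd.CategoricalDtype()
empty = pd.Categorical([], dtype=dtype)

# Filling with a value that is not among the (empty) categories is rejected.
try:
    empty.take([-1, -1, -1], allow_fill=True, fill_value=1)
    failed = False
except (TypeError, ValueError):
    failed = True

# Setting the categories first makes the same take succeed.
with_categories = pd.Categorical([], categories=[1])
filled = with_categories.take([-1, -1, -1], allow_fill=True, fill_value=1)
print(failed, list(filled))
```

The second half mirrors the "first sets the categories" fix mentioned above, again only as an approximation of what that commit did.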

latest update uses .repeat per @jreback's suggestion below.

jreback · 2022-02-07T23:50:49Z

pandas/core/dtypes/cast.py

@@ -1652,7 +1652,9 @@ def construct_1d_arraylike_from_scalar(

    if isinstance(dtype, ExtensionDtype):
        cls = dtype.construct_array_type()
-        subarr = cls._from_sequence([value] * length, dtype=dtype)
+        subarr = cls._from_sequence([value], dtype=dtype)
+        taker = np.broadcast_to(np.intp(0), length)


can we just use .repeat()? (I think we define that generally on EA)

Yes, that's even better - just updated. Thanks for pointing it out.
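
A minimal sketch of the `.repeat` approach, assuming it amounts to building a length-1 array with `_from_sequence` (as in the diff above) and repeating it. The function name `from_scalar_via_repeat` is hypothetical, and the merged code in `construct_1d_arraylike_from_scalar` may differ in detail.

```python
import pandas as pd


def from_scalar_via_repeat(value, length, dtype):
    # Hypothetical helper, not the pandas implementation.
    dtype = pd.api.types.pandas_dtype(dtype)
    cls = dtype.construct_array_type()
    # Build a one-element array, so per-element conversion happens once...
    one = cls._from_sequence([value], dtype=dtype)
    # ...then expand it to the requested length.
    return one.repeat(length)


ints = from_scalar_via_repeat(1, 5, "Int64")
missing = from_scalar_via_repeat(pd.NA, 3, "Float64")
cats = from_scalar_via_repeat("a", 4, "category")
print(ints, missing, cats, sep="\n")
```

Because the one-element array is constructed from the actual value, a categorical built this way already has its category defined, which sidesteps the special case raised in the earlier review thread.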

jreback · 2022-02-09T00:28:49Z

very nice @lukemanley keep em coming!

lukemanley added 2 commits February 6, 2022 22:34

faster constructors from ea scalars

f02b294

handle categoricals

8b45e67

lukemanley added the Constructors, ExtensionArray, and Performance labels Feb 7, 2022

whatsnew

9f58d27

jbrockmendel reviewed Feb 7, 2022


lukemanley added 2 commits February 7, 2022 18:12

faster categorical constructor

660ee13

cleanup

71fa91e

jreback reviewed Feb 7, 2022


jreback added this to the 1.5 milestone Feb 7, 2022

lukemanley added 2 commits February 7, 2022 19:05

use .repeat

8e3f506

fix failing tests

98c3406

jreback merged commit ab6901c into pandas-dev:main Feb 9, 2022

phofl pushed a commit to phofl/pandas that referenced this pull request Feb 14, 2022

PERF: faster constructors from ea scalars (pandas-dev#45854)

5063ad9

lukemanley deleted the ea-from-scalar branch March 2, 2022 01:13

yehoshuadimarsky pushed a commit to yehoshuadimarsky/pandas that referenced this pull request Jul 13, 2022

PERF: faster constructors from ea scalars (pandas-dev#45854)

2dd3128
