PERF: get_dummies create array of 0/1s with appropriate layout for internals #40299


Merged: 2 commits into pandas-dev:master on Mar 15, 2021

Conversation

jorisvandenbossche (Member):

This ensures that the 2D ndarray that is created in get_dummies for a single column is directly in the "right" layout as the internals prefer it (column-major, so it's row-major in the transposed Block.values).
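For illustration, a minimal sketch of the layout difference (a toy example of mine; the variable names are not from the diff):

import numpy as np

codes = np.array([0, 2, 1, 0])

old = np.eye(3, dtype=np.uint8).take(codes, axis=0)    # row gather: C-contiguous result
new = np.eye(3, dtype=np.uint8).take(codes, axis=1).T  # column gather + transpose: F-contiguous

old.flags["F_CONTIGUOUS"]  # False: the rows are contiguous instead
new.flags["F_CONTIGUOUS"]  # True: each dummy column is contiguous in memory
np.array_equal(old, new)   # True: identical values either way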

For ArrayManager, this gives a speed-up for get_dummies itself (because, currently, we still copy the arrays to ensure a proper contiguous memory layout for each column):

In [7]: %timeit pd.get_dummies(s, sparse=False)
23.2 ms ± 1.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)  <-- master
16.6 ms ± 689 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  <-- PR

For the current BlockManager, it makes the actual get_dummies a bit slower, but can make a follow-up operation faster (one where a copy, consolidation, reduction, etc. would happen):

In [4]: %timeit pd.get_dummies(s, sparse=False)
10.7 ms ± 397 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  <-- master
14.2 ms ± 713 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  <-- PR

In [5]: %timeit pd.get_dummies(s, sparse=False).copy()
22.3 ms ± 260 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)  <-- master
17.8 ms ± 1.92 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)  <-- PR

The timings come from the following example, taken from the reshape.py::GetDummies ASV case:

import string

import numpy as np
import pandas as pd

categories = list(string.ascii_letters[:12])
s = pd.Series(
    np.random.choice(categories, size=1000000),
    dtype=pd.CategoricalDtype(categories),
)

So it makes get_dummies itself a bit slower for the BlockManager, and thus I could also do this explicitly only for the ArrayManager. But since it improves follow-up operations, I think it can actually be beneficial overall for the BlockManager as well. Either way is fine with me.

@jorisvandenbossche added the Performance, Reshaping, and Internals labels on Mar 8, 2021
@@ -1014,7 +1014,8 @@ def get_empty_frame(data) -> DataFrame:
         return out

     else:
-        dummy_mat = np.eye(number_of_cols, dtype=dtype).take(codes, axis=0)
+        # take on axis=1 + transpose to ensure ndarray layout is column-major
+        dummy_mat = np.eye(number_of_cols, dtype=dtype).take(codes, axis=1).T
Contributor:

use the order parameter here: https://numpy.org/doc/stable/reference/generated/numpy.eye.html, or maybe .reshape (as .take gives a copy)

jorisvandenbossche (Member Author):

To clarify: the take is needed for the actual get_dummies operation.
But given your suggestion, I thought that specifying order in eye might speed up the take call; from a quick test, it doesn't seem to matter.
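A minimal reproduction of that quick test (the category count and array size are my assumptions, matching the snippet in the top post):

import numpy as np

codes = np.random.randint(0, 12, size=1_000_000)

# %timeit np.eye(12, dtype=np.uint8).take(codes, axis=1).T
# %timeit np.eye(12, dtype=np.uint8, order="F").take(codes, axis=1).T
# take allocates a fresh output array regardless of the identity matrix's
# memory order, so order="F" on the small eye barely changes the cost.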

Contributor:

guess this is fine

@jreback added this to the 1.3 milestone on Mar 8, 2021
jorisvandenbossche (Member Author):

@jreback any further feedback here?

@jreback merged commit 154096f into pandas-dev:master on Mar 15, 2021
@jorisvandenbossche deleted the am-get-dummies branch on March 15, 2021 at 14:13
jorisvandenbossche (Member Author):

Thanks!

simonjayhawkins (Member):

> So it makes get_dummies itself a bit slower for the BlockManager, and thus I could also do this explicitly only for the ArrayManager. But since it improves follow-up operations, I think it can actually be beneficial overall for the BlockManager as well. Either way is fine with me.

https://pandas.pydata.org/speed/pandas/#reshape.GetDummies.time_get_dummies_1d?python=3.8&Cython=0.29.21&commits=b8a71f33-154096f2

JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021
@@ -1039,7 +1039,8 @@ def get_empty_frame(data) -> DataFrame:
         return out

     else:
-        dummy_mat = np.eye(number_of_cols, dtype=dtype).take(codes, axis=0)
+        # take on axis=1 + transpose to ensure ndarray layout is column-major
+        dummy_mat = np.eye(number_of_cols, dtype=dtype).take(codes, axis=1).T
Member:

@jorisvandenbossche investigating a perf regression in reshape.GetDummies.time_get_dummies_1d; my current best guess is this line. Would doing arr.T.take(codes, axis=0) be equivalent here? That's about 7x faster for me locally.
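A minimal sketch of the two variants being compared (sizes are my assumption, taken from the setup in the top post):

import numpy as np

codes = np.random.randint(0, 12, size=1_000_000)
eye = np.eye(12, dtype=np.uint8)

# current code: gather along the last axis, then transpose
# %timeit eye.take(codes, axis=1).T
# proposed: transpose first, then gather whole rows; take along axis=0
# is typically much faster in numpy, hence the local speed-up reported
# %timeit eye.T.take(codes, axis=0)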

Member:

@jorisvandenbossche gentle ping; time_get_dummies_1d is the biggest perf regression I'm seeing vs 1.2.0

Member:

@jorisvandenbossche gentle ping

jorisvandenbossche (Member Author), Oct 13, 2021:

Can you point to the performance regression? Is it something that showed up in the online ASV benchmarks, or just the one that Simon mentioned above?

Note that I mentioned above that this will actually cause a slowdown for the BlockManager if you isolate get_dummies, so some performance regression was anticipated (see the timings in the top post, which include both get_dummies alone and get_dummies + copy). As I noted back then, we could also special-case this to do it only for the ArrayManager.

Member:

It's the same one Simon linked, yes, but I see a much worse deterioration locally (likely measuring from a different baseline):

from asv_bench.benchmarks.reshape import *
setup()
self = GetDummies()
self.setup()

%timeit self.time_get_dummies_1d()
5.79 ms ± 426 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <- 1.2.0
43.1 ms ± 842 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)  # <- master

Profiling with %prun -s cumtime for n in range(100): self.time_get_dummies_1d() traces essentially the whole change back to this .take.

jorisvandenbossche (Member Author), Oct 13, 2021:

I see:

In [2]: %timeit pd.get_dummies(s, sparse=False)
4.22 ms ± 452 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <-- 1.2.5
12.6 ms ± 1.88 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <-- 1.3.3
12.7 ms ± 559 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <-- master

In [3]: %timeit pd.get_dummies(s, sparse=False).copy()
14.7 ms ± 421 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <-- 1.2.5
13.6 ms ± 452 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <-- 1.3.3
13.2 ms ± 215 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <-- master

(s is defined in the snippet in the top post, but it's the same as in the ASV benchmark you ran)

Member:

How does using arr.T.take(codes, axis=0) affect these for you? Locally, making that change does make the .copy slower, but by a much smaller margin than get_dummies becomes faster:

In [3]: %timeit pd.get_dummies(s, sparse=False)
43 ms ± 812 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)  # <- master
6.36 ms ± 174 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <- transpose-before-take


In [4]: %timeit pd.get_dummies(s, sparse=False).copy()
47.6 ms ± 835 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)  # <- master
15.7 ms ± 554 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <- transpose-before-take

jorisvandenbossche (Member Author):

> How does using arr.T.take(codes, axis=0) affect these for you?

See the explanation in the top post of this PR. The intent (to be clear, I'm just rereading what I wrote back then; I didn't check again in detail) was to ensure that the data is in the best memory layout for pandas' internals.
This was more important for the ArrayManager, because there each column always needs to be contiguous, while a 2D block can be stored either way. But even for 2D blocks, some operations will be faster afterwards if the layout is column-major (i.e. row-major in the transposed way we store it in the Block). That's also why I mentioned that we could add an if/else here to do something different for ArrayManager vs BlockManager.

But based on the timings that I got (and still get), I didn't see a strong reason to do so. You are getting quite different timings, which of course gives a different impression; I have no idea why it would differ that much (what version of numpy are you using? The timings I did yesterday are with 1.20 or 1.21, depending on the pandas version).
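The special-casing mentioned here might look roughly like this (a hypothetical sketch of mine, not code from the PR; it assumes the mode.data_manager option that pandas 1.3 used to select between the two managers):

import numpy as np
import pandas as pd

def _dummy_mat(number_of_cols, codes, dtype):
    # Hypothetical special-casing, not what the merged PR does.
    if pd.get_option("mode.data_manager") == "array":
        # ArrayManager: every column must be contiguous, so build a
        # column-major (F-contiguous) result.
        return np.eye(number_of_cols, dtype=dtype).take(codes, axis=1).T
    # BlockManager: the 2D Block can hold either layout, so use the
    # faster row gather.
    return np.eye(number_of_cols, dtype=dtype).take(codes, axis=0)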

Member:

OK, I somehow got it in my head that doing the .T before the take would keep the result F-ordered, but apparently not. Thanks for taking a look.
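A quick check of that (my own snippet; since the identity matrix is symmetric the values match, but the resulting layout differs):

import numpy as np

codes = np.array([0, 2, 1, 0])
eye = np.eye(3, dtype=np.uint8)

a = eye.take(codes, axis=1).T   # transpose after take: F-contiguous
b = eye.T.take(codes, axis=0)   # transpose before take: C-contiguous

np.array_equal(a, b)       # True: same dummy values either way
a.flags["F_CONTIGUOUS"]    # True
b.flags["F_CONTIGUOUS"]    # False: take returns a fresh C-ordered array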

@simonjayhawkins a good-ish explanation here would be that the results I'm getting locally are a wild outlier. Can you give it a try to test this hypothesis?
