PERF: get_dummies create array of 0/1s with appropriate layout for internals #40299


Merged: 2 commits into pandas-dev:master on Mar 15, 2021

Conversation

jorisvandenbossche (Member):

This ensures that the 2D ndarray that is created in get_dummies for a single column is directly in the "right" layout as the internals prefer it (column-major, so it's row-major in the transposed Block.values).
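For illustration, a minimal sketch of the layout difference (a toy example of mine; the variable names are not from the diff):

import numpy as np

codes = np.array([0, 2, 1, 0])

old = np.eye(3, dtype=np.uint8).take(codes, axis=0)    # row gather: C-contiguous result
new = np.eye(3, dtype=np.uint8).take(codes, axis=1).T  # column gather + transpose: F-contiguous

old.flags["F_CONTIGUOUS"]  # False: the rows are contiguous instead
new.flags["F_CONTIGUOUS"]  # True: each dummy column is contiguous in memory
np.array_equal(old, new)   # True: identical values either way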

For ArrayManager, this gives a speed-up for get_dummies itself (because, currently, we still copy the arrays to ensure a proper contiguous memory layout for each column):

In [7]: %timeit pd.get_dummies(s, sparse=False)
23.2 ms ± 1.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)  <-- master
16.6 ms ± 689 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  <-- PR

For the current BlockManager, it makes the actual get_dummies a bit slower, but can make a follow-up operation faster (one where a copy, consolidation, reduction, etc. would happen):

In [4]: %timeit pd.get_dummies(s, sparse=False)
10.7 ms ± 397 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  <-- master
14.2 ms ± 713 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  <-- PR

In [5]: %timeit pd.get_dummies(s, sparse=False).copy()
22.3 ms ± 260 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)  <-- master
17.8 ms ± 1.92 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)  <-- PR

The timings come from the following example, taken from the reshape.py::GetDummies ASV case:

import string

import numpy as np
import pandas as pd

categories = list(string.ascii_letters[:12])
s = pd.Series(
    np.random.choice(categories, size=1000000),
    dtype=pd.CategoricalDtype(categories),
)

So it makes get_dummies itself a bit slower for the BlockManager, and thus I could also do this explicitly only for the ArrayManager. But since it improves follow-up operations, I think it can actually be beneficial overall for the BlockManager as well. Either way is fine with me.

@jorisvandenbossche added the Performance, Reshaping, and Internals labels on Mar 8, 2021
@@ -1014,7 +1014,8 @@ def get_empty_frame(data) -> DataFrame:
         return out

     else:
-        dummy_mat = np.eye(number_of_cols, dtype=dtype).take(codes, axis=0)
+        # take on axis=1 + transpose to ensure ndarray layout is column-major
+        dummy_mat = np.eye(number_of_cols, dtype=dtype).take(codes, axis=1).T
Contributor:

use the order parameter here: https://numpy.org/doc/stable/reference/generated/numpy.eye.html, or maybe .reshape (as .take gives a copy)

jorisvandenbossche (Member Author):

To clarify: the take is needed for the actual get_dummies operation.
But given your suggestion, I thought that specifying order in eye might speed up the take call; from a quick test, it doesn't seem to matter.
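A minimal reproduction of that quick test (the category count and array size are my assumptions, matching the snippet in the top post):

import numpy as np

codes = np.random.randint(0, 12, size=1_000_000)

# %timeit np.eye(12, dtype=np.uint8).take(codes, axis=1).T
# %timeit np.eye(12, dtype=np.uint8, order="F").take(codes, axis=1).T
# take allocates a fresh output array regardless of the identity matrix's
# memory order, so order="F" on the small eye barely changes the cost.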

Contributor:

guess this is fine

@jreback added this to the 1.3 milestone on Mar 8, 2021
jorisvandenbossche (Member Author):

@jreback any further feedback here?

@jreback merged commit 154096f into pandas-dev:master on Mar 15, 2021
@jorisvandenbossche deleted the am-get-dummies branch on March 15, 2021 at 14:13
jorisvandenbossche (Member Author):

Thanks!

simonjayhawkins (Member):

> So it makes get_dummies itself a bit slower for the BlockManager, and thus I could also do this explicitly only for the ArrayManager. But since it improves follow-up operations, I think it can actually be beneficial overall for the BlockManager as well. Either way is fine with me.

https://pandas.pydata.org/speed/pandas/#reshape.GetDummies.time_get_dummies_1d?python=3.8&Cython=0.29.21&commits=b8a71f33-154096f2

JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021
@@ -1039,7 +1039,8 @@ def get_empty_frame(data) -> DataFrame:
         return out

     else:
-        dummy_mat = np.eye(number_of_cols, dtype=dtype).take(codes, axis=0)
+        # take on axis=1 + transpose to ensure ndarray layout is column-major
+        dummy_mat = np.eye(number_of_cols, dtype=dtype).take(codes, axis=1).T
Member:

@jorisvandenbossche investigating a perf regression in reshape.GetDummies.time_get_dummies_1d; my current best guess is this line. Would doing arr.T.take(codes, axis=0) be equivalent here? That's about 7x faster for me locally.
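A minimal sketch of the two variants being compared (sizes are my assumption, taken from the setup in the top post):

import numpy as np

codes = np.random.randint(0, 12, size=1_000_000)
eye = np.eye(12, dtype=np.uint8)

# current code: gather along the last axis, then transpose
# %timeit eye.take(codes, axis=1).T
# proposed: transpose first, then gather whole rows; take along axis=0
# is typically much faster in numpy, hence the local speed-up reported
# %timeit eye.T.take(codes, axis=0)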

Member:

@jorisvandenbossche gentle ping; time_get_dummies_1d is the biggest perf regression I'm seeing vs 1.2.0

Member:

@jorisvandenbossche gentle ping

jorisvandenbossche (Member Author), Oct 13, 2021:

Can you point to the performance regression? Is it something that showed up in the online ASV benchmarks, or just the one that Simon mentioned above?

Note that I mentioned above that this will actually cause a slowdown for the BlockManager if you isolate get_dummies, so some performance regression was anticipated (see the timings in the top post, which include both get_dummies alone and get_dummies + copy). As I noted back then, we could also special-case this to do it only for the ArrayManager.

Member:

It's the same one Simon linked, yes, but I see a much worse deterioration locally (likely measuring from a different baseline):

from asv_bench.benchmarks.reshape import *
setup()
self = GetDummies()
self.setup()

%timeit self.time_get_dummies_1d()
5.79 ms ± 426 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <- 1.2.0
43.1 ms ± 842 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)  # <- master

Profiling with %prun -s cumtime for n in range(100): self.time_get_dummies_1d() traces essentially the whole change back to this .take.

jorisvandenbossche (Member Author), Oct 13, 2021:

I see:

In [2]: %timeit pd.get_dummies(s, sparse=False)
4.22 ms ± 452 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <-- 1.2.5
12.6 ms ± 1.88 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <-- 1.3.3
12.7 ms ± 559 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <-- master

In [3]: %timeit pd.get_dummies(s, sparse=False).copy()
14.7 ms ± 421 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <-- 1.2.5
13.6 ms ± 452 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <-- 1.3.3
13.2 ms ± 215 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <-- master

(s is defined in the snippet in the top post, but it's the same as in the ASV benchmark you ran)

Member:

How does using arr.T.take(codes, axis=0) affect these for you? Locally, making that change does make the .copy slower, but by a much smaller margin than get_dummies becomes faster:

In [3]: %timeit pd.get_dummies(s, sparse=False)
43 ms ± 812 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)  # <- master
6.36 ms ± 174 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <- transpose-before-take


In [4]: %timeit pd.get_dummies(s, sparse=False).copy()
47.6 ms ± 835 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)  # <- master
15.7 ms ± 554 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <- transpose-before-take

jorisvandenbossche (Member Author):

> How does using arr.T.take(codes, axis=0) affect these for you?

See the explanation in the top post of this PR. The intent (to be clear, I'm just rereading what I wrote back then; I didn't check again in detail) was to ensure that the data is in the best memory layout for pandas' internals.
This was more important for the ArrayManager, because there each column always needs to be contiguous, while a 2D block can be stored either way. But even for 2D blocks, some operations will be faster afterwards if the layout is column-major (i.e. row-major in the transposed way we store it in the Block). That's also why I mentioned that we could add an if/else here to do something different for ArrayManager vs BlockManager.

But based on the timings that I got (and still get), I didn't see a strong reason to do so. You are getting quite different timings, which of course gives a different impression; I have no idea why it would differ that much (what version of numpy are you using? The timings I did yesterday are with 1.20 or 1.21, depending on the pandas version).
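The special-casing mentioned here might look roughly like this (a hypothetical sketch of mine, not code from the PR; it assumes the mode.data_manager option that pandas 1.3 used to select between the two managers):

import numpy as np
import pandas as pd

def _dummy_mat(number_of_cols, codes, dtype):
    # Hypothetical special-casing, not what the merged PR does.
    if pd.get_option("mode.data_manager") == "array":
        # ArrayManager: every column must be contiguous, so build a
        # column-major (F-contiguous) result.
        return np.eye(number_of_cols, dtype=dtype).take(codes, axis=1).T
    # BlockManager: the 2D Block can hold either layout, so use the
    # faster row gather.
    return np.eye(number_of_cols, dtype=dtype).take(codes, axis=0)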

Member:

OK, I somehow got it in my head that doing the .T before the take would keep the result F-ordered, but apparently not. Thanks for taking a look.
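A quick check of that (my own snippet; since the identity matrix is symmetric the values match, but the resulting layout differs):

import numpy as np

codes = np.array([0, 2, 1, 0])
eye = np.eye(3, dtype=np.uint8)

a = eye.take(codes, axis=1).T   # transpose after take: F-contiguous
b = eye.T.take(codes, axis=0)   # transpose before take: C-contiguous

np.array_equal(a, b)       # True: same dummy values either way
a.flags["F_CONTIGUOUS"]    # True
b.flags["F_CONTIGUOUS"]    # False: take returns a fresh C-ordered array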

@simonjayhawkins a good-ish explanation here would be that the results I'm getting locally are a wild outlier. Can you give it a try to test this hypothesis?
