
PERF: get_dummies create array of 0/1s with appropriate layout for internals #40299


Merged: 2 commits, Mar 15, 2021
pandas/core/reshape/reshape.py (3 changes: 2 additions & 1 deletion)

@@ -1039,7 +1039,8 @@ def get_empty_frame(data) -> DataFrame:
return out

else:
-        dummy_mat = np.eye(number_of_cols, dtype=dtype).take(codes, axis=0)
+        # take on axis=1 + transpose to ensure ndarray layout is column-major
Contributor:
use the order parameter here: https://numpy.org/doc/stable/reference/generated/numpy.eye.html or maybe .reshape (as .take gives a copy)

Member Author:
To clarify: the take is needed for the actual get_dummies operation.
But given your suggestion, I thought that specifying order in eye might speed up the take call, but from a quick test that doesn't seem to matter
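That the order chosen in np.eye does not survive the take can be checked directly; a minimal sketch (the codes array is made up for illustration):

```python
import numpy as np

codes = np.array([0, 2, 1, 2, 0])  # hypothetical codes, as produced by factorization
n = 3

# np.take always allocates a fresh C-ordered copy, so the memory order
# chosen when creating the identity matrix does not carry through.
c_based = np.eye(n, order="C").take(codes, axis=0)
f_based = np.eye(n, order="F").take(codes, axis=0)

assert c_based.flags["C_CONTIGUOUS"]
assert f_based.flags["C_CONTIGUOUS"]
assert np.array_equal(c_based, f_based)
```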

Contributor:
guess this is fine

+        dummy_mat = np.eye(number_of_cols, dtype=dtype).take(codes, axis=1).T
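What this changed line does to memory layout can be sketched with made-up codes (not the benchmark data):

```python
import numpy as np

number_of_cols = 3
codes = np.array([0, 2, 1, 2, 0])  # made-up codes for illustration

# take on axis=1 gives a C-contiguous (number_of_cols, len(codes)) copy;
# transposing it then yields an F-contiguous (column-major) view, the
# layout pandas' column-oriented internals prefer.
dummy_mat = np.eye(number_of_cols, dtype=np.uint8).take(codes, axis=1).T

assert dummy_mat.shape == (len(codes), number_of_cols)
assert dummy_mat.flags["F_CONTIGUOUS"]
```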
Member:
@jorisvandenbossche investigating a perf regression in reshape.GetDummies.time_get_dummies_1d current best guess is this line. Would doing arr.T.take(codes, axis=0) be equivalent here? That's about 7x faster for me locally.
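For reference, the two variants produce the same values but differently ordered results; a sketch with made-up codes (the 7x figure above is machine-dependent):

```python
import numpy as np

codes = np.array([0, 2, 1, 2, 0])  # made-up codes for illustration
eye = np.eye(3, dtype=np.uint8)

current = eye.take(codes, axis=1).T   # take wide, then transpose: F-ordered result
proposed = eye.T.take(codes, axis=0)  # take copies, so the result is C-ordered

assert np.array_equal(current, proposed)  # identical values
assert current.flags["F_CONTIGUOUS"]
assert proposed.flags["C_CONTIGUOUS"]     # only the memory layout differs
```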

Member:
@jorisvandenbossche gentle ping; time_get_dummies_1d is the biggest perf regression I'm seeing vs 1.2.0

Member:
@jorisvandenbossche gentle ping

Member Author (jorisvandenbossche, Oct 13, 2021):
Can you point to the performance regression? Is it something that showed up in the online ASV benchmarks, or just the one that Simon mentioned above?

Note that I mentioned above that this will actually cause a slowdown for BlockManager if you isolate the get_dummies call, so some performance regression was anticipated (see the timings in the top post, which include both a timing for only get_dummies and a timing for get_dummies + copy). As I noted back then, we could also special-case this to do it only for ArrayManager.

Member:
It's the same one Simon linked, yes, but I see a much worse deterioration locally (likely because I'm measuring from a different baseline):

from asv_bench.benchmarks.reshape import *
setup()
self = GetDummies()
self.setup()

%timeit self.time_get_dummies_1d()
5.79 ms ± 426 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <- 1.2.0
43.1 ms ± 842 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)  # <- master

Profiling with `%prun -s cumtime for n in range(100): self.time_get_dummies_1d()` traces essentially the whole regression back to this .take
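Outside IPython, the same measurement can be reproduced with cProfile; this helper is a sketch (`profile_calls` is a made-up name, demonstrated here on a trivial function rather than the asv benchmark):

```python
import cProfile
import io
import pstats

def profile_calls(fn, n=100, top=10):
    """Run fn n times under cProfile and return the top entries by cumtime."""
    prof = cProfile.Profile()
    prof.enable()
    for _ in range(n):
        fn()
    prof.disable()
    buf = io.StringIO()
    pstats.Stats(prof, stream=buf).sort_stats("cumtime").print_stats(top)
    return buf.getvalue()

# e.g. profile_calls(self.time_get_dummies_1d) against the benchmark instance
report = profile_calls(lambda: sum(range(1000)))
```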

Member Author (jorisvandenbossche, Oct 13, 2021):
I see:

In [2]: %timeit pd.get_dummies(s, sparse=False)
4.22 ms ± 452 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <-- 1.2.5
12.6 ms ± 1.88 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <-- 1.3.3
12.7 ms ± 559 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <-- master

In [3]: %timeit pd.get_dummies(s, sparse=False).copy()
14.7 ms ± 421 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <-- 1.2.5
13.6 ms ± 452 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <-- 1.3.3
13.2 ms ± 215 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <-- master

(s is defined in a snippet in the top post, but it's the same as the ASV benchmark you did)

Member:
How does using arr.T.take(codes, axis=0) affect these for you? Locally, making that change does make the .copy slower, but by a much smaller margin than the speedup it gives get_dummies itself

In [3]: %timeit pd.get_dummies(s, sparse=False)
43 ms ± 812 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)  # <- master
6.36 ms ± 174 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <- transpose-before-take


In [4]: %timeit pd.get_dummies(s, sparse=False).copy()
47.6 ms ± 835 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)  # <- master
15.7 ms ± 554 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <- transpose-before-take

Member Author:
How does using arr.T.take(codes, axis=0) affect these for you?

See the explanation in the top post of this PR. The intent (I didn't check again in detail to be clear, just rereading what I wrote back then) was to ensure that the data was in the best memory layout for pandas' internals.
This was more important for the ArrayManager, because there the data always needs to be contiguous by column, while a 2D Block can store it either way. But even for 2D Blocks, some operations will be faster afterwards if the layout is column-major (i.e. row-major in the transposed way we store it in the Block). That's also why I mentioned that we could add an if/else here to do something different for ArrayManager vs BlockManager.
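The contiguity point above can be checked directly; a sketch (array sizes are arbitrary):

```python
import numpy as np

arr_f = np.zeros((1000, 50), order="F")  # column-major storage
arr_c = np.zeros((1000, 50), order="C")  # row-major storage

# Column slices of the F-ordered array are contiguous in memory, which is
# what a column store (ArrayManager, or a Block stored transposed) wants.
assert arr_f[:, 0].flags["C_CONTIGUOUS"]
assert not arr_c[:, 0].flags["C_CONTIGUOUS"]
```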

But based on the timings that I got then and now, I didn't see a strong reason to do so. You get quite different timings, of course, which gives a different impression; I have no idea why there would be such a difference (what version of numpy are you using? the timings I did yesterday were with 1.20 or 1.21, depending on the pandas version)

Member:
OK, I somehow got it in my head that doing the .T before the take would keep the result F-ordered, but apparently not. Thanks for taking a look.

@simonjayhawkins a good-ish explanation here would be that the results I'm getting locally are a wild outlier. Can you give it a try to test this hypothesis?


if not dummy_na:
# reset NaN GH4446