PERF: get_dummies create array of 0/1s with appropriate layout for internals #40299
Conversation
```diff
@@ -1014,7 +1014,8 @@ def get_empty_frame(data) -> DataFrame:
         return out

     else:
-        dummy_mat = np.eye(number_of_cols, dtype=dtype).take(codes, axis=0)
+        # take on axis=1 + transpose to ensure ndarray layout is column-major
+        dummy_mat = np.eye(number_of_cols, dtype=dtype).take(codes, axis=1).T
```
Use the `order` parameter here: https://numpy.org/doc/stable/reference/generated/numpy.eye.html, or maybe `.reshape` (since `.take` gives a copy).
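For reference, `np.eye` does take an `order` argument; a minimal standalone sketch of what the suggestion refers to:

```python
import numpy as np

# np.eye accepts an `order` argument ("C" or "F") that controls the
# memory layout of the allocated identity matrix.
ident_c = np.eye(4, dtype=np.uint8)             # row-major (default)
ident_f = np.eye(4, dtype=np.uint8, order="F")  # column-major

print(ident_c.flags["C_CONTIGUOUS"])     # True
print(ident_f.flags["F_CONTIGUOUS"])     # True
print(np.array_equal(ident_c, ident_f))  # True: same values either way
```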
To clarify: the `take` is needed for the actual get_dummies operation. But given your suggestion, I thought that specifying `order` in `eye` might speed up the `take` call; from a quick test, that doesn't seem to matter.
guess this is fine
@jreback any further feedback here?

Thanks!
```diff
@@ -1039,7 +1039,8 @@ def get_empty_frame(data) -> DataFrame:
         return out

     else:
-        dummy_mat = np.eye(number_of_cols, dtype=dtype).take(codes, axis=0)
+        # take on axis=1 + transpose to ensure ndarray layout is column-major
+        dummy_mat = np.eye(number_of_cols, dtype=dtype).take(codes, axis=1).T
```
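The effect of the changed line can be checked in isolation (a standalone sketch; the `codes` values here are made up for illustration):

```python
import numpy as np

codes = np.array([0, 2, 1, 2, 0])
number_of_cols = 3
dtype = np.uint8

# Old: row-take gives a row-major (C-contiguous) result.
row_take = np.eye(number_of_cols, dtype=dtype).take(codes, axis=0)

# New: column-take plus transpose gives the same values, but laid out
# column-major (F-contiguous), the layout pandas' internals prefer.
col_take = np.eye(number_of_cols, dtype=dtype).take(codes, axis=1).T

print(np.array_equal(row_take, col_take))  # True
print(row_take.flags["C_CONTIGUOUS"])      # True
print(col_take.flags["F_CONTIGUOUS"])      # True
```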
@jorisvandenbossche investigating a perf regression in `reshape.GetDummies.time_get_dummies_1d`; the current best guess is this line. Would doing `arr.T.take(codes, axis=0)` be equivalent here? That's about 7x faster for me locally.
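Since the identity matrix is symmetric, the two spellings do select the same values; a quick standalone check:

```python
import numpy as np

codes = np.array([0, 2, 1, 2, 0])
arr = np.eye(3, dtype=np.uint8)

pr_version = arr.take(codes, axis=1).T   # the line in this PR
alternative = arr.T.take(codes, axis=0)  # the proposed alternative

# Equal values; note the memory layout of the two results can still differ.
print(np.array_equal(pr_version, alternative))  # True
```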
@jorisvandenbossche gentle ping; `time_get_dummies_1d` is the biggest perf regression I'm seeing vs 1.2.0.
@jorisvandenbossche gentle ping
Can you point to the performance regression? Is it something that showed up in the online ASV benchmarks, or just the one that Simon mentioned above?
Note that I mentioned above that this will actually cause a slowdown for the BlockManager if you isolate the get_dummies call, so some performance regression was anticipated (see the timings in the top post, which include both a timing for only get_dummies and a timing for get_dummies + copy). As I noted back then, we could also special-case this to do it only for the ArrayManager.
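A standalone illustration of why the layout matters more for the ArrayManager case: splitting a column-major dummy matrix into per-column contiguous arrays is copy-free, while a row-major matrix is strided per column (the sizes here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
codes = rng.integers(0, 20, size=100_000)
eye = np.eye(20, dtype=np.uint8)

c_mat = eye.take(codes, axis=0)    # row-major dummy matrix
f_mat = eye.take(codes, axis=1).T  # column-major dummy matrix

# A column sliced from the F-ordered matrix is already contiguous ...
print(f_mat[:, 0].flags["C_CONTIGUOUS"])  # True

# ... while a column sliced from the C-ordered matrix is strided and
# would have to be copied to become a contiguous 1D array.
print(c_mat[:, 0].flags["C_CONTIGUOUS"])  # False
```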
It's the same one Simon linked, yes, but I see a much worse deterioration locally (likely measuring from a different baseline):

```
from asv_bench.benchmarks.reshape import *
setup()
self = GetDummies()
self.setup()
%timeit self.time_get_dummies_1d()
5.79 ms ± 426 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <- 1.2.0
43.1 ms ± 842 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)   # <- master
```

Profiling with `%prun -s cumtime for n in range(100): self.time_get_dummies_1d()` traces essentially the whole slowdown back to this `.take`.
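Outside IPython, a `%prun`-style view can be reproduced with `cProfile` directly (a sketch profiling just the `.take` pattern rather than the full benchmark):

```python
import cProfile
import io
import pstats

import numpy as np

rng = np.random.default_rng(0)
codes = rng.integers(0, 10, size=100_000)
eye = np.eye(10, dtype=np.uint8)

# Profile the column-take-plus-transpose pattern, sorted by cumulative
# time (the same ordering that `%prun -s cumtime` shows).
profiler = cProfile.Profile()
profiler.enable()
for _ in range(100):
    eye.take(codes, axis=1).T
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print("take" in report)  # True: ndarray.take shows up in the top entries
```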
I see:

```
In [2]: %timeit pd.get_dummies(s, sparse=False)
4.22 ms ± 452 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)   # <-- 1.2.5
12.6 ms ± 1.88 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <-- 1.3.3
12.7 ms ± 559 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)   # <-- master

In [3]: %timeit pd.get_dummies(s, sparse=False).copy()
14.7 ms ± 421 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)   # <-- 1.2.5
13.6 ms ± 452 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)   # <-- 1.3.3
13.2 ms ± 215 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)   # <-- master
```

(`s` is defined in a snippet in the top post, but it's the same as the ASV benchmark you did.)
How does using `arr.T.take(codes, axis=0)` affect these for you? Locally, making that change does make the `.copy` slower, but by a much smaller margin than the get_dummies itself becomes faster:

```
In [3]: %timeit pd.get_dummies(s, sparse=False)
43 ms ± 812 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)     # <- master
6.36 ms ± 174 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <- transpose-before-take

In [4]: %timeit pd.get_dummies(s, sparse=False).copy()
47.6 ms ± 835 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)   # <- master
15.7 ms ± 554 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <- transpose-before-take
```
> How does using `arr.T.take(codes, axis=0)` affect these for you?

See the explanation in the top post of this PR. The intent (to be clear, I didn't check again in detail, just reread what I wrote back then) was to ensure that the data is in the best memory layout for pandas' internals.

This was more important for the ArrayManager (because there it always needs to be contiguous by column, while for a 2D block we can store it either way; but even for 2D blocks, some operations will be faster afterwards if the layout is column-major, i.e. row-major in the transposed way we store it in the Block). That's also why I mentioned that we could do an if/else here to do something different for ArrayManager vs BlockManager.

But based on the timings that I got / get, I didn't see a strong reason to do so. You clearly get quite different timings, which of course gives a different impression; I have no idea why it would differ so much (what version of numpy are you using? the timings I did yesterday are with 1.20 or 1.21, depending on the pandas version).
OK, I somehow got it in my head that doing the `.T` before the `take` would keep the result F-ordered, but apparently not. Thanks for taking a look.
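This can be verified directly: `np.take` allocates a fresh C-ordered result, so an F-ordered input does not make the output F-ordered (standalone sketch):

```python
import numpy as np

codes = np.array([0, 2, 1, 2, 0])
f_ordered = np.eye(3, dtype=np.uint8).T  # F-contiguous view of the identity

result = f_ordered.take(codes, axis=0)
print(result.flags["C_CONTIGUOUS"])  # True: take returned a C-ordered copy
print(result.flags["F_CONTIGUOUS"])  # False
```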
@simonjayhawkins a good-ish explanation here would be that the results I'm getting locally are a wild outlier. Can you give it a try to test this hypothesis?
(From the top post of the PR:)

This ensures that the 2D ndarray created in `get_dummies` for a single column is directly in the "right" layout as the internals prefer it (column-major, so it's row-major in the transposed `Block.values`).

For the ArrayManager, this gives a speed-up for `get_dummies` itself (because we currently still copy the arrays to ensure a proper contiguous memory layout for each column).

For the current BlockManager, it makes the actual get_dummies a bit slower, but can make a follow-up operation faster (where a copy, or consolidation, or a reduction, etc. would happen).

The timing is from the following example taken from the `frame_methods.py::GetDummies` ASV case.

So it makes the actual `get_dummies` a bit slower for the BlockManager, and thus I could also explicitly do this only for the ArrayManager. But since it improves follow-up operations, I think it can actually be beneficial overall for the BlockManager as well. Either is fine for me.
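Since the local timings reported in this thread disagree, the array construction itself can be timed in isolation, without pandas, with a sketch like the following (results will vary by machine and numpy version, so no expected numbers are given):

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)
codes = rng.integers(0, 10, size=100_000)
eye = np.eye(10, dtype=np.uint8)

# Time the old row-take against the new column-take-plus-transpose.
row_take = timeit.timeit(lambda: eye.take(codes, axis=0), number=100)
col_take_t = timeit.timeit(lambda: eye.take(codes, axis=1).T, number=100)

print(f"take(axis=0):   {row_take:.4f}s")
print(f"take(axis=1).T: {col_take_t:.4f}s")
```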