PERF: get_dummies create array of 0/1s with appropriate layout for internals #40299
```diff
@@ -1039,7 +1039,8 @@ def get_empty_frame(data) -> DataFrame:
         return out

     else:
-        dummy_mat = np.eye(number_of_cols, dtype=dtype).take(codes, axis=0)
+        # take on axis=1 + transpose to ensure ndarray layout is column-major
+        dummy_mat = np.eye(number_of_cols, dtype=dtype).take(codes, axis=1).T
```
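To illustrate what the changed line does to memory layout, here is a small standalone sketch (the `codes` and `number_of_cols` values below are made-up toy inputs; in pandas they come from factorizing the input data):

```python
import numpy as np

codes = np.array([0, 2, 1, 2, 0])  # hypothetical factorized category codes
number_of_cols = 3
dtype = np.uint8

# Old approach: take rows of the identity matrix -> C-contiguous (row-major)
row_major = np.eye(number_of_cols, dtype=dtype).take(codes, axis=0)

# New approach: take columns, then transpose -> F-contiguous (column-major),
# since .take allocates a fresh C-ordered array and .T is a layout-flipping view
col_major = np.eye(number_of_cols, dtype=dtype).take(codes, axis=1).T

assert (row_major == col_major).all()   # same 0/1 values either way
print(row_major.flags["C_CONTIGUOUS"])  # True
print(col_major.flags["F_CONTIGUOUS"])  # True
```

Both variants produce the same dummy matrix; only the strides differ, which is what the discussion below about BlockManager vs. ArrayManager performance is about.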
Review comments:

- @jorisvandenbossche investigating a perf regression in […]

- @jorisvandenbossche gentle ping; `time_get_dummies_1d` is the biggest perf regression I'm seeing vs 1.2.0.

- @jorisvandenbossche gentle ping.

- Can you point to the performance regression? Is it something that showed up in the online ASV benchmarks (or just the one that Simon mentioned above)? Note that I mentioned above that this will actually cause a slowdown for BlockManager if you isolate the `get_dummies`, so some performance regression was anticipated (see the timings in the top post, which has both a timing for only `get_dummies` and a timing for `get_dummies` + copy). As I noted back then, it could also special-case this to do it only for ArrayManager.

- It's the same one Simon linked, yes, but I see a much worse deterioration locally (likely measuring from a different baseline): profiling with […]

- I see:

  ```python
  In [2]: %timeit pd.get_dummies(s, sparse=False)
  4.22 ms ± 452 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <-- 1.2.5
  12.6 ms ± 1.88 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <-- 1.3.3
  12.7 ms ± 559 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <-- master

  In [3]: %timeit pd.get_dummies(s, sparse=False).copy()
  14.7 ms ± 421 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <-- 1.2.5
  13.6 ms ± 452 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <-- 1.3.3
  13.2 ms ± 215 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <-- master
  ```

- How does using […]

- See the explanation in the top post of this PR. The intent (I didn't check again in detail, to be clear; just rereading what I wrote back then) was to ensure that the data was in the best memory layout for pandas' internals. But based on the timings that I got / get, I didn't see a strong reason to do so. You get quite different timings, which of course gives a different impression; no idea why it would give such a difference (what version of numpy are you using? The timings I did yesterday are with 1.20 or 1.21, depending on the pandas version).

- OK, I somehow got it in my head that doing the […] @simonjayhawkins a good-ish explanation here would be that the results I'm getting locally are a wild outlier. Can you give it a try to test this hypothesis?
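For reference, the `%timeit` comparison above can be reproduced with a setup along these lines. The actual Series `s` used in the timings is not shown in this excerpt, so the input here is a hypothetical stand-in of similar scale; absolute timings will differ from the ones quoted:

```python
import numpy as np
import pandas as pd

# Hypothetical input: the real `s` from the thread is not shown, so this is
# only a stand-in with ~100k rows and a handful of distinct categories.
rng = np.random.default_rng(0)
s = pd.Series(rng.integers(0, 10, size=100_000).astype(str))

# The first timing measures only this call.
dummies = pd.get_dummies(s, sparse=False)
print(dummies.shape)  # (n_rows, one column per distinct value)

# The second timing adds a .copy(); copying the resulting frame is where the
# column-major layout of the dummy matrix was expected to pay off.
copied = pd.get_dummies(s, sparse=False).copy()
```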
```diff
     if not dummy_na:
         # reset NaN GH4446
```
Review thread on this line:

- Use the `order` parameter here: https://numpy.org/doc/stable/reference/generated/numpy.eye.html, or maybe `.reshape` (as `.take` gives a copy).

- To clarify: the `take` is needed for the actual `get_dummies` operation. But given your suggestion, I thought that specifying `order` in `eye` might speed up the `take` call; from a quick test that doesn't seem to matter.

- Guess this is fine.
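The quick test mentioned above is easy to replicate. The sketch below (my own illustration, not code from the PR) shows why `order` in `np.eye` makes no difference to the result: `.take` allocates a fresh array, and the new array's contiguity does not inherit the input's memory order:

```python
import numpy as np

codes = np.array([0, 2, 1, 2, 0])

eye_c = np.eye(3, dtype=np.uint8)             # default row-major identity
eye_f = np.eye(3, dtype=np.uint8, order="F")  # column-major identity

# .take returns a newly allocated array in both cases; the output layout is
# the same regardless of how the identity matrix was laid out in memory.
out_c = eye_c.take(codes, axis=1)
out_f = eye_f.take(codes, axis=1)

assert (out_c == out_f).all()
assert out_c.flags["C_CONTIGUOUS"] and out_f.flags["C_CONTIGUOUS"]
```

This matches the observation in the thread that the `order` keyword did not speed up the `take` call, so the `.take(..., axis=1).T` transpose remained the way to get a column-major result.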