PERF: refactor string construction benchmark #52410
Conversation
Thanks @ngoldbaum for your PR!

Agree that this is much easier to read, good one.

Could you please explain why we now need to pass `dtype=self.dtype_mapping[dtype]`? It looks like the default is `object`, so this only makes a difference for the `str` dtype - why is it better to pass `dtype=str` for that one?
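For context, a hypothetical sketch of the mapping under discussion, reconstructed from the identifiers quoted in this thread rather than copied from the diff:

```python
# Hypothetical reconstruction of the mapping the question refers to: only the
# "str" entry maps to a non-object dtype, so passing
# dtype=self.dtype_mapping[dtype] only changes the source array for that case.
dtype_mapping = {
    "str": "str",              # numpy fixed-width unicode ("<U...") strings
    "string[python]": object,
    "string[pyarrow]": object,
}
```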
asv_bench/benchmarks/strings.py (outdated):

```python
        )
        if pd_type == "series":
            self.arr = series_arr
        if pd_type == "frame":
```
nit: elif
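Applied to the quoted snippet, the suggestion would look something like the following (the body of the `frame` branch is elided in the diff above, so it is left elided here too):

```python
if pd_type == "series":
    self.arr = series_arr
elif pd_type == "frame":
    ...
```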
In principle we don't, but IMO it makes the timing benchmarks fairer because it avoids a copy to numpy's string dtype inside pandas. I have a version of this benchmark using the new string dtype I've been working on, and avoiding that copy is substantially faster there.
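A minimal numpy-level sketch of the copy being described here - hypothetical and not from the PR, with illustrative array contents and sizes:

```python
import numpy as np

values = [str(i) * 10 for i in range(1_000)]

obj_arr = np.array(values, dtype=object)  # object-dtype source array
str_arr = np.array(values, dtype=str)     # fixed-width unicode, "<U30" here

# Converting the object array to numpy's string dtype allocates and copies;
# an array that already has a string dtype is passed through unchanged.
converted = np.asarray(obj_arr, dtype=str)
print(converted.dtype)                             # <U30
print(np.asarray(str_arr, dtype=str) is str_arr)   # True: no copy needed
```

Building the source array with the matching dtype in `setup` moves that conversion cost out of the timed construction, which is the fairness point above.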
Looks good, thanks @ngoldbaum!
Leaving open a bit in case others have comments.
Thanks @ngoldbaum
* PERF: refactor string construction benchmark
* CLN: respond to review comments
For `asv` `peakmem` benchmarks, the memory cost of the setup function is included in the benchmark result. The way this benchmark is currently written, the memory usage is dominated by constructing `series_cat_arr`, so the memory benchmark results aren't very useful for `Series` and `DataFrame`. I also added `string[pyarrow]` to the benchmark because why not?

The results before the refactor:

[benchmark output not captured]

And after:

[benchmark output not captured]

Note the different, lower peak memory usages. I also find it a bit easier to compare results as two parameterized benchmarks.
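Putting the pieces from the thread together, the refactored pattern might look roughly like the sketch below. Everything beyond the quoted snippets - the array size, the reshape, the method names, and the `categorical_series` branch - is an assumption rather than the merged diff, and the `string[pyarrow]` case additionally assumes pyarrow is installed:

```python
# Sketch of the refactored benchmark pattern, reconstructed from the review
# thread; details beyond the quoted snippets are assumptions, not the merged
# code.
import numpy as np
from pandas import Categorical, DataFrame, Series


class Construction:
    param_names = ["pd_type", "dtype"]
    params = (
        ["series", "frame", "categorical_series"],
        ["str", "string[python]", "string[pyarrow]"],
    )

    pd_mapping = {"series": Series, "frame": DataFrame, "categorical_series": Series}
    dtype_mapping = {"str": "str", "string[python]": object, "string[pyarrow]": object}

    def setup(self, pd_type, dtype):
        # Build only what the selected parameter combination needs, so the
        # peakmem results are not dominated by unrelated setup allocations.
        series_arr = np.array(
            [str(i) * 10 for i in range(100_000)], dtype=self.dtype_mapping[dtype]
        )
        if pd_type == "series":
            self.arr = series_arr
        elif pd_type == "frame":
            self.arr = series_arr.reshape((50_000, 2))
        elif pd_type == "categorical_series":
            self.arr = Categorical(series_arr)

    def time_construction(self, pd_type, dtype):
        self.pd_mapping[pd_type](self.arr, dtype=dtype)

    def peakmem_construction(self, pd_type, dtype):
        self.pd_mapping[pd_type](self.arr, dtype=dtype)
```

Because each parameter combination builds only its own source array in `setup`, the `peakmem_*` result reflects the construction under test rather than leftover setup allocations.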