PERF: refactor string construction benchmark #52410

ngoldbaum · 2023-04-04T16:04:24Z

For asv peakmem benchmarks, the memory cost of the setup function is included in the benchmark result. The way this benchmark is currently written, the memory usage is dominated by constructing series_cat_arr, so the memory benchmark results aren't very useful for Series and DataFrame. I also added string[pyarrow] to the benchmark because why not?
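To make the point about setup cost concrete, here is a rough, standard-library-only analogy. It uses `tracemalloc` rather than the process-level peak that asv reports, and `setup`, `bench`, and `peak_bytes` are hypothetical names invented for this sketch, not part of asv or pandas. The idea is only that a peak is a high-water mark, so a large allocation made during setup still shows up even when the benchmarked code itself allocates very little.

```python
import tracemalloc


def setup():
    # Hypothetical setup: builds a large temporary, keeps only a small slice.
    temporary = [str(i) for i in range(200_000)]
    return temporary[:1_000]


def bench(data):
    # Hypothetical benchmark body: allocates very little.
    return [s.upper() for s in data]


def peak_bytes(include_setup):
    """Return the traced peak, with or without the setup allocations."""
    tracemalloc.start()
    try:
        data = setup()
        if not include_setup:
            # Forget the high-water mark reached during setup (Python 3.9+).
            tracemalloc.reset_peak()
        bench(data)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


with_setup = peak_bytes(include_setup=True)
without_setup = peak_bytes(include_setup=False)
print(f"peak including setup: {with_setup / 1e6:.1f} MB")
print(f"peak excluding setup: {without_setup / 1e6:.1f} MB")
```

The first number should be far larger than the second, which is the same effect that made the old Series and DataFrame memory results hard to interpret.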

The results before the refactor:

[ 8.33%] ··· strings.Construction.peakmem_cat_series_construction
                                ok
[ 8.33%] ··· ======== ======
              dtype         
             -------- ------
               str     329M 
              string   313M 
             ======== ======

[16.67%] ··· strings.Construction.peakmem_frame_construction
                                     ok
[16.67%] ··· ======== ======
              dtype         
             -------- ------
               str     313M 
              string   313M 
             ======== ======

[25.00%] ··· strings.Construction.peakmem_series_construction
                                    ok
[25.00%] ··· ======== ======
              dtype         
             -------- ------
               str     313M 
              string   313M 
             ======== ======

[33.33%] ··· strings.Construction.time_cat_series_construction
                                   ok
[33.33%] ··· ======== ==========
              dtype             
             -------- ----------
               str     213±0ms  
              string   43.1±0ms 
             ======== ==========

[41.67%] ··· strings.Construction.time_frame_construction
                                        ok
[41.67%] ··· ======== ==========
              dtype             
             -------- ----------
               str     37.9±0ms 
              string   54.5±0ms 
             ======== ==========

[50.00%] ··· strings.Construction.time_series_construction
                                       ok
[50.00%] ··· ======== ==========
              dtype             
             -------- ----------
               str     38.1±0ms 
              string   39.5±0ms 
             ======== ==========

And after:

[25.00%] ··· strings.Construction.peakmem_construction
                                           ok
[25.00%] ··· ==================== ====== ================ =================
             --                                     dtype                  
             -------------------- -----------------------------------------
                   pd_type         str    string[python]   string[pyarrow] 
             ==================== ====== ================ =================
                    series         310M        310M              316M      
                    frame          310M        310M              316M      
              categorical_series   327M        313M              319M      
             ==================== ====== ================ =================

[50.00%] ··· strings.Construction.time_construction
                                              ok
[50.00%] ··· ==================== ========== ================ =================
             --                                       dtype                    
             -------------------- ---------------------------------------------
                   pd_type           str      string[python]   string[pyarrow] 
             ==================== ========== ================ =================
                    series         39.3±0ms      37.6±0ms          48.9±0ms    
                    frame          40.0±0ms      41.7±0ms          56.4±0ms    
              categorical_series   215±0ms       44.7±0ms          57.0±0ms    
             ==================== ========== ================ =================

Note the different, lower, peak memory usages. I also find it a bit easier to compare results as two parameterized benchmarks.
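For readers who have not seen the diff, a minimal sketch of what a two-parameter asv benchmark of this shape could look like is below. It is not the code from this PR: the random-string generation, the array sizes, and the exact contents of `pd_mapping` and `dtype_mapping` are assumptions, and `string[pyarrow]` is left out so the sketch runs without pyarrow installed. The key point is that `setup` builds only the input needed for the current `(pd_type, dtype)` combination.

```python
import itertools
import string

import numpy as np
import pandas as pd


class Construction:
    # asv runs each benchmark once per combination of these parameters.
    params = (
        ["series", "frame", "categorical_series"],
        ["str", "string[python]"],  # "string[pyarrow]" would also need pyarrow
    )
    param_names = ["pd_type", "dtype"]

    # Assumed mappings, inferred from the discussion rather than copied
    # from the diff.
    pd_mapping = {
        "series": pd.Series,
        "frame": pd.DataFrame,
        "categorical_series": pd.Series,
    }
    dtype_mapping = {"str": str, "string[python]": object}

    def setup(self, pd_type, dtype):
        # Build only the input this parameter combination needs.
        rng = np.random.default_rng(0)
        alphabet = np.array(list(string.ascii_letters + string.digits))
        rows = rng.choice(alphabet, size=(1_000, 10))
        arr = np.array(
            ["".join(row) for row in rows], dtype=self.dtype_mapping[dtype]
        )
        if pd_type == "series":
            self.arr = arr
        elif pd_type == "frame":
            self.arr = arr.reshape((500, 2)).copy()
        elif pd_type == "categorical_series":
            self.arr = pd.Categorical(arr)

    def time_construction(self, pd_type, dtype):
        self.pd_mapping[pd_type](self.arr, dtype=dtype)

    def peakmem_construction(self, pd_type, dtype):
        self.pd_mapping[pd_type](self.arr, dtype=dtype)


# Quick local smoke run; asv itself would discover and call these methods.
bench = Construction()
for pd_type, dtype in itertools.product(*Construction.params):
    bench.setup(pd_type, dtype)
    bench.time_construction(pd_type, dtype)
```

With this layout each benchmark reports one table with `pd_type` as rows and `dtype` as columns, as in the "after" output above.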

MarcoGorelli

Thanks @ngoldbaum for your PR!

Agree that this is much easier to read, good one

Could you please explain why now we need to pass

dtype=self.dtype_mapping[dtype]

? Looks like the default is object, so here this only makes a difference for str dtype - why is it better to pass dtype=str for that one?

MarcoGorelli · 2023-04-04T16:20:07Z

asv_bench/benchmarks/strings.py

+        )
+        if pd_type == "series":
+            self.arr = series_arr
+        if pd_type == "frame":


nit: elif

ngoldbaum · 2023-04-04T16:30:52Z

In principle we don’t, but IMO it makes the timing benchmarks more fair because it avoids a copy to numpy’s string dtype inside pandas. I have a version of this benchmark with the new string dtype I’ve been working on and avoiding that copy is substantially faster there.
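As a small illustration of the two input representations being discussed (not the benchmark code itself, and with variable names made up for this sketch): the same strings can live in an object-dtype array of Python `str` objects or in a fixed-width NumPy unicode array. Generating the input with `dtype=str` gives the latter, so the `str` case starts from data that is already in a NumPy string dtype.

```python
import numpy as np
import pandas as pd

words = ["alpha", "beta", "gamma"]

# Object-dtype input: an array of references to Python str objects.
object_arr = np.array(words, dtype=object)

# dtype=str input: a fixed-width NumPy unicode array (kind "U").
unicode_arr = np.array(words, dtype=str)

print(object_arr.dtype, unicode_arr.dtype)

# Either input can be handed to the constructor with dtype=str; the
# values match, only the starting representation differs.
from_object = pd.Series(object_arr, dtype=str)
from_unicode = pd.Series(unicode_arr, dtype=str)
```

How much work pandas does for each input depends on the pandas version and the target dtype, so any timing difference is best measured rather than assumed.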

MarcoGorelli

Looks good, thanks @ngoldbaum !

Leaving open a bit in case others have comments

mroeschke · 2023-04-06T16:32:06Z

Thanks @ngoldbaum

PERF: refactor string construction benchmark

e4cb3a9

MarcoGorelli added the Benchmark (Performance (ASV) benchmarks) label Apr 4, 2023

MarcoGorelli suggested changes Apr 4, 2023

CLN: respond to review comments

47f9a32

MarcoGorelli added this to the 2.1 milestone Apr 5, 2023

MarcoGorelli approved these changes Apr 5, 2023

mroeschke approved these changes Apr 6, 2023

mroeschke merged commit 8e2746e into pandas-dev:main Apr 6, 2023

topper-123 pushed a commit to topper-123/pandas that referenced this pull request Apr 6, 2023

PERF: refactor string construction benchmark (pandas-dev#52410)

cc363c2

* PERF: refactor string construction benchmark
* CLN: respond to review comments
