PERF: refactor string construction benchmark #52410

Merged

ngoldbaum
Contributor

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

For asv peakmem benchmarks, the memory cost of the setup function is included in the benchmark result. The way this benchmark is currently written, the memory usage is dominated by constructing series_cat_arr, so the memory benchmark results aren't very useful for Series and DataFrame. I also added string[pyarrow] to the benchmark because why not?
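To illustrate why a setup-heavy fixture can swamp a peak-memory figure, here is a minimal sketch using the standard library's tracemalloc. It is not how asv measures memory (asv's own mechanism is not reproduced here), and the function names and allocation sizes are hypothetical. It only shows that a peak taken over "setup plus body" reflects the largest allocation made anywhere in that window.

```python
import tracemalloc


def setup():
    # Hypothetical stand-in for a benchmark setup: it builds a large
    # temporary object that the measured body never uses.
    big_unused = bytearray(20_000_000)  # ~20 MB, only needed by other benchmarks
    data = bytearray(1_000_000)         # ~1 MB, the input the body works on
    del big_unused
    return data


def body(data):
    # Hypothetical stand-in for the measured operation: one ~1 MB copy.
    return bytes(data)


def peak_including_setup():
    # Tracing starts before setup, so the 20 MB temporary counts.
    tracemalloc.start()
    data = setup()
    body(data)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak


def peak_body_only():
    # Tracing starts after setup, so only the body's allocations count.
    data = setup()
    tracemalloc.start()
    body(data)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak


with_setup = peak_including_setup()
body_only = peak_body_only()
print(f"peak including setup: {with_setup / 1e6:.1f} MB")
print(f"peak for body only:   {body_only / 1e6:.1f} MB")
```

When the setup cost is included, the reported peak is essentially the setup's largest allocation, which is why building only the input each benchmark needs makes the numbers more informative.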

The results before the refactor:

[ 8.33%] ··· strings.Construction.peakmem_cat_series_construction
                                ok
[ 8.33%] ··· ======== ======
              dtype         
             -------- ------
               str     329M 
              string   313M 
             ======== ======

[16.67%] ··· strings.Construction.peakmem_frame_construction
                                     ok
[16.67%] ··· ======== ======
              dtype         
             -------- ------
               str     313M 
              string   313M 
             ======== ======

[25.00%] ··· strings.Construction.peakmem_series_construction
                                    ok
[25.00%] ··· ======== ======
              dtype         
             -------- ------
               str     313M 
              string   313M 
             ======== ======

[33.33%] ··· strings.Construction.time_cat_series_construction
                                   ok
[33.33%] ··· ======== ==========
              dtype             
             -------- ----------
               str     213±0ms  
              string   43.1±0ms 
             ======== ==========

[41.67%] ··· strings.Construction.time_frame_construction
                                        ok
[41.67%] ··· ======== ==========
              dtype             
             -------- ----------
               str     37.9±0ms 
              string   54.5±0ms 
             ======== ==========

[50.00%] ··· strings.Construction.time_series_construction
                                       ok
[50.00%] ··· ======== ==========
              dtype             
             -------- ----------
               str     38.1±0ms 
              string   39.5±0ms 
             ======== ==========

And after:

[25.00%] ··· strings.Construction.peakmem_construction
                                           ok
[25.00%] ··· ==================== ====== ================ =================
             --                                     dtype                  
             -------------------- -----------------------------------------
                   pd_type         str    string[python]   string[pyarrow] 
             ==================== ====== ================ =================
                    series         310M        310M              316M      
                    frame          310M        310M              316M      
              categorical_series   327M        313M              319M      
             ==================== ====== ================ =================

[50.00%] ··· strings.Construction.time_construction
                                              ok
[50.00%] ··· ==================== ========== ================ =================
             --                                       dtype                    
             -------------------- ---------------------------------------------
                   pd_type           str      string[python]   string[pyarrow] 
             ==================== ========== ================ =================
                    series         39.3±0ms      37.6±0ms          48.9±0ms    
                    frame          40.0±0ms      41.7±0ms          56.4±0ms    
              categorical_series   215±0ms       44.7±0ms          57.0±0ms    
             ==================== ========== ================ =================

Note the different, and lower, peak memory usage. I also find the results a bit easier to compare when they come from two parameterized benchmarks.
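For readers unfamiliar with asv's conventions, the following is a rough sketch of what a benchmark class parameterized over pd_type and dtype could look like. It is not the code merged in this PR. The helper random_strings, the size attribute, and the contents of pd_mapping and dtype_mapping are assumptions made for illustration; only the parameter names, the parameter values, and the dtype_mapping lookup appear in this conversation.

```python
import numpy as np
import pandas as pd


def random_strings(size, nchars=10, dtype=object, seed=0):
    """Hypothetical helper: an array of random fixed-length lowercase strings."""
    rng = np.random.default_rng(seed)
    letters = np.array(list("abcdefghijklmnopqrstuvwxyz"))
    chars = rng.choice(letters, size=(size, nchars))
    return np.array(["".join(row) for row in chars], dtype=dtype)


class Construction:
    # asv runs every time_* / peakmem_* method once per combination of params.
    params = (
        ["series", "frame", "categorical_series"],
        ["str", "string[python]", "string[pyarrow]"],
    )
    param_names = ["pd_type", "dtype"]

    # Assumed mappings: which constructor to call, and which numpy dtype
    # the input array is created with for each benchmark dtype.
    pd_mapping = {
        "series": pd.Series,
        "frame": pd.DataFrame,
        "categorical_series": pd.Series,
    }
    dtype_mapping = {
        "str": "str",
        "string[python]": object,
        "string[pyarrow]": object,
    }
    size = 10**5  # assumed number of strings (must be even for the frame case)

    def setup(self, pd_type, dtype):
        # Build only the input this parameter combination needs, so the
        # peakmem result is not dominated by an unrelated fixture.
        arr = random_strings(self.size, dtype=self.dtype_mapping[dtype])
        if pd_type == "series":
            self.arr = arr
        elif pd_type == "frame":
            self.arr = arr.reshape((self.size // 2, 2)).copy()
        elif pd_type == "categorical_series":
            self.arr = pd.Categorical(arr)

    def time_construction(self, pd_type, dtype):
        self.pd_mapping[pd_type](self.arr, dtype=dtype)

    def peakmem_construction(self, pd_type, dtype):
        self.pd_mapping[pd_type](self.arr, dtype=dtype)
```

Under asv the class is discovered and run automatically. Outside asv, a single combination can be exercised by hand with bench = Construction(); bench.setup("frame", "str"); bench.time_construction("frame", "str"). The string[pyarrow] combinations additionally require pyarrow to be installed.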

@MarcoGorelli MarcoGorelli added the Benchmark Performance (ASV) benchmarks label Apr 4, 2023
Member

@MarcoGorelli MarcoGorelli left a comment

Thanks @ngoldbaum for your PR!

Agree that this is much easier to read, good one

Could you please explain why we now need to pass dtype=self.dtype_mapping[dtype]? It looks like the default is object, so this only makes a difference for the str dtype. Why is it better to pass dtype=str for that one?

        )
        if pd_type == "series":
            self.arr = series_arr
        if pd_type == "frame":
Member

nit: elif

@ngoldbaum
Contributor Author

In principle we don’t, but IMO it makes the timing benchmarks fairer because it avoids a copy to numpy’s string dtype inside pandas. I have a version of this benchmark that uses the new string dtype I’ve been working on, and avoiding that copy is substantially faster there.
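As a small numpy-level illustration of the distinction being discussed (a sketch only; it says nothing about what pandas does internally, which is the author's claim above): an array created with dtype=object holds references to Python str objects, an array created with dtype=str is a fixed-width unicode array, and converting from the first to the second allocates a new array.

```python
import numpy as np

words = ["alpha", "beta", "gamma"]

# Object array: each element is a reference to a Python str object.
as_object = np.array(words, dtype=object)

# numpy string array: fixed-width unicode, created directly.
as_unicode = np.array(words, dtype=str)

# Converting an object array to numpy's string dtype builds a new array.
converted = as_object.astype(str)

print(as_object.dtype, as_unicode.dtype, converted.dtype)
print(np.shares_memory(as_object, converted))
```

If the input already has numpy's string dtype, no such conversion from an object array is needed before the data is handed on, which is the cost the benchmark input is meant to exclude from the timing.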

@MarcoGorelli MarcoGorelli added this to the 2.1 milestone Apr 5, 2023
Member

@MarcoGorelli MarcoGorelli left a comment

Looks good, thanks @ngoldbaum!

Leaving open a bit in case others have comments

@mroeschke mroeschke merged commit 8e2746e into pandas-dev:main Apr 6, 2023
@mroeschke
Member

Thanks @ngoldbaum

topper-123 pushed a commit to topper-123/pandas that referenced this pull request Apr 6, 2023
* PERF: refactor string construction benchmark

* CLN: respond to review comments