Skip to content

PERF: Using _Period for optimization #32973

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Mar 30, 2020

Conversation

ShaharNaveh
Copy link
Member

  • closes #xxxx
  • tests added / passed
  • passes black pandas
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

I did two bench marks one in a tight loop(%timeit), and one with asv.

I don't see a significant performance change (but I'm not sure I read the output correctly), maybe worth just removing the "TODO" note.


Benchmarks:

Timeit:
In [1]: import pandas as pd                                                                                   

In [2]: idx = pd.period_range("2000-01-01", periods=3)                                                        

In [3]: %timeit idx.get_loc(idx[1], "pad") 
337 µs ± 15 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) # Master
338 µs ± 7.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) # PR
ASV (period.Indexing)
       before           after         ratio
     [f20331d5]       [f71bf172]
     <master>         <PERF-index-_Period>
      1.46±0.02ms      1.45±0.03ms     0.99  period.Indexing.time_align
       8.57±0.2μs       8.51±0.2μs     0.99  period.Indexing.time_get_loc
         512±30μs          506±3μs     0.99  period.Indexing.time_intersection
         30.6±2μs       30.8±0.3μs     1.01  period.Indexing.time_series_loc
       3.95±0.3μs      3.96±0.02μs     1.00  period.Indexing.time_shallow_copy
       87.9±0.5μs       88.5±0.6μs     1.01  period.Indexing.time_unique

@@ -468,9 +467,8 @@ cdef class PeriodEngine(Int64Engine):
cdef int64_t _unbox_scalar(self, scalar) except? -1:
if scalar is NaT:
return scalar.value
if isinstance(scalar, Period):
if isinstance(scalar, periodlib._Period):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you would need to cimport _Period to get the performance improvement

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just so I don't do anything stupid, I need to create the file period.pxd at pandas/_libs/tslibs/, correct?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep

@jreback jreback added Performance Memory or execution speed performance Period Period data type labels Mar 24, 2020
@ShaharNaveh
Copy link
Member Author

ASV benchmarks:

All benchmarks:

       before           after         ratio
     [2ef8fd1b]       [5f7a6ebc]
     <master>         <PERF-index-_Period>
      1.53±0.03ms      1.47±0.02ms     0.97  period.Indexing.time_align
       8.72±0.2μs      8.40±0.06μs     0.96  period.Indexing.time_get_loc
         545±20μs         502±10μs     0.92  period.Indexing.time_intersection
       31.5±0.7μs         32.3±2μs     1.03  period.Indexing.time_series_loc
       4.19±0.2μs      4.15±0.05μs     0.99  period.Indexing.time_shallow_copy
         93.6±4μs       88.5±0.3μs     0.95  period.Indexing.time_unique

@jreback
Copy link
Contributor

jreback commented Mar 26, 2020

ok so not much improvements. @jbrockmendel

@jreback jreback added this to the 1.1 milestone Mar 26, 2020
@jbrockmendel
Copy link
Member

@MomIsBestFriend since the improvements are very small, two questions to ask:

  1. are those asv results accurate/precise? i.e. are they consistent across runs? get_loc is them only one id really expect to be affected
  2. is the build size or affected?

@ShaharNaveh
Copy link
Member Author

1. are those asv results accurate/precise?  i.e. are they consistent across runs?  `get_loc` is them only one id really expect to be affected

Yes the asv results are in fact consistent across runs, I have ran the period.Indexing.time_get_loc benchmark 5 times and it the results range is between 8.00 - 9.00 μs (for bothmaster and this PR.

2. is the build size or affected?

@jbrockmendel What do you mean by "build size"?

@jbrockmendel
Copy link
Member

and it the results range is between 8.00 - 9.00 μs (for bothmaster and this PR.

Are the "ratio" numbers consistent? I'm looking to be able to make a concrete statement to the effect of "this makes a small but non-zero improvement in X"

What do you mean by "build size"?

Roughly i mean the size in bytes of the wheel we end up distributing. Could alternatively measure the memory footprint of the imported modules, but thats trickier.

@WillAyd
Copy link
Member

WillAyd commented Mar 27, 2020

Not strongly against this, but a slight preference to just remove the TODO; this doesn't seem to have a material impact so I don't think worth any added complexity

@ShaharNaveh
Copy link
Member Author

and it the results range is between 8.00 - 9.00 μs (for bothmaster and this PR.

Are the "ratio" numbers consistent? I'm looking to be able to make a concrete statement to the effect of "this makes a small but non-zero improvement in X"

Yes the numbers are consistent, unfortunately you can't say that there is an improvement.

What do you mean by "build size"?

Roughly i mean the size in bytes of the wheel we end up distributing. Could alternatively measure the memory footprint of the imported modules, but thats trickier.

There is a size difference which is only a 4 bytes (PR is smaller)

@ShaharNaveh
Copy link
Member Author

remove the TODO; this doesn't seem to have a material impact so I don't think worth any added complexity

Agreed

@jreback
Copy link
Contributor

jreback commented Mar 29, 2020

remove the TODO; this doesn't seem to have a material impact so I don't think worth any added complexity

Agreed

agree here, @MomIsBestFriend if you can just remove the TODO and drop the other changes can accept this PR.

@jreback jreback merged commit 0d46144 into pandas-dev:master Mar 30, 2020
@jreback
Copy link
Contributor

jreback commented Mar 30, 2020

thanks @MomIsBestFriend

@ShaharNaveh ShaharNaveh deleted the PERF-index-_Period branch March 31, 2020 06:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Performance Memory or execution speed performance Period Period data type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants