PERF: Updated andrews_curves to use Numpy arrays for its samples #11534
Conversation
lgtm. @TomAugspurger ? |
Yeah. @khs26 could you add an item to the release notes under enhancements? |
@khs26 are those times in seconds? |
x1 = amplitudes[0]
result = x1 / sqrt(2.0)
harmonic = 1.0
for x_even, x_odd in zip(amplitudes[1::2], amplitudes[2::2]):
-    result += (x_even * sin(harmonic * x) +
-               x_odd * cos(harmonic * x))
+    result += (x_even * np.sin(harmonic * t) +
+               x_odd * np.cos(harmonic * t))
this entire loop can easily be vectorized if you want to give that a shot
I did look at vectorising the loops, but I think it potentially gets problematic with Numpy slices being views and needing to resize the arrays. I didn't want a marginal performance increase at the cost of making the code much less clear. I'll certainly give it a look if I have time, though.
Currently writing up my PhD, so I need to make sure I keep focusing on the writing. I must say that you guys have made it really easy to contribute, what with being knowledgeable and decent human beings at the same time. :) Also, the contribution guidelines are excellent. |
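For context, a minimal sketch of what the fully vectorised per-row computation could look like (the function name and signature are illustrative, not the code merged in this PR): an outer product over harmonics and sample points replaces the Python-level loop.

```python
import numpy as np

def andrews_curve(amplitudes, t):
    # Illustrative sketch only. Evaluates
    #   f(t) = x1/sqrt(2) + x2*sin(t) + x3*cos(t) + x4*sin(2t) + x5*cos(2t) + ...
    # for one row of coefficients, vectorised over all sample points t.
    coeffs = np.asarray(amplitudes, dtype=float)
    t = np.asarray(t, dtype=float)
    sins = coeffs[1::2]  # coefficients paired with sin(k*t)
    coss = coeffs[2::2]  # coefficients paired with cos(k*t)
    result = np.full_like(t, coeffs[0] / np.sqrt(2.0))
    # np.outer builds an (n_harmonics, n_samples) grid of k*t values, so one
    # matrix-vector product replaces the loop over harmonics.
    if len(sins):
        result += sins.dot(np.sin(np.outer(np.arange(1, len(sins) + 1), t)))
    if len(coss):
        result += coss.dot(np.cos(np.outer(np.arange(1, len(coss) + 1), t)))
    return result
```

Note that no views need resizing here: the slices are only read, and all accumulation happens in `result`.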
Force-pushed cbaae56 to 1bc1e54
@@ -64,6 +64,7 @@ Performance Improvements
 - Improved performance of ``rolling_median`` (:issue:`11450`)
 - Improved performance of ``to_excel`` (:issue:`11352`)
+- Improved performance of ``andrews_curves``
add the issue number here (use this PR number, as there is not an associated issue)
ok, some comments. Let's create another issue that shows the performance implications (after this PR), and maybe you (or another brave soul) can address it in the future. |
Force-pushed 1bc1e54 to 019e7f0
I actually found a bit of time to fix it up, so I'm going to update the branch shortly. This change didn't alter things hugely; I suspected it would only make a significant difference on wide dataframes (i.e. those with a large number of coefficients), and I also think that would be pushing the utility of Andrews plots as a visualisation technique. It's definitely a close call between a slight performance increase and slightly lessened readability. Here are the timings: |
I've left it as a separate commit for the moment, but will squash it if we're happy with it. |
can you add an asv benchmark for this? http://pandas.pydata.org/pandas-docs/stable/contributing.html#running-the-performance-test-suite otherwise looks good. |
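For reference, an asv benchmark is just a class whose `time_*` methods get timed, with `setup` run beforehand; a minimal sketch for `andrews_curves` might look like the following (the class name, data shape, and import location are assumptions, not the benchmark eventually added in this PR):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # render off-screen so the benchmark needs no display
from pandas.tools.plotting import andrews_curves  # module path at the time of this PR

class AndrewsCurvesSuite:
    # asv runs setup() before timing each repeat of the time_* methods.
    def setup(self):
        self.df = pd.DataFrame(np.random.randn(100, 9))
        self.df['Name'] = np.random.randint(0, 3, size=100)

    def time_andrews_curves(self):
        andrews_curves(self.df, 'Name')
```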
@@ -560,14 +575,14 @@ def f(x):
     for i in range(n):
The way to get a massive speedup is to do this ALL in a vectorized way, e.g. pass in the entire numpy array (`df.values`), do the calculation, and then plot in a loop.
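A rough sketch of that suggestion, assuming the whole coefficient matrix is passed in at once (`all_curves` and its signature are made up for illustration):

```python
import numpy as np

def all_curves(values, n_samples=200):
    # Illustrative sketch: evaluate every row's Andrews curve in one pass
    # over the full 2-D array (e.g. df.values), returning one curve per row.
    values = np.asarray(values, dtype=float)  # shape (n_rows, n_cols)
    t = np.linspace(-np.pi, np.pi, n_samples)
    sins = values[:, 1::2]                    # sin(k*t) coefficients per row
    coss = values[:, 2::2]                    # cos(k*t) coefficients per row
    # Start every curve at x1/sqrt(2), broadcast across all sample points.
    curves = np.broadcast_to(values[:, :1] / np.sqrt(2.0),
                             (values.shape[0], n_samples)).copy()
    if sins.shape[1]:
        curves += sins.dot(np.sin(np.outer(np.arange(1, sins.shape[1] + 1), t)))
    if coss.shape[1]:
        curves += coss.dot(np.cos(np.outer(np.arange(1, coss.shape[1] + 1), t)))
    return t, curves
```

Plotting then stays a simple loop: `t, curves = all_curves(df.values)` followed by `plt.plot(t, row)` for each row of `curves`.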
Oh, of course. I'll do that and the asv benchmark. Will probably be over the weekend.
@khs26 can you rebase / update |
Force-pushed 5368ae9 to b61bca0
Sorry that took a while, been quite busy. I ran a profile on it, and it turns out that the existing changes make the slowest part be the drawing in matplotlib. I added an asv benchmark. |
Also rebased the commit on top of the latest master, so it should be good to merge. |
@@ -146,6 +146,7 @@ Performance Improvements
 - Performance improvement in ``Categorical.remove_unused_categories`` (:issue:`11643`)
 - Improved performance of ``Series`` constructor with no data and ``DatetimeIndex`` (:issue:`11433`)
 - Improved performance of ``shift``, ``cumprod``, and ``cumsum`` with groupby (:issue:`4095`)
+- Improved performance of ``andrews_curves`` (:issue:`11534`)
pls move to 0.18.0 file
DOC: Added some documentation to andrews_curves
TST: Added a variable length test to TestDataFramePlots.test_andrews_curves
Force-pushed b61bca0 to f3cb813
looks good. ping on green. |
Travis is happy and I'm gonna leave it be now, so you can merge whenever you want. |
PERF: Updated andrews_curves to use Numpy arrays for its samples
thank you sir! |
Hello,
I hope I've followed the contribution guidelines correctly, but am happy to change things if necessary.
I noticed that andrews_curves doesn't make use of numpy arrays in what I thought was a sensible use case: generating its samples.
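To illustrate the kind of change (my sketch, not the exact diff):

```python
import numpy as np
from math import pi

samples = 200

# Before (roughly): sample points as a Python list, so every downstream
# operation has to loop in Python.
x = [-pi + 2.0 * pi * (i / float(samples)) for i in range(samples)]

# After (roughly): a NumPy array, so sin/cos and the sums over harmonics
# can operate on all sample points at once.
t = np.linspace(-pi, pi, samples)
```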
I added a test which uses variable length random data, so that I could check the timing changes between the numpy and non-numpy versions and found the following (rough data):
The test adds some overhead (though it is decorated with `@slow`), so I'm happy to amend the commit and remove it. Otherwise, the changes seem to have resulted in a small speed-up, which becomes more important for larger data (my original motivation, since I was trying to do it with a 100k x 5 dataframe).

Thanks,
Kyle