Skip to content

PERF: Updated andrews_curves to use Numpy arrays for its samples #11534

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Nov 24, 2015

Conversation

khs26
Copy link

@khs26 khs26 commented Nov 6, 2015

Hello,

I hope I've followed the contribution guidelines correctly, but am happy to change things if necessary.

I noticed that andrews_curves doesn't make use of numpy arrays in what I thought was a sensible use case: for generating its samples.

I added a test which uses variable length random data, so that I could check the timing changes between the numpy and non-numpy versions and found the following (rough data):

Length of data Time / s (non-numpy) Time / s (numpy)
Without test 3.43 3.84
10 4.69 4.21
100 6.58 5.20
1000 20.67 16.41
10000 162.17 125.37

The test adds some overhead (though it is decorated with @slow), so I'm happy to amend the commit and remove it. Otherwise, the changes seem to have resulted in a small speed up, which becomes more important for larger data (my original motivation, since I was trying to do it with a 100k x 5 dataframe).

Thanks,

Kyle

@khs26 khs26 changed the title Updated andrews_curves to use Numpy arrays for its samples PERF: Updated andrews_curves to use Numpy arrays for its samples Nov 6, 2015
@sinhrks sinhrks added the Visualization plotting label Nov 7, 2015
@sinhrks
Copy link
Member

sinhrks commented Nov 7, 2015

lgtm. @TomAugspurger ?

@sinhrks sinhrks added the Performance Memory or execution speed performance label Nov 7, 2015
@TomAugspurger
Copy link
Contributor

Yeah. @khs26 could you add an item to the release notes under enhancements? doc/source/whatsnew/v0.17.0

@jreback jreback added this to the 0.17.1 milestone Nov 7, 2015
@jreback
Copy link
Contributor

jreback commented Nov 7, 2015

@khs26 are those times in seconds?

x1 = amplitudes[0]
result = x1 / sqrt(2.0)
harmonic = 1.0
for x_even, x_odd in zip(amplitudes[1::2], amplitudes[2::2]):
result += (x_even * sin(harmonic * x) +
x_odd * cos(harmonic * x))
result += (x_even * np.sin(harmonic * t) +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this entire loops can be easily vectorized if you want to give that a shot

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did look at vectorising the loops, but I think it potentially gets problematic with Numpy slices being views and needing to resize the arrays. I didn't want to get a marginal performance increase by making the code way less clear. I'll certainly give it a look, if I have time, though.

Currently writing up my PhD, so need to make sure I keep focusing on the writing. I must say that you guys have made it really easy to contribute, what with being knowledgeable and decent human beings all at the same time. :) Also, the contribution guidelines are excellent.

@khs26
Copy link
Author

khs26 commented Nov 8, 2015

  1. I'll add changes to the release notes.

  2. Yeah, the times are in seconds and rough timing for the whole test (as opposed to just my changes). I'm still getting familiar with nose and testing generally in Python.

@khs26 khs26 force-pushed the numpify-andrews-curves branch from cbaae56 to 1bc1e54 Compare November 8, 2015 17:13
@@ -64,6 +64,7 @@ Performance Improvements
- Improved performance of ``rolling_median`` (:issue:`11450`)

- Improved performance to ``to_excel`` (:issue:`11352`)
- Improved performance of ``andrews_curves``
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add the issue number here (use this pr number as their is not an associated issue)

@jreback
Copy link
Contributor

jreback commented Nov 8, 2015

ok, some comments.

Let's create another issue that shows the performance implications (after this PR), and maybe you (or another brave soul) can address in the future.

@khs26
Copy link
Author

khs26 commented Nov 10, 2015

I actually found a bit of time to fix it up. I'm going to update the branch shortly.

This change didn't really change things hugely. I suspected it would only make a significant difference on wide dataframes (i.e. those with a large number of coefficients). I also think it would be pushing the utility of Andrews plots as a visualisation technique. It's definitely a close thing between a slight performance increase, with slightly lessened readability.

Here are the timings:

(length, width) of data Time / s (non-numpy) Time / s (first vectorisation) Time / s (both vectorisation)
length 10
10, 1 0.04 0.03 0.03
10, 10 0.07 0.06 0.07
10, 100 0.16 0.10 0.12
10, 1000 0.86 0.26 0.20
length 100
100, 1 0.29 0.31 0.25
100, 10 0.47 0.32 0.29
100, 100 1.06 0.55 0.42
100, 1000 7.74 1.59 1.22
length 1000
1000, 1 1.88 1.70 1.41
1000, 10 2.87 1.44 2.06
1000, 100 9.83 3.07 2.33
1000, 1000 75.11 13.48 10.19
length 10000
10000, 1 3.31 2.20 2.00
10000, 10 4.68 2.29 2.48
10000, 100 16.94 4.22 4.06
10000, 1000 149.49 25.25 19.45

@khs26
Copy link
Author

khs26 commented Nov 10, 2015

I've left it as a separate commit for the moment, but will squash it, if we're happy with it.

@jreback
Copy link
Contributor

jreback commented Nov 13, 2015

can you add an asv benchmark for this? http://pandas.pydata.org/pandas-docs/stable/contributing.html#running-the-performance-test-suite

otherwise looks good.

@@ -560,14 +575,14 @@ def f(x):
for i in range(n):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the way to get a massive speedup, is do this ALL in a vectorized way, e.g.
pass in the entire numpy array (df.values), and do the calculation, then plot in a loop

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, of course. I'll do that and the asv benchmark. Will probably be over the weekend.

@jreback jreback modified the milestones: Next Major Release, 0.17.1 Nov 18, 2015
@jreback
Copy link
Contributor

jreback commented Nov 18, 2015

@khs26 can you rebase / update

@khs26 khs26 force-pushed the numpify-andrews-curves branch 2 times, most recently from 5368ae9 to b61bca0 Compare November 22, 2015 19:53
@khs26
Copy link
Author

khs26 commented Nov 22, 2015

Sorry that took a while, been quite busy. I ran a profile on it, and it turns out that the existing changes make the slowest part be drawing in matplotlib (with the same/lower scaling).

I added an asv benchmark as well. Hopefully I got it right, it seems to run ok and I copied the basic template from timeseries plotting in benchmarks/plotting.py.

@khs26
Copy link
Author

khs26 commented Nov 22, 2015

Also rebased the commit on top of the latest master, so it should be good to merge.

@@ -146,6 +146,7 @@ Performance Improvements
- Performance improvement in ``Categorical.remove_unused_categories``, (:issue:`11643`).
- Improved performance of ``Series`` constructor with no data and ``DatetimeIndex`` (:issue:`11433`)
- Improved performance of ``shift``, ``cumprod``, and ``cumsum`` with groupby (:issue:`4095`)
- Improved performance of ``andrews_curves`` (:issue:`11534`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls move to 0.18.0 file

@jreback jreback modified the milestones: 0.18.0, Next Major Release Nov 22, 2015
DOC: Added some documentation to andrews_curves
TST: Added a variable length test to TestDataFramePlots.test_andrews_curves
@khs26 khs26 force-pushed the numpify-andrews-curves branch from b61bca0 to f3cb813 Compare November 22, 2015 23:29
@jreback
Copy link
Contributor

jreback commented Nov 23, 2015

looks good. ping on green.

@khs26
Copy link
Author

khs26 commented Nov 23, 2015

Travis is happy and I'm gonna leave it be now, so can merge whenever you want.

jreback added a commit that referenced this pull request Nov 24, 2015
PERF: Updated andrews_curves to use Numpy arrays for its samples
@jreback jreback merged commit 5bc191a into pandas-dev:master Nov 24, 2015
@jreback
Copy link
Contributor

jreback commented Nov 24, 2015

thank you sir!

@khs26 khs26 deleted the numpify-andrews-curves branch November 24, 2015 14:55
jreback added a commit that referenced this pull request Nov 29, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Performance Memory or execution speed performance Visualization plotting
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants