PERF: Updated andrews_curves to use Numpy arrays for its samples #11534

khs26 · 2015-11-06T17:32:41Z

Hello,

I hope I've followed the contribution guidelines correctly, but am happy to change things if necessary.

I noticed that andrews_curves doesn't make use of numpy arrays in what I thought was a sensible use case: for generating its samples.

I added a test which uses variable length random data, so that I could check the timing changes between the numpy and non-numpy versions and found the following (rough data):

Length of data	Time / s (non-numpy)	Time / s (numpy)
Without test	3.43	3.84
10	4.69	4.21
100	6.58	5.20
1000	20.67	16.41
10000	162.17	125.37

The test adds some overhead (though it is decorated with @slow), so I'm happy to amend the commit and remove it. Otherwise, the changes seem to have resulted in a small speed up, which becomes more important for larger data (my original motivation, since I was trying to do it with a 100k x 5 dataframe).

Thanks,

Kyle

sinhrks · 2015-11-07T00:22:29Z

lgtm. @TomAugspurger ?

TomAugspurger · 2015-11-07T00:28:40Z

Yeah. @khs26 could you add an item to the release notes under enhancements? doc/source/whatsnew/v0.17.0

jreback · 2015-11-07T14:36:58Z

@khs26 are those times in seconds?

jreback · 2015-11-07T14:38:35Z

pandas/tools/plotting.py

            x1 = amplitudes[0]
            result = x1 / sqrt(2.0)
            harmonic = 1.0
            for x_even, x_odd in zip(amplitudes[1::2], amplitudes[2::2]):
-                result += (x_even * sin(harmonic * x) +
-                           x_odd * cos(harmonic * x))
+                result += (x_even * np.sin(harmonic * t) +


this entire loops can be easily vectorized if you want to give that a shot

I did look at vectorising the loops, but I think it potentially gets problematic with Numpy slices being views and needing to resize the arrays. I didn't want to get a marginal performance increase by making the code way less clear. I'll certainly give it a look, if I have time, though.

Currently writing up my PhD, so need to make sure I keep focusing on the writing. I must say that you guys have made it really easy to contribute, what with being knowledgeable and decent human beings all at the same time. :) Also, the contribution guidelines are excellent.

khs26 · 2015-11-08T11:37:36Z

I'll add changes to the release notes.
Yeah, the times are in seconds and rough timing for the whole test (as opposed to just my changes). I'm still getting familiar with nose and testing generally in Python.

jreback · 2015-11-08T18:44:01Z

doc/source/whatsnew/v0.17.1.txt

@@ -64,6 +64,7 @@ Performance Improvements
 - Improved performance of ``rolling_median`` (:issue:`11450`)

 - Improved performance to ``to_excel`` (:issue:`11352`)
+- Improved performance of ``andrews_curves``


add the issue number here (use this pr number as their is not an associated issue)

jreback · 2015-11-08T18:44:56Z

ok, some comments.

Let's create another issue that shows the performance implications (after this PR), and maybe you (or another brave soul) can address in the future.

khs26 · 2015-11-10T15:21:54Z

I actually found a bit of time to fix it up. I'm going to update the branch shortly.

This change didn't really change things hugely. I suspected it would only make a significant difference on wide dataframes (i.e. those with a large number of coefficients). I also think it would be pushing the utility of Andrews plots as a visualisation technique. It's definitely a close thing between a slight performance increase, with slightly lessened readability.

Here are the timings:

(length, width) of data	Time / s (non-numpy)	Time / s (first vectorisation)	Time / s (both vectorisation)
length 10
10, 1	0.04	0.03	0.03
10, 10	0.07	0.06	0.07
10, 100	0.16	0.10	0.12
10, 1000	0.86	0.26	0.20
length 100
100, 1	0.29	0.31	0.25
100, 10	0.47	0.32	0.29
100, 100	1.06	0.55	0.42
100, 1000	7.74	1.59	1.22
length 1000
1000, 1	1.88	1.70	1.41
1000, 10	2.87	1.44	2.06
1000, 100	9.83	3.07	2.33
1000, 1000	75.11	13.48	10.19
length 10000
10000, 1	3.31	2.20	2.00
10000, 10	4.68	2.29	2.48
10000, 100	16.94	4.22	4.06
10000, 1000	149.49	25.25	19.45

khs26 · 2015-11-10T15:25:49Z

I've left it as a separate commit for the moment, but will squash it, if we're happy with it.

jreback · 2015-11-13T13:58:40Z

can you add an asv benchmark for this? http://pandas.pydata.org/pandas-docs/stable/contributing.html#running-the-performance-test-suite

otherwise looks good.

jreback · 2015-11-13T14:02:50Z

pandas/tools/plotting.py

@@ -560,14 +575,14 @@ def f(x):
    for i in range(n):


the way to get a massive speedup, is do this ALL in a vectorized way, e.g.
pass in the entire numpy array (df.values), and do the calculation, then plot in a loop

Oh, of course. I'll do that and the asv benchmark. Will probably be over the weekend.

jreback · 2015-11-18T11:56:14Z

@khs26 can you rebase / update

khs26 · 2015-11-22T19:56:47Z

Sorry that took a while, been quite busy. I ran a profile on it, and it turns out that the existing changes make the slowest part be drawing in matplotlib (with the same/lower scaling).

I added an asv benchmark as well. Hopefully I got it right, it seems to run ok and I copied the basic template from timeseries plotting in benchmarks/plotting.py.

khs26 · 2015-11-22T19:57:09Z

Also rebased the commit on top of the latest master, so it should be good to merge.

jreback · 2015-11-22T20:06:04Z

doc/source/whatsnew/v0.17.1.txt

@@ -146,6 +146,7 @@ Performance Improvements
 - Performance improvement in ``Categorical.remove_unused_categories``, (:issue:`11643`).
 - Improved performance of ``Series`` constructor with no data and ``DatetimeIndex`` (:issue:`11433`)
 - Improved performance of ``shift``, ``cumprod``, and ``cumsum`` with groupby (:issue:`4095`)
+- Improved performance of ``andrews_curves`` (:issue:`11534`)


pls move to 0.18.0 file

DOC: Added some documentation to andrews_curves TST: Added a variable length test to TestDataFramePlots.test_andrews_curves

jreback · 2015-11-23T00:06:39Z

looks good. ping on green.

khs26 · 2015-11-23T13:02:40Z

Travis is happy and I'm gonna leave it be now, so can merge whenever you want.

PERF: Updated andrews_curves to use Numpy arrays for its samples

jreback · 2015-11-24T13:11:24Z

thank you sir!

khs26 changed the title ~~Updated andrews_curves to use Numpy arrays for its samples~~ PERF: Updated andrews_curves to use Numpy arrays for its samples Nov 6, 2015

sinhrks added the Visualization plotting label Nov 7, 2015

sinhrks added the Performance Memory or execution speed performance label Nov 7, 2015

jreback added this to the 0.17.1 milestone Nov 7, 2015

jreback reviewed Nov 7, 2015
View reviewed changes

khs26 force-pushed the numpify-andrews-curves branch from cbaae56 to 1bc1e54 Compare November 8, 2015 17:13

jreback reviewed Nov 8, 2015
View reviewed changes

khs26 force-pushed the numpify-andrews-curves branch from 1bc1e54 to 019e7f0 Compare November 8, 2015 21:33

khs26 mentioned this pull request Nov 8, 2015

PERF: Use numpy arrays in andrews_curves plots #11554

Closed

jreback reviewed Nov 13, 2015
View reviewed changes

jreback modified the milestones: Next Major Release, 0.17.1 Nov 18, 2015

khs26 force-pushed the numpify-andrews-curves branch 2 times, most recently from 5368ae9 to b61bca0 Compare November 22, 2015 19:53

jreback reviewed Nov 22, 2015
View reviewed changes

jreback modified the milestones: 0.18.0, Next Major Release Nov 22, 2015

PERF: Updated andrews_curves to use a Numpy array for samples

f3cb813

DOC: Added some documentation to andrews_curves TST: Added a variable length test to TestDataFramePlots.test_andrews_curves

khs26 force-pushed the numpify-andrews-curves branch from b61bca0 to f3cb813 Compare November 22, 2015 23:29

jreback added a commit that referenced this pull request Nov 24, 2015

Merge pull request #11534 from khs26/numpify-andrews-curves

5bc191a

PERF: Updated andrews_curves to use Numpy arrays for its samples

jreback merged commit 5bc191a into pandas-dev:master Nov 24, 2015

khs26 deleted the numpify-andrews-curves branch November 24, 2015 14:55

jreback added a commit that referenced this pull request Nov 29, 2015

COMPAT: remove warnings from #11534

d9e679a

Uh oh!

PERF: Updated andrews_curves to use Numpy arrays for its samples #11534

PERF: Updated andrews_curves to use Numpy arrays for its samples #11534

Uh oh!

Conversation

khs26 commented Nov 6, 2015

Uh oh!

sinhrks commented Nov 7, 2015

Uh oh!

TomAugspurger commented Nov 7, 2015

Uh oh!

jreback commented Nov 7, 2015

Uh oh!

jreback Nov 7, 2015

Choose a reason for hiding this comment

Uh oh!

khs26 Nov 8, 2015

Choose a reason for hiding this comment

Uh oh!

khs26 commented Nov 8, 2015

Uh oh!

jreback Nov 8, 2015

Choose a reason for hiding this comment

Uh oh!

jreback commented Nov 8, 2015

Uh oh!

khs26 commented Nov 10, 2015

Uh oh!

khs26 commented Nov 10, 2015

Uh oh!

jreback commented Nov 13, 2015

Uh oh!

jreback Nov 13, 2015

Choose a reason for hiding this comment

Uh oh!

khs26 Nov 13, 2015

Choose a reason for hiding this comment

Uh oh!

jreback commented Nov 18, 2015

Uh oh!

khs26 commented Nov 22, 2015

Uh oh!

khs26 commented Nov 22, 2015

Uh oh!

jreback Nov 22, 2015

Choose a reason for hiding this comment

Uh oh!

jreback commented Nov 23, 2015

Uh oh!

khs26 commented Nov 23, 2015

Uh oh!

jreback commented Nov 24, 2015

Uh oh!

Uh oh!