
TST/PERF: make a test input creator instead of np.random.randn, mainly for perf testing #3114


Closed
jreback opened this issue Mar 21, 2013 · 31 comments
Labels
Performance (Memory or execution speed performance), Testing (pandas testing functions or related to the test suite)

Comments

@jreback
Contributor

jreback commented Mar 21, 2013

to allow easy testing of C-contiguous/F-contiguous layouts, long frames, etc.

see #3089
http://mail.python.org/pipermail/pandas-dev/2013-March/000008.html

@ghost

ghost commented Mar 21, 2013

I expected np.array(order='C') / np.array(order='F') to work, but it doesn't.

@jreback
Contributor Author

jreback commented Mar 21, 2013

I think you have to do

np.array(np.random.rand(10, 2), order='C')
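
A minimal sketch, extending that idea into the kind of test-input helper this issue is asking for (make_test_values is a hypothetical name; a fixed seed is assumed so perf runs see identical data):

import numpy as np
import pandas as pd

def make_test_values(nrows, ncols, order='C', seed=1234):
    # hypothetical helper: explicit memory layout plus a fixed seed
    rng = np.random.RandomState(seed)
    return np.asarray(rng.randn(nrows, ncols), order=order)

c_vals = make_test_values(100000, 4, order='C')
f_vals = make_test_values(100000, 4, order='F')
print(c_vals.flags['C_CONTIGUOUS'], f_vals.flags['F_CONTIGUOUS'])  # True True

# note: wrapping in a DataFrame may copy/consolidate the values internally
df = pd.DataFrame(c_vals)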

@stephenwlin
Contributor

we should exercise non-contiguous cases, too, because those result from slicing larger contiguous arrays.
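
For illustration (not from the thread), a non-contiguous operand falls out naturally from slicing a larger contiguous array:

import numpy as np

base = np.random.randn(1000, 10)   # C-contiguous parent
view = base[::2, ::2]              # strided view: neither C- nor F-contiguous
print(view.flags['C_CONTIGUOUS'], view.flags['F_CONTIGUOUS'])  # False False
print(view.strides)                # strides inherited from the parent, not recomputed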

@jreback
Contributor Author

jreback commented Mar 21, 2013

absolutely; if you have time, maybe put together a list of things we should test (on a perf basis) where these things matter?

@stephenwlin
Contributor

well, that's all I can think of now (obviously I didn't think of this before it came up as a perf issue, otherwise I would have tested it already), but I'll make a note if I think of any more

@ghost

ghost commented Mar 21, 2013

It looks like the perf regression was 50%, but on a 1ms test.
Just wondering: what scale of data are we talking about here where
this would seriously impact users?

@stephenwlin
Contributor

@y-p take is a primitive operation underlying a lot of other operations, though... I think the test suite is not representative of real-world usage in general (and performance is definitely noticeable for large data sets, like >1GB, which I've actually had to deal with for personal usage already), so it's very hard to tell what matters and what doesn't; it's like reading tea leaves

@ghost

ghost commented Mar 21, 2013

Of course; I guess I'm trying to understand what regression we're talking about, since
Jeff's original vbench run that prompted this only showed a couple of reindex_foo outliers.

Basically, you've landed a big optimization by taking array order into account, and
it has nothing to do with the original regression. Did I get that right?

@stephenwlin
Contributor

Do you mean the latest PR, #3130? That does address the regression in reindex_frame_level_align and reindex_frame_level_reindex, by turning off the original optimization for small row/column sizes, and it happens to optimize other cases as well (by allowing the optimization in cases where it could be used but currently isn't), though I don't know how well those are tested by the suite. We should move comments on this over to #3130 if that's what you're talking about. (The regression was only 25% on my machine, due to 32-bit/64-bit differences, so that PR undoes it.)

@stephenwlin
Contributor

I'm trying to test a further optimization refinement (described in the FINAL EDIT portion of this SO question) and I'm getting some extreme variability in my results (with or without the new BUILD_CACHE_DIR, although it seems to be worse with it...still not sure if there's something going on with that or if it's just a fluke).

I've talked to some performance experts in the C++ SO chatroom about this, and the consensus seems to be that randomly generating a set of indices once but reusing them over and over for multiple runs of the same test (as is the case for some, but not all, of the suite tests, I believe... can anyone confirm that the setup portions of the tests are only executed once rather than repeatedly?) is leading to inconsistent results from run to run, because of the effect of the data layout on caching and alignment.

To get more consistent results, I think we need to do one of the following:

  • Always generate new random indices per iteration of each test so that the variability gets averaged out
  • Use sets of indices hardcoded into the test suite for each test

The first is probably better, but it means that the time of the random index generation becomes part of the test results, which is non-ideal. The second might lead to less coverage if hardcoded indices don't adequately cover real use cases.

Not 100% sure how much of an issue this is, but from the results I've been seeing it seems like this is the most likely culprit...still investigating (a little bit hard because of test turnaround time).

@ghost

ghost commented Mar 23, 2013

Does vb actually do multiple iterations and averaging? I didn't know.

So you're suggesting that adding the default random seed to test_perf will actually
make variability worse?

@ghost

ghost commented Mar 23, 2013

If you can post a script example for me to verify that what you suggest eliminates variation
on my machine as well, that'd be great.

@stephenwlin
Contributor

@y-p I don't know about vb_suite either... I'm guessing the setup section is run once and all the repeated iterations are of the statement passed as the first argument to Benchmark():

i.e.

groupby_first = Benchmark('data.groupby(labels).first()', setup,
                          start_date=datetime(2012, 5, 1))

calls setup once and then iterates 'data.groupby(labels).first()' repeatedly.

In this case, setup creates a random permutation of labels:

labels = labels.take(np.random.permutation(len(labels)))

but the same labels are (presumably) used repeatedly, so any performance effect of structural features of the permutation will be present in every iteration rather than averaged out.

As per your statement

So you're suggesting that adding the default random seed to test_perf will actually
make variability worse?

I think I'm suggesting the opposite, actually (unless I'm misunderstanding you). If we fix the seed (it's not fixed right now, is it?), which I hadn't actually considered, then every run of vb_suite will use the same random numbers, so in theory we'll eliminate the effect of variability entirely (though the test may be non-representative by accident). Or we could go to the opposite extreme and make every single iteration of each test variable, so the variability hopefully averages out. What doesn't work is the middle ground, where we non-deterministically generate a random set of numbers once and reuse them over and over (which I presume is what's happening right now).
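
As a sketch of the two extremes (illustrative only: the vbench.api import path is assumed, the setup string is a simplified stand-in for the vb_suite one quoted above, and groupby_first_reperm is a hypothetical name):

from datetime import datetime
from vbench.api import Benchmark   # import path assumed

setup = """
import numpy as np
import pandas as pd
np.random.seed(1234)                                       # fixed seed: every run sees the same permutation
labels = pd.Series(np.arange(100000) % 100)
labels = labels.take(np.random.permutation(len(labels)))   # permuted once in setup, reused every iteration
data = pd.Series(np.random.randn(len(labels)))
"""

# setup-once variant: any layout effect of the permutation is baked into every iteration
groupby_first = Benchmark('data.groupby(labels).first()', setup,
                          start_date=datetime(2012, 5, 1))

# per-iteration variant: re-permute inside the timed statement so layout effects
# average out (at the cost of timing the permutation itself)
groupby_first_reperm = Benchmark(
    'data.groupby(labels.take(np.random.permutation(len(labels)))).first()',
    setup, start_date=datetime(2012, 5, 1))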

@stephenwlin
Contributor

@y-p I'm replying here instead of on the closed PR.

I've added and pushed a random seed option to test_perf. grepping for "seed"
in the vbench repo came up empty, so that ought to have taken care of it.
(It seeds python random as well as numpy.random.)
All runs now use a default seed if not specified explicitly.

I'm still seeing the same variability I've always seen with vb.

You mean fixing the seed so that every run of vb_suite generates the same values, right? Are you sure it's working so that the same values are generated every time (i.e. it's picking up the seed correctly and not being reseeded non-deterministically elsewhere)?

It's possibly only relevant for this low-level thing I'm doing right now, since data alignment makes a big difference to performance in this case. The approach is non-ideal regardless (assuming I'm correct about how setup is used).

@stephenwlin
Contributor

(Rather than just conjecturing more, I'm going to try running vb_suite overnight repeatedly to test this systematically with and without various changes to tests; it might take some time for me to get enough data to draw firm conclusions...I'll report back if/when I do...)

@ghost

ghost commented Mar 23, 2013

then I will too.

re repeatable seed making variability better, not worse: good, I thought so too.
I think repeatability is more important in general; you can do multiple runs
with different seeds to gather data on the distribution of performance.

re averaging: you've actually tested this on a specific test and got what you
expected? ...and yep, just measuring is best.

@ghost

ghost commented Mar 23, 2013

Be aware of the changes in test_perf with the seeding; make sure you're running the commit
with the behaviour you intend.

@stephenwlin
Contributor

did you push a change to test_perf already? EDIT: Never mind, I just saw it... did you verify that it works as expected, though? Same values at runtime every time?

@ghost

ghost commented Mar 23, 2013

re: setting up the seed. As I wrote, I found no other calls to seed within vbench,
and I printf-verified that both np and python return deterministic data right after the seed call.

Yes I did; that's what I noted in my comment you quoted above.

@ghost

ghost commented Mar 23, 2013

if random data produces that much variation in results, you need to keep the
data constant to test for regressions, so I made that the default behaviour:

test_perf.sh -s myseed   (default: 1234)
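
A minimal sketch of what seeding both RNGs amounts to (illustrative; seed_everything is a hypothetical name, not the actual test_perf code):

import random
import numpy as np

def seed_everything(seed=1234):
    # seed both the stdlib RNG and numpy's global RNG so every run
    # draws the same "random" benchmark data
    random.seed(seed)
    np.random.seed(seed)

seed_everything()
print(random.random(), np.random.randn(3))   # identical output on every run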

@stephenwlin
Contributor

OK, great, I'll try overnighting a bunch of runs with a fixed seed and a bunch of runs with different seeds and see how much variation I get. It'll help me narrow this down. (Not sure why I didn't think of using a fixed seed originally; that's clearly easier than rewriting the tests :D)

@ghost

ghost commented Mar 23, 2013

that's ok, you can do cache stripe calculations in your head, that's good enough 8)

@ghost

ghost commented Mar 23, 2013

fyi, I did 3 runs on a small number of vbs; each time, the random state after
the tests produced the same set of random values when called immediately after the tests.

I think that clearly proves the random data is now deterministically random, if that's not an oxymoron.

@stephenwlin
Contributor

ok, I definitely verified that my latest refinement leads to variable results, but the variability isn't reduced by setting the random seed deterministically, so it means the Python interpreter itself has non-deterministic behavior that is relevant for performance (probably in allocation patterns):

here are the results with fixed seeds:

frame_reindex_axis0                         0.6189     1.8109     0.3418
frame_reindex_axis0                         1.0079     1.2519     0.8051
frame_reindex_axis0                         0.5227     0.6118     0.8543
frame_reindex_axis0                         0.4115     0.4681     0.8792
frame_reindex_axis0                         0.4141     0.4454     0.9296
frame_reindex_axis0                         0.5567     0.5934     0.9382
frame_reindex_axis0                         0.4374     0.4661     0.9385
frame_reindex_axis0                         0.4723     0.5015     0.9417
frame_reindex_axis0                         0.4407     0.4583     0.9616
frame_reindex_axis0                         0.4676     0.4773     0.9797
frame_reindex_axis0                         0.4757     0.4827     0.9855
frame_reindex_axis0                         0.4918     0.4968     0.9898
frame_reindex_axis0                         0.4964     0.4973     0.9982
frame_reindex_axis0                         0.4925     0.4791     1.0279
frame_reindex_axis0                         0.4543     0.4272     1.0635
frame_reindex_axis0                         0.5132     0.4814     1.0662
frame_reindex_axis0                         0.5277     0.4898     1.0774
frame_reindex_axis0                         0.4869     0.4518     1.0778
frame_reindex_axis0                         0.4904     0.4527     1.0832
frame_reindex_axis0                         0.5184     0.4653     1.1142
frame_reindex_axis0                         0.5918     0.4751     1.2455
frame_reindex_axis0                         0.7440     0.5269     1.4119
frame_reindex_axis0                         1.4274     0.5206     2.7421

and with varying seeds:

frame_reindex_axis0                         0.4789     1.0887     0.4399
frame_reindex_axis0                         0.4191     0.9355     0.4480
frame_reindex_axis0                         0.5033     0.7374     0.6825
frame_reindex_axis0                         1.0038     1.4614     0.6868
frame_reindex_axis0                         0.4026     0.5564     0.7235
frame_reindex_axis0                         0.4721     0.5202     0.9076
frame_reindex_axis0                         0.4517     0.4972     0.9086
frame_reindex_axis0                         0.4463     0.4827     0.9247
frame_reindex_axis0                         0.4874     0.5116     0.9526
frame_reindex_axis0                         0.5077     0.5123     0.9909
frame_reindex_axis0                         0.5102     0.5129     0.9947
frame_reindex_axis0                         0.4499     0.4490     1.0019
frame_reindex_axis0                         0.4609     0.4550     1.0130
frame_reindex_axis0                         0.4857     0.4787     1.0147
frame_reindex_axis0                         0.5212     0.4665     1.1172
frame_reindex_axis0                         0.5331     0.4667     1.1422
frame_reindex_axis0                         0.5396     0.4514     1.1953
frame_reindex_axis0                         0.5368     0.4407     1.2181
frame_reindex_axis0                         0.8344     0.5309     1.5717
frame_reindex_axis0                         1.4105     0.8847     1.5943
frame_reindex_axis0                         1.4070     0.4603     3.0565
frame_reindex_axis0                         1.4141     0.4546     3.1107
frame_reindex_axis0                         1.4211     0.4163     3.4140

This non-determinism is probably already present, just magnified by this particular change (which is weird, because the change doesn't seem to really do much...)

@ghost

ghost commented Mar 23, 2013

I've seen Python variability with %timeit many times.
Jeff and I worked together on the csv improvements and 10% error was very common,
although that's IO, so perhaps another matter.

I'd suggest that the fact that these are sub-ms tests is playing a role,
but there was one test in the /reindex/ set that took 25 seconds, and it varied
up to 32 seconds in repeated runs on my box.

I take it you withdraw your FUD comments about the cython build cache? ;)

@stephenwlin
Contributor

I take it you withdraw your FUD comments about the cython build cache? ;)

well, to be honest, I can't be sure, since all of this is with the cache on... I doubt that's an issue; I just mean that this doesn't actually provide any info about that. I can try repeating everything with the cache off if we want to be sure.

@stephenwlin
Contributor

my change is literally what I describe in the last section of the SO question, though... so it seems like it'd be benign: all I'm doing is taking advantage of the known strides to write a loop that uses those strides at compile time rather than a generic one that evaluates the strides at run time... possibly numpy is allocating arrays with different alignments randomly
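
A quick way to check whether allocation alignment really does vary (illustrative only; the 64-byte boundary is just an example threshold):

import numpy as np

# numpy only guarantees the allocator's minimum alignment for the data
# pointer, so its offset modulo a cache-line-sized boundary can differ
# between allocations and between runs
for _ in range(5):
    a = np.random.randn(1000, 3)
    print(a.ctypes.data % 64, a.strides)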

@ghost

ghost commented Mar 23, 2013

You could measure variability for each vbench separately, and see if large variance
clumps around certain families or if it's a general thing.
It'll probably yield at least a clue.

@stephenwlin
Contributor

looked through prior numpy list discussions about this, and it turns out there's no guarantee of alignment beyond the minimum required (which for me appears to always be 32-bit, even for 64-bit doubles in my case... that's what's probably causing the problem). Apparently you can overallocate a bit (with a 1-d array), manually get a view of an aligned subarray, and set the strides yourself if you need to. Memory is cheap enough that an extra 96 bits per row plus 96 bits at the beginning (worst possible case) can't make much of a difference. The problem is that you'd have to change the allocation everywhere it's relevant... will split this off to a new issue.
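
A sketch of that over-allocate-and-slice trick (illustrative; aligned_empty is a hypothetical helper and 64-byte alignment is just an example):

import numpy as np

def aligned_empty(shape, dtype=np.float64, align=64):
    # over-allocate a 1-d byte buffer, then take a view starting at the
    # next address divisible by `align`
    dtype = np.dtype(dtype)
    nbytes = int(np.prod(shape)) * dtype.itemsize
    buf = np.empty(nbytes + align, dtype=np.uint8)
    offset = (-buf.ctypes.data) % align
    out = buf[offset:offset + nbytes].view(dtype).reshape(shape)
    assert out.ctypes.data % align == 0
    return out

arr = aligned_empty((1000, 3))
print(arr.ctypes.data % 64)   # 0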

@ghost

ghost commented Mar 23, 2013

alignment can yield big wins. most def.

need to validate that this is what's actually happening though.

@WillAyd
Member

WillAyd commented Jul 6, 2018

I believe this is no longer relevant with ASV benchmarks but feel free to reopen if you disagree

@WillAyd WillAyd closed this as completed Jul 6, 2018
@WillAyd WillAyd modified the milestones: Someday, Contributions Welcome, No action Jul 6, 2018