Fix randomly failing test in test_frame.py #9225

qwhelan · 2015-01-11T09:44:43Z

@jreback I noticed a randomly failing test while doing a full suite run a week or two ago. The test is using random integers as part of a MultiIndex, which randomly leads to a non-unique index. Attempting to re-index the DataFrame therefore randomly fails. The solution is to just sample without replacement.

The following code gives a failure rate of ~1.5%, which is in line with birthday problem estimates (d=1000 and four iterations each of n=3 and n=2):

import subprocess

total = 0
n = 10
for i in range(n):
    result = subprocess.call('nosetests pandas/tests/test_frame.py:TestDataFrame.test_reindex_level', shell=True)
    if result:
        print i
        total += 1
print 'Total error rate of {0:.6f}'.format(total/float(n))

jreback · 2015-01-11T14:10:14Z

I concur

cc @behzadnouri look right to you?

behzadnouri · 2015-01-11T16:08:14Z

the whole resulting index (the combination of the first 3 columns) should be unique, not a single level. see test example below. (the probability of non-unique would then be lower)

I would recommend to change the PR to this: (git diff here)

while True:
    df = DataFrame({'jim':['mid'] * 5 + ['btm'] * 8 + ['top'] * 7,
                    'joe':['3rd'] * 2 + ['1st'] * 3 + ['2nd'] * 3 +
                          ['1st'] * 2 + ['3rd'] * 3 + ['1st'] * 2 +
                          ['3rd'] * 3 + ['2nd'] * 2,
                    'jolie':np.random.randint(0, 1000, 20),
                    'joline': np.random.randn(20).round(3) * 10})

    if not df.duplicated(subset=icol).any():
        break

for _ in range(2):
    for idx in permutations(df['jim'].unique()):
        for i in range(3):
            verify_first_level(df, 'jim', idx[:i+1])

    i = [2,3,4,0,1,8,9,5,6,7,10,11,12,13,14,18,19,15,16,17]
    verify(df, 'joe', ['1st', '2nd', '3rd'], i)

    i = [0,1,2,3,4,10,11,12,5,6,7,8,9,15,16,17,18,19,13,14]
    verify(df, 'joe', ['3rd', '2nd', '1st'], i)

    i = [0,1,5,6,7,10,11,12,18,19,15,16,17]
    verify(df, 'joe', ['2nd', '3rd'], i)

    i = [0,1,2,3,4,10,11,12,8,9,15,16,17,13,14]
    verify(df, 'joe', ['3rd', '1st'], i)

    # re-do the tests with deliberate duplicates @ last level
    df['jolie'] = np.arange(len(df)) % 3

this way it would be a more general test than forcing a single level to be all unique.

example that the tests pass as long as df.set_index(['jim', 'joe', 'jolie']).index.is_unique:

In [51]: df
Out[51]: 
    jim  joe  jolie  joline
0   mid  3rd      0    2.25
1   mid  3rd      1   -6.73
2   mid  1st      2  -16.37
3   mid  1st      0   -7.11
4   mid  1st      1    5.45
5   btm  2nd      2  -10.76
6   btm  2nd      0   -1.02
7   btm  2nd      1    6.10
8   btm  1st      2    3.21
9   btm  1st      0   19.98
10  btm  3rd      1   16.15
11  btm  3rd      2    8.90
12  btm  3rd      0    1.39
13  top  1st      1   -4.39
14  top  1st      2   14.37
15  top  3rd      0  -18.69
16  top  3rd      1  -22.98
17  top  3rd      2    2.39
18  top  2nd      0   -5.13
19  top  2nd      1  -12.82

In [52]: for idx in permutations(df['jim'].unique()):
   ....:         for i in range(3):
   ....:                 verify_first_level(df, 'jim', idx[:i+1])
   ....:                                                                 

In [53]: i = [2,3,4,0,1,8,9,5,6,7,10,11,12,13,14,18,19,15,16,17]

In [54]: verify(df, 'joe', ['1st', '2nd', '3rd'], i)

In [55]: i = [0,1,2,3,4,10,11,12,5,6,7,8,9,15,16,17,18,19,13,14]

In [56]: verify(df, 'joe', ['3rd', '2nd', '1st'], i)

In [57]: i = [0,1,5,6,7,10,11,12,18,19,15,16,17]

In [58]: verify(df, 'joe', ['2nd', '3rd'], i)

In [59]: i = [0,1,2,3,4,10,11,12,8,9,15,16,17,13,14]

In [60]: verify(df, 'joe', ['3rd', '1st'], i)

qwhelan · 2015-01-11T19:40:05Z

@behzadnouri I'd prefer to more explicitly fix this rather than just add a while True: block. Also, it appears that the deliberate duplicates case is already handled in the previous batch of checks in this test https://github.com/pydata/pandas/pull/9225/files#diff-e4d49868a05be9ee2167607d80eb699dL1918

My new version of the patch enforces the uniqueness constraint by just doing multiple np.random.choice samples of the appropriate size.

shoyer · 2015-01-12T02:04:54Z

I agree with @qwhelan. Using while True: to avoid problematic random numbers for a unit test feels pretty hacky to me.

A simpler fix (probably worth doing in any case) would be do set the random number seed for the test. I usually recommend using np.random.RandomState objects instead of the global functions for precisely this reason. Note that numpy guarantees consistency in the random numbers produced for a particular seed as part of their API (even between versions).

behzadnouri · 2015-01-12T02:58:25Z

@shoyer the point of randomized tests is to discover corner cases that one may not think of. other-wise one would just provide a hand-written test case. by pre-fixing the random seed you lose that advantage.

while True: in python, is the same as do { ... } while( condition ) in C, C++. current pandas source code uses while True: in many places.

qwhelan · 2015-01-12T03:13:20Z

@shoyer Normally I'd agree but we're only sampling 20 integers here - might as well hardcode that rather than set a seed.

@behzadnouri The issue is not the notation but the lack of clarity into the what the acceptable data state is when the block is exited (and not having an upper-bound on runtime). My current patch directly produces your desired data.

jreback · 2015-01-15T03:00:54Z

@qwhelan this looks good. ready to go?

qwhelan · 2015-01-15T03:19:31Z

@jreback Yep.

Fix randomly failing test in test_frame.py

jreback · 2015-01-15T03:25:21Z

thanks!

jreback added the Testing pandas testing functions or related to the test suite label Jan 11, 2015

jreback added this to the 0.16.0 milestone Jan 11, 2015

Fix randomly failing test in test_frame.py

b6564dc

qwhelan force-pushed the master branch from 0ea802c to b6564dc Compare January 11, 2015 19:36

jreback added a commit that referenced this pull request Jan 15, 2015

Merge pull request #9225 from qwhelan/master

779fbaa

Fix randomly failing test in test_frame.py

jreback merged commit 779fbaa into pandas-dev:master Jan 15, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix randomly failing test in test_frame.py #9225

Fix randomly failing test in test_frame.py #9225

qwhelan commented Jan 11, 2015

jreback commented Jan 11, 2015

behzadnouri commented Jan 11, 2015

qwhelan commented Jan 11, 2015

shoyer commented Jan 12, 2015

behzadnouri commented Jan 12, 2015

qwhelan commented Jan 12, 2015

jreback commented Jan 15, 2015

qwhelan commented Jan 15, 2015

jreback commented Jan 15, 2015

Fix randomly failing test in test_frame.py #9225

Fix randomly failing test in test_frame.py #9225

Conversation

qwhelan commented Jan 11, 2015

jreback commented Jan 11, 2015

behzadnouri commented Jan 11, 2015

qwhelan commented Jan 11, 2015

shoyer commented Jan 12, 2015

behzadnouri commented Jan 12, 2015

qwhelan commented Jan 12, 2015

jreback commented Jan 15, 2015

qwhelan commented Jan 15, 2015

jreback commented Jan 15, 2015