Skip to content

Fix randomly failing test in test_frame.py #9225

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jan 15, 2015
Merged

Conversation

qwhelan
Copy link
Contributor

@qwhelan qwhelan commented Jan 11, 2015

@jreback I noticed a randomly failing test while doing a full suite run a week or two ago. The test is using random integers as part of a MultiIndex, which randomly leads to a non-unique index. Attempting to re-index the DataFrame therefore randomly fails. The solution is to just sample without replacement.

The following code gives a failure rate of ~1.5%, which is in line with birthday problem estimates (d=1000 and four iterations each of n=3 and n=2):

import subprocess

total = 0
n = 10
for i in range(n):
    result = subprocess.call('nosetests pandas/tests/test_frame.py:TestDataFrame.test_reindex_level', shell=True)
    if result:
        print i
        total += 1
print 'Total error rate of {0:.6f}'.format(total/float(n))

@jreback
Copy link
Contributor

jreback commented Jan 11, 2015

I concur

cc @behzadnouri look right to you?

@jreback jreback added the Testing pandas testing functions or related to the test suite label Jan 11, 2015
@jreback jreback added this to the 0.16.0 milestone Jan 11, 2015
@behzadnouri
Copy link
Contributor

the whole resulting index (the combination of the first 3 columns) should be unique, not a single level. see test example below. (the probability of non-unique would then be lower)

I would recommend to change the PR to this: (git diff here)

while True:
    df = DataFrame({'jim':['mid'] * 5 + ['btm'] * 8 + ['top'] * 7,
                    'joe':['3rd'] * 2 + ['1st'] * 3 + ['2nd'] * 3 +
                          ['1st'] * 2 + ['3rd'] * 3 + ['1st'] * 2 +
                          ['3rd'] * 3 + ['2nd'] * 2,
                    'jolie':np.random.randint(0, 1000, 20),
                    'joline': np.random.randn(20).round(3) * 10})

    if not df.duplicated(subset=icol).any():
        break

for _ in range(2):
    for idx in permutations(df['jim'].unique()):
        for i in range(3):
            verify_first_level(df, 'jim', idx[:i+1])

    i = [2,3,4,0,1,8,9,5,6,7,10,11,12,13,14,18,19,15,16,17]
    verify(df, 'joe', ['1st', '2nd', '3rd'], i)

    i = [0,1,2,3,4,10,11,12,5,6,7,8,9,15,16,17,18,19,13,14]
    verify(df, 'joe', ['3rd', '2nd', '1st'], i)

    i = [0,1,5,6,7,10,11,12,18,19,15,16,17]
    verify(df, 'joe', ['2nd', '3rd'], i)

    i = [0,1,2,3,4,10,11,12,8,9,15,16,17,13,14]
    verify(df, 'joe', ['3rd', '1st'], i)

    # re-do the tests with deliberate duplicates @ last level
    df['jolie'] = np.arange(len(df)) % 3

this way it would be a more general test than forcing a single level to be all unique.


example that the tests pass as long as df.set_index(['jim', 'joe', 'jolie']).index.is_unique:

In [51]: df
Out[51]: 
    jim  joe  jolie  joline
0   mid  3rd      0    2.25
1   mid  3rd      1   -6.73
2   mid  1st      2  -16.37
3   mid  1st      0   -7.11
4   mid  1st      1    5.45
5   btm  2nd      2  -10.76
6   btm  2nd      0   -1.02
7   btm  2nd      1    6.10
8   btm  1st      2    3.21
9   btm  1st      0   19.98
10  btm  3rd      1   16.15
11  btm  3rd      2    8.90
12  btm  3rd      0    1.39
13  top  1st      1   -4.39
14  top  1st      2   14.37
15  top  3rd      0  -18.69
16  top  3rd      1  -22.98
17  top  3rd      2    2.39
18  top  2nd      0   -5.13
19  top  2nd      1  -12.82

In [52]: for idx in permutations(df['jim'].unique()):
   ....:         for i in range(3):
   ....:                 verify_first_level(df, 'jim', idx[:i+1])
   ....:                                                                 

In [53]: i = [2,3,4,0,1,8,9,5,6,7,10,11,12,13,14,18,19,15,16,17]

In [54]: verify(df, 'joe', ['1st', '2nd', '3rd'], i)

In [55]: i = [0,1,2,3,4,10,11,12,5,6,7,8,9,15,16,17,18,19,13,14]

In [56]: verify(df, 'joe', ['3rd', '2nd', '1st'], i)

In [57]: i = [0,1,5,6,7,10,11,12,18,19,15,16,17]

In [58]: verify(df, 'joe', ['2nd', '3rd'], i)

In [59]: i = [0,1,2,3,4,10,11,12,8,9,15,16,17,13,14]

In [60]: verify(df, 'joe', ['3rd', '1st'], i)

@qwhelan
Copy link
Contributor Author

qwhelan commented Jan 11, 2015

@behzadnouri I'd prefer to more explicitly fix this rather than just add a while True: block. Also, it appears that the deliberate duplicates case is already handled in the previous batch of checks in this test https://github.com/pydata/pandas/pull/9225/files#diff-e4d49868a05be9ee2167607d80eb699dL1918

My new version of the patch enforces the uniqueness constraint by just doing multiple np.random.choice samples of the appropriate size.

@shoyer
Copy link
Member

shoyer commented Jan 12, 2015

I agree with @qwhelan. Using while True: to avoid problematic random numbers for a unit test feels pretty hacky to me.

A simpler fix (probably worth doing in any case) would be do set the random number seed for the test. I usually recommend using np.random.RandomState objects instead of the global functions for precisely this reason. Note that numpy guarantees consistency in the random numbers produced for a particular seed as part of their API (even between versions).

@behzadnouri
Copy link
Contributor

@shoyer the point of randomized tests is to discover corner cases that one may not think of. other-wise one would just provide a hand-written test case. by pre-fixing the random seed you lose that advantage.

while True: in python, is the same as do { ... } while( condition ) in C, C++. current pandas source code uses while True: in many places.

@qwhelan
Copy link
Contributor Author

qwhelan commented Jan 12, 2015

@shoyer Normally I'd agree but we're only sampling 20 integers here - might as well hardcode that rather than set a seed.

@behzadnouri The issue is not the notation but the lack of clarity into the what the acceptable data state is when the block is exited (and not having an upper-bound on runtime). My current patch directly produces your desired data.

@jreback
Copy link
Contributor

jreback commented Jan 15, 2015

@qwhelan this looks good. ready to go?

@qwhelan
Copy link
Contributor Author

qwhelan commented Jan 15, 2015

@jreback Yep.

jreback added a commit that referenced this pull request Jan 15, 2015
Fix randomly failing test in test_frame.py
@jreback jreback merged commit 779fbaa into pandas-dev:master Jan 15, 2015
@jreback
Copy link
Contributor

jreback commented Jan 15, 2015

thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Testing pandas testing functions or related to the test suite
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants