Skip to content

TST: Test sorting levels not aligned with index (#25775) #26492

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jun 5, 2019

Conversation

mahepe
Copy link
Contributor

@mahepe mahepe commented May 22, 2019

With master at 6d2398a, issue #25775 seems resolved. Added validation tests.

@gfyoung gfyoung added Indexing Related to indexing on series/frames, not to indexes themselves Testing pandas testing functions or related to the test suite labels May 22, 2019
Copy link
Member

@jschendel jschendel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! A couple comments on simplifying the implementation.

@pytest.mark.parametrize('seed', list(range(100)))
def test_sort_multi_index(self, seed):
# GH 25775
pd.np.random.seed(seed)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using a single deterministic example is probably easier to read and maintain.

I've found that the following example reproduces the issue on 0.23.0:

In [1]: import pandas as pd; pd.__version__
Out[1]: '0.23.0'

In [2]: df = pd.DataFrame({'a': [0.4, 0.1, 0.3, 0.2], 'b': [0, 1, 1, 1], 'c': range(4), 'd': list('abcd')})

In [3]: df.set_index(list('abc')).sort_index(level=list('ba'))
Out[3]:
         d
a   b c
0.4 0 0  a
0.1 1 1  b
0.3 1 2  c
0.2 1 3  d

And works on master:

In [1]: import pandas as pd; pd.__version__
Out[1]: '0.25.0.dev0+596.g20d0ad159a'

In [2]: df = pd.DataFrame({'a': [0.4, 0.1, 0.3, 0.2], 'b': [0, 1, 1, 1], 'c': range(4), 'd': list('abcd')})

In [3]: df.set_index(list('abc')).sort_index(level=list('ba'))
Out[3]:
         d
a   b c
0.4 0 0  a
0.1 1 1  b
0.2 1 3  d
0.3 1 2  c

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok! I updated to a single example. I'm using the one given in #25775.

def test_sort_multi_index(self, seed):
# GH 25775
pd.np.random.seed(seed)
df1 = (pd.DataFrame(pd.np.random.rand(10, 5),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DataFrame is already imported, so you don't need the pd. prefix.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, updated

@codecov
Copy link

codecov bot commented May 22, 2019

Codecov Report

Merging #26492 into master will decrease coverage by <.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #26492      +/-   ##
==========================================
- Coverage   91.88%   91.87%   -0.01%     
==========================================
  Files         174      174              
  Lines       50692    50692              
==========================================
- Hits        46576    46571       -5     
- Misses       4116     4121       +5
Flag Coverage Δ
#multiple 90.41% <ø> (ø) ⬆️
#single 41.78% <ø> (-0.11%) ⬇️
Impacted Files Coverage Δ
pandas/io/gbq.py 78.94% <0%> (-10.53%) ⬇️
pandas/core/frame.py 97% <0%> (-0.12%) ⬇️
pandas/util/testing.py 90.81% <0%> (-0.11%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8d124ea...5eb692b. Read the comment docs.

@mahepe mahepe force-pushed the validation-test-for-sorting branch from 7efc4b8 to 58d8782 Compare May 25, 2019 06:16
@pep8speaks
Copy link

pep8speaks commented May 25, 2019

Hello @mahepe! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-06-03 18:32:17 UTC

@mahepe mahepe force-pushed the validation-test-for-sorting branch 4 times, most recently from a89216a to 2a3bf94 Compare May 26, 2019 12:37
@@ -227,6 +228,30 @@ def test_stable_descending_multicolumn_sort(self):
kind='mergesort')
assert_frame_equal(sorted_df, expected)

def test_sort_multi_index(self):
# GH 25775
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i am not clear by this example what you are even trying to test here. can you add some comments on the test and make the example a little more fluid (maybe add some comments).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated to the simple example by the other reviewer

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a comment here on what this is testing, example is nice

'5,0.639,1,0.944,0.521,0.414' + \
'6,0.264,7,0.456,0.568,0.018' + \
'7,0.617,6,0.616,0.943,0.681' + \
'8,0.359,4,0.697,0.060,0.666' + \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a reason we are using a fixed string here rather than a generated frame with a seed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other reviewer requested "a single deterministic example" in contrast to iterating through a range of seeds. I wasn't sure what they exactly meant by "deterministic" i.e. whether fixing a seed is good enough or should the dataframe be explicitly given

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I meant a single explicit DataFrame that minimally reproduces the error. Setting a random seed and the generating random floats had nothing to do with the actual error in question, so IMO it made it less obvious as to what's going on. It also resulted in there be a lot more data than necessary to reproduce the error, which makes it harder to follow what's being tested.

So basically something like:

df = DataFrame({'a': [3, 1, 2], 'b': [0, 0, 0],
                'c': [0, 1, 2], 'd': list('abc')})
result = df.set_index(list('abc')).sort_index(level=list('ba'))

expected = DataFrame({'a': [1, 2, 3], 'b': [0, 0, 0],
                      'c': [1, 2, 0], 'd': list('bca')})
expected = expected.set_index(list('abc'))

tm.assert_frame_equal(result, expected)

Note that df as defined above does indeed not sort correctly on 0.23.4:

In [1]: import numpy as np; import pandas as pd; pd.__version__
Out[1]: '0.23.4'

In [2]: df = pd.DataFrame({'a' : [3, 1, 2], 'b': [0, 0, 0,], 'c': [0, 1, 2], 'd': list('abc')})

In [3]: df.set_index(list('abc')).sort_index(level=list('ba'))
Out[3]: 
       d
a b c   
3 0 0  a
1 0 1  b
2 0 2  c

I suppose you could parametrize over level if you wanted to, as adding 'c' still results in invalid sorting:

In [4]: df.set_index(list('abc')).sort_index(level=list('bac'))
Out[4]: 
       d
a b c   
3 0 0  a
1 0 1  b
2 0 2  c

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jschendel, switched to your simpler example.

'9,0.670,2,0.128,0.315,0.363'
data = StringIO(s)
df1 = (pd.read_csv(data)
.assign(b=lambda df: (df.b * 10).astype(int)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these df1, df2 are not clear at all can you rename to something more descriptive.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated to the simple example by the other reviewer

@mahepe mahepe force-pushed the validation-test-for-sorting branch from 2a3bf94 to 1763b74 Compare May 29, 2019 04:47
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls merge master and ping on green.

@@ -227,6 +228,30 @@ def test_stable_descending_multicolumn_sort(self):
kind='mergesort')
assert_frame_equal(sorted_df, expected)

def test_sort_multi_index(self):
# GH 25775
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a comment here on what this is testing, example is nice

@mahepe mahepe force-pushed the validation-test-for-sorting branch from 1763b74 to 5eb692b Compare June 3, 2019 18:32
@mahepe
Copy link
Contributor Author

mahepe commented Jun 4, 2019

all green @jreback

@jreback jreback added this to the 0.25.0 milestone Jun 5, 2019
@jreback jreback merged commit 8ef9a63 into pandas-dev:master Jun 5, 2019
@jreback
Copy link
Contributor

jreback commented Jun 5, 2019

thanks @mahepe

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Indexing Related to indexing on series/frames, not to indexes themselves Testing pandas testing functions or related to the test suite
Projects
None yet
Development

Successfully merging this pull request may close these issues.

sort_index does not work with levels not aligned with index
5 participants