-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
New Interval / IntervalIndex behavior spec #16386
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #16386 +/- ##
==========================================
- Coverage 91.57% 91.56% -0.02%
==========================================
Files 150 150
Lines 48913 48959 +46
==========================================
+ Hits 44794 44831 +37
- Misses 4119 4128 +9
Continue to review full report at Codecov.
|
I just took a quick look and this looks great! One comment: it would be great to convert your tests for Series indexing into unit tests on the IntervalIndex itself, which would let us avoids testing the unrelated indexing code. In most cases, the methods to test are |
@shoyer sorry, I might need a little more explanation. Where can I find those methods, i.e. where should I move my unit tests to? |
See here for the existing IntervalIndex tests: https://github.com/pandas-dev/pandas/blob/master/pandas/tests/indexes/test_interval.py#L329 |
@shoyer sorry, still lost -- don't know the internals of pandas too well I guess. What exactly do you mean by putting the unit tests on IntervalIndex itself rather than on the indexing code? I think the reason I'm confused is I don't know how |
No worries! The "bottom" methods are .get_loc(), .slice_locs() and
.get_indexer():
http://pandas.pydata.org/pandas-docs/stable/internals.html#indexing
…On Sun, May 21, 2017 at 10:49 AM, Alexander Lenail ***@***.*** > wrote:
@shoyer <https://github.com/shoyer> sorry, still lost -- don't know the
internals of pandas too well I guess.
What exactly do you mean by putting the unit tests on IntervalIndex itself
rather than on the indexing code? I think the reason I'm confused is I
don't know how .loc works -- does it call .get_loc? What is the "bottom"
method that we should be testing? I don't see loc in
pandas/pandas/core/indexes/interval.py ...
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#16386 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABKS1m8HMYPNxgY6g7-KoWYwYB1FCpGaks5r8E8TgaJpZM4Nftfz>
.
|
|
|
@zfrenchee
Yes.
All these methods are defined on
Yes, almost always. The notable exception is for cases where it performs integer indexing rather than label based indexing.
The tests in The tests in Both unit tests and integration tests are useful, but only unit tests need to comprehensively check every possible behavior (there are too many ways to combine things for integration tests). |
@shoyer, @jreback suggested adding the tests to |
@shoyer still waiting to hear about where these tests belong. Since I basically only added unit tests, I think it might make sense to move all the tests to indexes from indexing. Let me know what you think. |
@zfrenchee in your case, yes, I think almost all of your unit tests should be in |
@shoyer would you take a look? I added a new
Let me know what you think / where we should go from here |
To gain some intuition for how this should work, I recommend trying indexing a series with integer labels, e.g., I think Generally, I like the way this is coming together -- these test cases will be super helpful! See inline comments below. |
|
||
right = IntervalIndex.from_tuples([(0, 1), (2, 3)], closed='right') | ||
|
||
pytest.raises(KeyError, right.get_loc(-0.5) ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't work. You need to use one of these forms:
pytest.raises(KeyError, right.get_loc, -0.5)
or:
with pytest.raises(KeyError):
right.get_loc(-0.5)
The second version is usually easier to read because you can use normal syntax, but in this case (with one argument) there might be something to be said for fitting on one line.
|
||
pytest.raises(KeyError, right.get_loc(-0.5) ) | ||
pytest.raises(KeyError, right.get_loc(0) ) | ||
assert right.get_loc(0.5) == 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
per PEP8, we don't use spaces for alignment. Just a single space is preferred.
pytest.raises(KeyError, right.get_loc(Interval(0, 2, closed='both')) ) | ||
pytest.raises(KeyError, right.get_loc(Interval(2.5, 3, closed='both')) ) | ||
|
||
left = IntervalIndex.from_tuples([(0, 1), (2, 3)], closed='left') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is long enough for one test. Can you make this a second method, e.g., test_get_loc_vlaue_closed_left
?
|
||
idx = IntervalIndex.from_tuples([(0, 1), (2, 3), (1, 3)]) | ||
|
||
assert Interval(1, 3, closed='right').covers(idx) == np.array([1, 2]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you need to use np.testing.assert_array_equal
here instead of ==
(remember numpy arrays overload ==
to do element-wise comparison).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nope use
tm.assert_numpy_array_equal
we never directly use numpy comparison routines
the linter will fail if it sees np.testing fyi
assert Interval(1, 3, closed='both').covers(Interval(1, 3, closed='left')) | ||
assert Interval(1, 3, closed='both').covers(Interval(1, 3, closed='both')) | ||
|
||
idx = IntervalIndex.from_tuples([(0, 1), (2, 3), (1, 3)]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for clarity, add closed='left'
or closed='right'
here? (just so I don't need to remember the default
idx2 = IntervalIndex.from_tuples([(0, 1), (2, 3), (1, 3)], closed='left') | ||
idx3 = IntervalIndex.from_tuples([(0, 1), (2, 3), (1, 3)], closed='both') | ||
|
||
assert idx.covers(idx1) == (np.array([0,1,2,2]), np.array([0,1,1,2])) # note: I had to choose an arbitrary ordering. If this test fails, double check the test too... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean the order of the last two entries here? I'm actually OK with this -- let's say that returning indices in lexicographically-sorted order is part of the spec.
|
||
# slice of scalar | ||
with pytest.raises(NotImplementedError): | ||
s[0:4] ## not sure what the behvaior should be here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the rule for indexing with integers in []
is to fall back to positional indexing... but honestly I have no idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, slicing with integers in []
is always* positional, so I think should be here as well
*always = for all types of indexes expect ... ra ra .. floats!
But I am fine with raising a NotImplementedError to start with if this is not yet implemented correctly, but it seems this is already working on master:
In [33]: df = pd.DataFrame(['a', 'b', 'c'], columns=['col'],
...: index=pd.IntervalIndex.from_tuples([(11, 13), (16, 17), (14, 15)], closed='both'))
...:
In [34]: df
Out[34]:
col
[11, 13] a
[16, 17] b
[14, 15] c
In [35]: df[:2]
Out[35]:
col
[11, 13] a
[16, 17] b
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean expect or except?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
except :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So you're suggesting that df[:2]
and df[:2.0]
have very different behaviors? This might be a bad idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might indeed be a bad idea .. :-) But I think the following is probably the most consistent and safe given the current indexing mess/complexity:
df[:2]
does positional indexing (so similar todf.iloc[:2]
as it does for all other types of indexing (apart from float index, which I think we should regard as a bug)- disallow
df[:2.0]
? If you want label based slicing based on the interval ranges, you can usedf.loc[:2.0]
(although we still have to decide what the semantics of that should be)
Apart from the integers (first case), also slicing with exact Interval objects would be allowed.
@pytest.mark.xfail(reason="new indexing tests for issue 16316") | ||
def test_non_unique_updated_behavior(self): | ||
|
||
# Actually I think we should remove this test? Not sure what exactly it's meant to gauge... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm also not entire sure. I don't think this test needs to change though -- the existing test looks fine.
# this is confusing to me. I would have done: | ||
# expected = s | ||
# result = s.loc[Interval(1, 3):] | ||
# tm.assert_series_equal(expected, result) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test fails because pandas doesn't support slicing if there is a duplicate key.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does not seem to be fully true. As long as they are still sorted, slicing with duplicates seems to work (at least for this test case):
In [36]: s = pd.Series(range(3), index=[1,1,2])
In [37]: s.loc[1:]
Out[37]:
1 0
1 1
2 2
dtype: int64
# non-unique | ||
with pytest.raises(ValueError): | ||
s[[Interval(1, 3)]] | ||
# not sure why the behavior for [[]] is different than []... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It goes down a different code path, get_indexer
vs get_loc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This currently fails (both for [[]]
and .loc[[]]
), but shouldn't both just work? (in that case, I would make the test test the correct behaviour, it is marked as xfail anyhow)
good to know, thanks!
…On Thu, May 25, 2017 at 7:52 PM Jeff Reback ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In pandas/tests/indexes/test_interval.py
<#16386 (comment)>:
> +
+ assert Interval(1, 3, closed='right').covers(Interval(1, 3, closed='right'))
+ assert not Interval(1, 3, closed='right').covers(Interval(1, 3, closed='left'))
+ assert not Interval(1, 3, closed='right').covers(Interval(1, 3, closed='both'))
+
+ assert not Interval(1, 3, closed='left').covers(Interval(1, 3, closed='right'))
+ assert Interval(1, 3, closed='left').covers(Interval(1, 3, closed='left'))
+ assert not Interval(1, 3, closed='left').covers(Interval(1, 3, closed='both'))
+
+ assert Interval(1, 3, closed='both').covers(Interval(1, 3, closed='right'))
+ assert Interval(1, 3, closed='both').covers(Interval(1, 3, closed='left'))
+ assert Interval(1, 3, closed='both').covers(Interval(1, 3, closed='both'))
+
+ idx = IntervalIndex.from_tuples([(0, 1), (2, 3), (1, 3)])
+
+ assert Interval(1, 3, closed='right').covers(idx) == np.array([1, 2])
nope use
tm.assert_numpy_array_equal
we never directly use numpy comparison routines
the linter will fail if it sees np.testing fyi
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#16386 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABKS1khw7Ui5dtvxfyKz4ql1Tg80ezH8ks5r9j5_gaJpZM4Nftfz>
.
|
@shoyer slice locs is a funky method. It seems special-cased when the index is monotonically increasing or decreasing -- is that right? Thanks for your review by the way! I fixed some stuff up, I think it's mostly ready for another look. I'm particularly interested in hearing:
Finally, I think (but am not totally sure) I need to take a long look at |
@shoyer I've taken another quick pass, but I'm not entirely sure what to do for slicing. Would you take a quick look and answer the questions I posted above as well? |
Yes, correct.
At least in this case, I would still test these separately. More detailed comments to follow after I look at your changes in detail... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Next time, hold off on rebasing until the end. Otherwise rebasing breaks GitHub so I can't see what has changed since my last review :(
idx2 = IntervalIndex.from_tuples([(0, 1), (2, 3), (1, 3)], closed='left') | ||
idx3 = IntervalIndex.from_tuples([(0, 1), (2, 3), (1, 3)], closed='both') | ||
|
||
tm.assert_numpy_array_equal(idx.covers(idx1), (np.array([0,1,2,2]), np.array([0,1,1,2]))) # note: assert_numpy_array_equal is the wrong thing to call here, since we're testing for equality with a tuple of np arrays |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in that case, use tuple unpacking and two calls to tm.assert_numpy_array_equal
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shoyer or perhaps we should add a method to tm
, or just a test method to this file/class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably not :).
It would also be fine to cast the result to a numpy array, e.g., np.array(idx.covers(idx1))
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you actually want to write this comparison like
lidx, ridx = idx.covers(idx1)
tm.asssert_numpy_array_equal(lidx, ....)
tm.asssert_numpy_array_equal(ridx, ....)
or if that gets cumbersome, you can write a little helper
def compare(result, expected):
lidx, ridx = result
lidx_expected, ridx_expected = expected
tm.assert_numpy_array_equal(lidx, lidx_expected)
tm.assert_numpy_array_equal(ridx, ridx_expected)
and call like
compare(idx.covers, (np....., np.....))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks!
@@ -491,6 +642,206 @@ def testcontains(self): | |||
assert not i.contains(20) | |||
assert not i.contains(-20) | |||
|
|||
# Why are there two tests called test contains? this should be cleared up a little, no? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agreed
@pytest.mark.xfail(reason="new indexing tests for issue 16316") | ||
def test_get_indexer_subintervals_updated_behavior(self): | ||
|
||
# target = IntervalIndex.from_breaks(np.linspace(0, 2, 5)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not quite sure what the idea is with these commented out lines?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
generally looks ok. need to fix the xfails and then get passing.
from __future__ import division | ||
|
||
import pytest | ||
import numpy as np |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no literally add that with a nice xfail message.
I want this to skip this entire module for now, rather than showing a skip for each test. This just pollutes the xfail output (which is already getting annoying large).
from __future__ import division | ||
|
||
import pytest | ||
import numpy as np |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so if you add the xfail up top, then you can remove all of the xfails on individual functions.
|
||
from pandas import Series, IntervalIndex, Interval | ||
import pandas.util.testing as tm | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same deal here, skip the entire module for now and remove the xfails on individual tests.
Hello @zfrenchee! Thanks for updating the PR.
|
@jreback done. What next? |
ping when green |
Hello @zfrenchee! Thanks for updating the PR.
|
Hello @zfrenchee! Thanks for updating the PR.
|
you know that
checks pep FYI (this is from the top of the PR) |
@jreback my local pep isn't respecting the 80 char limit, when I autoformat the file, which is why I keep using this one to check. We have different versions / configs for pep seems like. |
Hello @zfrenchee! Thanks for updating the PR.
|
…e if this passes tests
Hello @zfrenchee! Thanks for updating the PR.
|
Hello @zfrenchee! Thanks for updating the PR.
|
move to separate sub-dirs
ok I pushed some fixes. ping on green. |
@jreback green |
thanks @zfrenchee took a while, but glad we got this in! next up......patches! |
git diff upstream/master --name-only -- '*.py' | flake8 --diff
@jreback