-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Index.get_loc raising incorrect error, closes #29189 #29700
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice - thanks for giving this a look.
|
||
def test_groupby_duplicate_index(): | ||
# GH#29189 the groupby call here used to raise | ||
ser = pd.Series([2, 5, 6, 8], index=[2.0, 4.0, 4.0, 5.0]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we not already have a similar test elsewhere? Not a blocker for this PR but feels like we should have a parametrized or more broad-reaching test for this in groupby
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good question and I dont have a good answer. i'll take a look tomorrow; most likely parametrizing/de-duplicating will need a dedicated pass
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this affects the exceptions we get for other Index subclasses too, so I'll prioritize fleshing this out in the AM
right = values.searchsorted(val, side='right') | ||
except TypeError: | ||
# e.g. GH#29189 get_loc(None) with a Float64Index | ||
raise KeyError(val) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm what do you think about just catching the TypeError in groupby? Seems a little strange to catch and re-raise as a KeyError
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm what do you think about just catching the TypeError in groupby
My knee-jerk reaction is that this works against the direction we've been working on for a few weeks in groupby. But if there's a compelling case that we can't catch at a lower level, it definitely beats not-catching
Seems a little strange to catch and re-raise as a KeyError
I guess that depends on the official/desired signature/purpose of get_loc (the docstring doesnt say anything about what it raises), but I think the behavior below is probably not what we want:
ser = pd.Series([2, 5, 6, 8], index=[2.0, 4.0, 4.0, 5.0])
ser2 = ser.set_axis(ser.index.astype("int64"))
>>> ser[None] # <-- TypeError
>>> ser2[None] # <-- IndexError
# if we slice so as to not have duplicates...
>>> ser[::2][None] # <-- KeyError
>>> ser2[::2][None] # <-- KeyError
# if we slice so as to not be monotonic increasing...
>>> ser[::-1][None] # <-- KeyError
>>> ser2[::-1][None] # <-- KeyError
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not super familiar with those aspects of indexing but would have expected all to raise a TypeError when the type of the object being passed in is incompatible with the dtype and a KeyError when valid type but simply not present. probably more of a @jreback question
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, we specificially catch everything in .get_loc
, which to enable it to always return an indexer (might be -1).
Personally I like restricting the types of errors raised from our indexing codee, but based on other examples in pandas, I am also not fully sure the TypeError is "wrong" or the KeyError is "correct" (we are not being very consistent ..). >>> s = pd.Series(range(5), index=[1, 2, 3, 4, 4]) >>> s['a']
...
~/miniconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_value(self, series, key)
4721 k = self._convert_scalar_indexer(k, kind="getitem")
4722 try:
-> 4723 return self._engine.get_value(s, k, tz=getattr(series.dtype, "tz", None))
4724 except KeyError as e1:
4725 if len(self) > 0 and (self.holds_integer() or self.is_boolean()):
...
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine._get_loc_duplicates()
KeyError: 'a' >>> s[2.5]
...
4719 k = com.values_from_object(key)
4720
-> 4721 k = self._convert_scalar_indexer(k, kind="getitem")
4722 try:
4723 return self._engine.get_value(s, k, tz=getattr(series.dtype, "tz", None))
...
~/miniconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in _convert_scalar_indexer(self, key, kind)
3111 if kind in ["getitem", "ix"] and is_float(key):
3112 if not self.is_floating():
-> 3113 return self._invalid_indexer("label", key)
3114
3115 elif kind in ["loc"] and is_float(key):
...
TypeError: cannot do label indexing on <class 'pandas.core.indexes.numeric.Int64Index'> with these indexers [2.5] of <class 'float'> >>> s[None]
...
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine._get_loc_duplicates()
TypeError: '<' not supported between instances of 'NoneType' and 'NoneType'
During handling of the above exception, another exception occurred:
IndexError Traceback (most recent call last)
<ipython-input-22-32b37d43effc> in <module>
----> 1 s[None]
~/miniconda3/lib/python3.7/site-packages/pandas/core/series.py in __getitem__(self, key)
1062 key = com.apply_if_callable(key, self)
1063 try:
-> 1064 result = self.index.get_value(self, key)
1065
1066 if not is_scalar(result):
~/miniconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_value(self, series, key)
4741 # python 3
4742 if is_scalar(key): # pragma: no cover
-> 4743 raise IndexError(key)
4744 raise InvalidIndexError(key)
4745
IndexError: None For non-duplicated index or for RangeIndex, the case of |
@WillAyd @jreback @jorisvandenbossche thanks for weighing in. The overall "what should indexing methods raise" issue is a pretty big one and I'd like to segment the problem if possible. Can we achieve consensus on any of the following:
Anything else that can be added to this list? |
Hmm not sure about hashability making a difference. I was thinking more if we try to index say a But yes to points 2 and 3 - I wouldn't think either of those should have an impact on the Exception that gets raised |
@WillAyd for the purposes of 1. I'm not looking to say anything shouldn't raise |
2 and 3 for certain yes. 1 yes as well ideally. likely though we don't do this in many places and are translating a TypeError to -1 and returning. This would likely involved catching the TypeError at a higherlevel (and then raising KeyError which is ok). |
rebase & can you add a whatsnew note here for the bug fix |
@WillAyd any objection to moving forward with this as a bugfix and revisiting the larger get_loc raising behavior separately? |
No objections from me |
thanks |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
cc @WillAyd