Skip to content

BUG: Index.get_loc raising incorrect error, closes #29189 #29700

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Nov 25, 2019

Conversation

jbrockmendel
Copy link
Member

cc @WillAyd

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice - thanks for giving this a look.


def test_groupby_duplicate_index():
# GH#29189 the groupby call here used to raise
ser = pd.Series([2, 5, 6, 8], index=[2.0, 4.0, 4.0, 5.0])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we not already have a similar test elsewhere? Not a blocker for this PR but feels like we should have a parametrized or more broad-reaching test for this in groupby

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good question and I dont have a good answer. i'll take a look tomorrow; most likely parametrizing/de-duplicating will need a dedicated pass

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this affects the exceptions we get for other Index subclasses too, so I'll prioritize fleshing this out in the AM

right = values.searchsorted(val, side='right')
except TypeError:
# e.g. GH#29189 get_loc(None) with a Float64Index
raise KeyError(val)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm what do you think about just catching the TypeError in groupby? Seems a little strange to catch and re-raise as a KeyError

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm what do you think about just catching the TypeError in groupby

My knee-jerk reaction is that this works against the direction we've been working on for a few weeks in groupby. But if there's a compelling case that we can't catch at a lower level, it definitely beats not-catching

Seems a little strange to catch and re-raise as a KeyError

I guess that depends on the official/desired signature/purpose of get_loc (the docstring doesnt say anything about what it raises), but I think the behavior below is probably not what we want:

ser = pd.Series([2, 5, 6, 8], index=[2.0, 4.0, 4.0, 5.0])
ser2 = ser.set_axis(ser.index.astype("int64"))

>>> ser[None]   # <-- TypeError
>>> ser2[None]   # <-- IndexError

# if we slice so as to not have duplicates...
>>> ser[::2][None]  # <-- KeyError
>>> ser2[::2][None]  # <-- KeyError

# if we slice so as to not be monotonic increasing...
>>> ser[::-1][None]  # <-- KeyError
>>> ser2[::-1][None]  # <-- KeyError

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not super familiar with those aspects of indexing but would have expected all to raise a TypeError when the type of the object being passed in is incompatible with the dtype and a KeyError when valid type but simply not present. probably more of a @jreback question

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, we specificially catch everything in .get_loc, which to enable it to always return an indexer (might be -1).

@jorisvandenbossche
Copy link
Member

Hmm what do you think about just catching the TypeError in groupby? Seems a little strange to catch and re-raise as a KeyError

Personally I like restricting the types of errors raised from our indexing codee, but based on other examples in pandas, I am also not fully sure the TypeError is "wrong" or the KeyError is "correct" (we are not being very consistent ..).
For example, here are three different errors for three different "mistyped" values (values that cannot be in the index based on their type):

>>> s = pd.Series(range(5), index=[1, 2, 3, 4, 4])  
>>> s['a']   
...
~/miniconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_value(self, series, key)
   4721         k = self._convert_scalar_indexer(k, kind="getitem")
   4722         try:
-> 4723             return self._engine.get_value(s, k, tz=getattr(series.dtype, "tz", None))
   4724         except KeyError as e1:
   4725             if len(self) > 0 and (self.holds_integer() or self.is_boolean()):
...
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine._get_loc_duplicates()

KeyError: 'a'
>>> s[2.5]
...
   4719         k = com.values_from_object(key)
   4720 
-> 4721         k = self._convert_scalar_indexer(k, kind="getitem")
   4722         try:
   4723             return self._engine.get_value(s, k, tz=getattr(series.dtype, "tz", None))
...
~/miniconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in _convert_scalar_indexer(self, key, kind)
   3111             if kind in ["getitem", "ix"] and is_float(key):
   3112                 if not self.is_floating():
-> 3113                     return self._invalid_indexer("label", key)
   3114 
   3115             elif kind in ["loc"] and is_float(key):
...
TypeError: cannot do label indexing on <class 'pandas.core.indexes.numeric.Int64Index'> with these indexers [2.5] of <class 'float'>
>>> s[None]
...
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine._get_loc_duplicates()

TypeError: '<' not supported between instances of 'NoneType' and 'NoneType'

During handling of the above exception, another exception occurred:

IndexError                                Traceback (most recent call last)
<ipython-input-22-32b37d43effc> in <module>
----> 1 s[None]

~/miniconda3/lib/python3.7/site-packages/pandas/core/series.py in __getitem__(self, key)
   1062         key = com.apply_if_callable(key, self)
   1063         try:
-> 1064             result = self.index.get_value(self, key)
   1065 
   1066             if not is_scalar(result):

~/miniconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_value(self, series, key)
   4741             # python 3
   4742             if is_scalar(key):  # pragma: no cover
-> 4743                 raise IndexError(key)
   4744             raise InvalidIndexError(key)
   4745 

IndexError: None

For non-duplicated index or for RangeIndex, the case of s[None] also gives KeyError, so in that regard (consistency with how None is handled in other index types), the KeyError seems to be a better option.

@jbrockmendel
Copy link
Member Author

@WillAyd @jreback @jorisvandenbossche thanks for weighing in.

The overall "what should indexing methods raise" issue is a pretty big one and I'd like to segment the problem if possible. Can we achieve consensus on any of the following:

  1. get_loc should raise TypeError if passed a non-hashable
  2. The exception raised by get_loc should not depend on whether the index is unique
  3. The exception raised by get_loc should not depend on whether the index is monotonic

Anything else that can be added to this list?

@WillAyd
Copy link
Member

WillAyd commented Nov 19, 2019

  1. get_loc should raise TypeError if passed a non-hashable

Hmm not sure about hashability making a difference. I was thinking more if we try to index say a IntIndex with a string that it should raise a TypeError since the IntIndex would never hold a string. But I think Jeff is saying that this should return -1. Not an expert on usage here so I'll defer

But yes to points 2 and 3 - I wouldn't think either of those should have an impact on the Exception that gets raised

@jbrockmendel
Copy link
Member Author

@WillAyd for the purposes of 1. I'm not looking to say anything shouldn't raise TypeError, just that non-hashables should (and currently do; its the first thing checked by each of the get_loc implementations in index.pyx)

@jreback
Copy link
Contributor

jreback commented Nov 20, 2019

@WillAyd @jreback @jorisvandenbossche thanks for weighing in.

The overall "what should indexing methods raise" issue is a pretty big one and I'd like to segment the problem if possible. Can we achieve consensus on any of the following:

  1. get_loc should raise TypeError if passed a non-hashable
  2. The exception raised by get_loc should not depend on whether the index is unique
  3. The exception raised by get_loc should not depend on whether the index is monotonic

Anything else that can be added to this list?

2 and 3 for certain yes.

1 yes as well ideally. likely though we don't do this in many places and are translating a TypeError to -1 and returning. This would likely involved catching the TypeError at a higherlevel (and then raising KeyError which is ok).

@jreback jreback added this to the 1.0 milestone Nov 21, 2019
@jreback
Copy link
Contributor

jreback commented Nov 21, 2019

rebase & can you add a whatsnew note here for the bug fix

@jbrockmendel
Copy link
Member Author

@WillAyd any objection to moving forward with this as a bugfix and revisiting the larger get_loc raising behavior separately?

@WillAyd
Copy link
Member

WillAyd commented Nov 21, 2019

No objections from me

@jreback jreback added the Error Reporting Incorrect or improved errors from pandas label Nov 25, 2019
@jreback jreback merged commit 7eb0db3 into pandas-dev:master Nov 25, 2019
@jreback
Copy link
Contributor

jreback commented Nov 25, 2019

thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Error Reporting Incorrect or improved errors from pandas Groupby
Projects
None yet
Development

Successfully merging this pull request may close these issues.

groupby error when grouping by float index with non-unique values
4 participants