BUG: Index.get_loc raising incorrect error, closes #29189 #29700

jbrockmendel · 2019-11-18T23:54:09Z

closes groupby error when grouping by float index with non-unique values #29189
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

WillAyd

Nice - thanks for giving this a look.

WillAyd · 2019-11-19T03:50:46Z

pandas/tests/groupby/test_groupby.py

+
+def test_groupby_duplicate_index():
+    # GH#29189 the groupby call here used to raise
+    ser = pd.Series([2, 5, 6, 8], index=[2.0, 4.0, 4.0, 5.0])


Do we not already have a similar test elsewhere? Not a blocker for this PR but feels like we should have a parametrized or more broad-reaching test for this in groupby

That's a good question and I don't have a good answer. I'll take a look tomorrow; most likely parametrizing/de-duplicating will need a dedicated pass

It looks like this affects the exceptions we get for other Index subclasses too, so I'll prioritize fleshing this out in the AM
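
For reference, a minimal sketch of what a complete regression check along these lines could look like. The diff excerpt above shows only the first lines of the test, so the `groupby(level=0)` call, the `mean()` aggregation, the expected values, and the function name below are assumptions for illustration, not the PR's actual test body.

```python
import pandas as pd


def check_groupby_duplicate_float_index():
    # Hypothetical check; only the Series construction is taken from the diff.
    ser = pd.Series([2, 5, 6, 8], index=[2.0, 4.0, 4.0, 5.0])

    # Group by the index itself; the label 4.0 appears twice.
    result = ser.groupby(level=0).mean()

    # Assumed expectation: one row per distinct label, duplicates averaged.
    expected = pd.Series([2.0, 5.5, 8.0], index=[2.0, 4.0, 5.0])
    pd.testing.assert_series_equal(result, expected)
    return result


result = check_groupby_duplicate_float_index()
```

A parametrized version over several index dtypes, as suggested in this thread, would presumably wrap the same body with `pytest.mark.parametrize`.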

WillAyd · 2019-11-19T03:52:35Z

pandas/_libs/index.pyx

+                right = values.searchsorted(val, side='right')
+            except TypeError:
+                # e.g. GH#29189 get_loc(None) with a Float64Index
+                raise KeyError(val)


Hmm what do you think about just catching the TypeError in groupby? Seems a little strange to catch and re-raise as a KeyError

Hmm what do you think about just catching the TypeError in groupby

My knee-jerk reaction is that this works against the direction we've been working on for a few weeks in groupby. But if there's a compelling case that we can't catch at a lower level, it definitely beats not-catching

Seems a little strange to catch and re-raise as a KeyError

I guess that depends on the official/desired signature/purpose of get_loc (the docstring doesn't say anything about what it raises), but I think the behavior below is probably not what we want:

ser = pd.Series([2, 5, 6, 8], index=[2.0, 4.0, 4.0, 5.0])
ser2 = ser.set_axis(ser.index.astype("int64"))

>>> ser[None]   # <-- TypeError
>>> ser2[None]  # <-- IndexError

# if we slice so as to not have duplicates...
>>> ser[::2][None]   # <-- KeyError
>>> ser2[::2][None]  # <-- KeyError

# if we slice so as to not be monotonic increasing...
>>> ser[::-1][None]   # <-- KeyError
>>> ser2[::-1][None]  # <-- KeyError

I'm not super familiar with those aspects of indexing, but I would have expected all of these to raise a TypeError when the type of the object being passed in is incompatible with the dtype, and a KeyError when the type is valid but the key is simply not present. Probably more of a @jreback question

No, we specifically catch everything in .get_loc to enable it to always return an indexer (which might be -1).
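
The catch-and-re-raise pattern debated in this thread can be illustrated without Cython. A minimal pure-Python sketch, assuming a sorted list in place of the engine's values array and the standard-library `bisect` module in place of `searchsorted`; the function name `get_loc_sorted` is hypothetical, and this is not the code in `index.pyx`:

```python
import bisect


def get_loc_sorted(values, key):
    """Locate key in a sorted list that may contain duplicates."""
    try:
        left = bisect.bisect_left(values, key)
        right = bisect.bisect_right(values, key)
    except TypeError as err:
        # The key cannot be ordered against the values (e.g. None vs. floats),
        # so it cannot be present: report it as a missing key.
        raise KeyError(key) from err

    if left == right:
        # Comparable, but not found.
        raise KeyError(key)
    if right - left == 1:
        return left
    return slice(left, right)


values = [2.0, 4.0, 4.0, 5.0]
loc_dup = get_loc_sorted(values, 4.0)  # slice covering both 4.0 entries
loc_one = get_loc_sorted(values, 5.0)  # single integer position
```

With this shape, callers such as a groupby routine only need to handle `KeyError`, whether the key has the wrong type or is merely absent.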

jorisvandenbossche · 2019-11-19T13:25:22Z

Hmm what do you think about just catching the TypeError in groupby? Seems a little strange to catch and re-raise as a KeyError

Personally I like restricting the types of errors raised from our indexing code, but based on other examples in pandas, I am also not fully sure the TypeError is "wrong" or the KeyError is "correct" (we are not being very consistent ..).
For example, here are three different errors for three different "mistyped" values (values that cannot be in the index based on their type):

>>> s = pd.Series(range(5), index=[1, 2, 3, 4, 4])

>>> s['a']   
...
~/miniconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_value(self, series, key)
   4721         k = self._convert_scalar_indexer(k, kind="getitem")
   4722         try:
-> 4723             return self._engine.get_value(s, k, tz=getattr(series.dtype, "tz", None))
   4724         except KeyError as e1:
   4725             if len(self) > 0 and (self.holds_integer() or self.is_boolean()):
...
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine._get_loc_duplicates()

KeyError: 'a'

>>> s[2.5]
...
   4719         k = com.values_from_object(key)
   4720 
-> 4721         k = self._convert_scalar_indexer(k, kind="getitem")
   4722         try:
   4723             return self._engine.get_value(s, k, tz=getattr(series.dtype, "tz", None))
...
~/miniconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in _convert_scalar_indexer(self, key, kind)
   3111             if kind in ["getitem", "ix"] and is_float(key):
   3112                 if not self.is_floating():
-> 3113                     return self._invalid_indexer("label", key)
   3114 
   3115             elif kind in ["loc"] and is_float(key):
...
TypeError: cannot do label indexing on <class 'pandas.core.indexes.numeric.Int64Index'> with these indexers [2.5] of <class 'float'>

>>> s[None]
...
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine._get_loc_duplicates()

TypeError: '<' not supported between instances of 'NoneType' and 'NoneType'

During handling of the above exception, another exception occurred:

IndexError                                Traceback (most recent call last)
<ipython-input-22-32b37d43effc> in <module>
----> 1 s[None]

~/miniconda3/lib/python3.7/site-packages/pandas/core/series.py in __getitem__(self, key)
   1062         key = com.apply_if_callable(key, self)
   1063         try:
-> 1064             result = self.index.get_value(self, key)
   1065 
   1066             if not is_scalar(result):

~/miniconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_value(self, series, key)
   4741             # python 3
   4742             if is_scalar(key):  # pragma: no cover
-> 4743                 raise IndexError(key)
   4744             raise InvalidIndexError(key)
   4745 

IndexError: None

For non-duplicated index or for RangeIndex, the case of s[None] also gives KeyError, so in that regard (consistency with how None is handled in other index types), the KeyError seems to be a better option.
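
One way to keep track of inconsistencies like these is to tabulate which exception type each lookup raises. A small sketch; `raised_type` is a hypothetical helper, and it is demonstrated on built-in containers so that nothing here depends on a particular pandas version:

```python
def raised_type(container, key):
    """Return the name of the exception raised by container[key], or None."""
    try:
        container[key]
    except Exception as err:
        return type(err).__name__
    return None


mapping = {1: "x", 2: "y"}
sequence = ["x", "y"]

# Map a description of each lookup to the exception it produced.
report = {
    "dict['a']": raised_type(mapping, "a"),
    "list['a']": raised_type(sequence, "a"),
    "list[5]": raised_type(sequence, 5),
    "dict[1]": raised_type(mapping, 1),
}
```

The same helper could be called as `raised_type(s, 'a')`, `raised_type(s, 2.5)`, and `raised_type(s, None)` against the Series above; the names it returns will depend on the pandas version in use.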

jbrockmendel · 2019-11-19T16:24:22Z

@WillAyd @jreback @jorisvandenbossche thanks for weighing in.

The overall "what should indexing methods raise" issue is a pretty big one and I'd like to segment the problem if possible. Can we achieve consensus on any of the following:

1. get_loc should raise TypeError if passed a non-hashable
2. The exception raised by get_loc should not depend on whether the index is unique
3. The exception raised by get_loc should not depend on whether the index is monotonic

Anything else that can be added to this list?

WillAyd · 2019-11-19T18:00:05Z

get_loc should raise TypeError if passed a non-hashable

Hmm, not sure about hashability making a difference. I was thinking more that if we try to index, say, an IntIndex with a string, it should raise a TypeError, since the IntIndex would never hold a string. But I think Jeff is saying that this should return -1. Not an expert on usage here so I'll defer

But yes to points 2 and 3 - I wouldn't think either of those should have an impact on the Exception that gets raised

jbrockmendel · 2019-11-19T18:19:59Z

@WillAyd for the purposes of 1. I'm not looking to say anything shouldn't raise TypeError, just that non-hashables should (and currently do; it's the first thing checked by each of the get_loc implementations in index.pyx)
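
The ordering described here (hashability check first), together with the two points that drew agreement, can be sketched as a toy lookup. This is an illustration only: `toy_get_loc` is a hypothetical function over a plain list, not any of the `get_loc` implementations in `index.pyx`.

```python
def toy_get_loc(labels, key):
    """Toy lookup whose exception type depends only on the key."""
    # Step 1: reject non-hashable keys up front with TypeError.
    try:
        hash(key)
    except TypeError:
        raise TypeError(f"{key!r} is an invalid key") from None

    # Step 2: a hashable key that is absent raises KeyError, whether or not
    # the labels are unique or sorted.
    positions = [i for i, label in enumerate(labels) if label == key]
    if not positions:
        raise KeyError(key)
    return positions[0] if len(positions) == 1 else positions


unique_sorted = [2.0, 4.0, 5.0]
duplicated = [2.0, 4.0, 4.0, 5.0]
reversed_dup = duplicated[::-1]

found = toy_get_loc(duplicated, 4.0)  # positions of both 4.0 entries
```

A linear scan is used purely to keep the sketch short; the point is the exception contract, not the lookup strategy.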

jreback · 2019-11-20T12:40:26Z

2 and 3 for certain yes.

1 yes as well, ideally. Likely, though, we don't do this in many places and are translating a TypeError to -1 and returning. This would likely involve catching the TypeError at a higher level (and then raising KeyError, which is ok).

jreback · 2019-11-21T13:09:30Z

rebase & can you add a whatsnew note here for the bug fix

jbrockmendel · 2019-11-21T16:07:28Z

@WillAyd any objection to moving forward with this as a bugfix and revisiting the larger get_loc raising behavior separately?

WillAyd · 2019-11-21T16:08:20Z

No objections from me

jreback · 2019-11-25T23:45:30Z

thanks

jbrockmendel added 2 commits November 18, 2019 15:53

BUG: Index.get_loc raising incorrect error, closes pandas-dev#29189

e6ea4da

blackify

f3e6147

WillAyd reviewed Nov 19, 2019

WillAyd added the Groupby label Nov 19, 2019

Merge branch 'master' of https://github.com/pandas-dev/pandas into fi…

d4eb26d

…x29189

jreback added this to the 1.0 milestone Nov 21, 2019

Merge branch 'master' of https://github.com/pandas-dev/pandas into fi…

a57f197

…x29189

whatsnew

edcf818

jreback added the Error Reporting Incorrect or improved errors from pandas label Nov 25, 2019

jreback merged commit 7eb0db3 into pandas-dev:master Nov 25, 2019

jbrockmendel deleted the fix29189 branch November 25, 2019 23:49

proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019

BUG: Index.get_loc raising incorrect error, closes pandas-dev#29189 (p…

2722872

…andas-dev#29700)

proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019

BUG: Index.get_loc raising incorrect error, closes pandas-dev#29189 (p…

79f6740

…andas-dev#29700)

simonjayhawkins mentioned this pull request Mar 31, 2020

BUG: duplicate indexing on non-integer index with positional indexers failing in py3 #13427

Closed
