Skip to content

BUG: DataFrame/Series.loc improperly allows lookups of boolean labels/slices #20432

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jschendel opened this issue Mar 21, 2018 · 12 comments · Fixed by #40726
Closed

BUG: DataFrame/Series.loc improperly allows lookups of boolean labels/slices #20432

jschendel opened this issue Mar 21, 2018 · 12 comments · Fixed by #40726
Assignees
Labels
good first issue Indexing Related to indexing on series/frames, not to indexes themselves Needs Tests Unit test(s) needed to prevent regressions
Milestone

Comments

@jschendel
Copy link
Member

Code Sample, a copy-pastable example if possible

Basic example of the issue, specific to TimedeltaIndex, xref #20408 (comment)

In [2]: s = pd.Series(list('abcde'), pd.timedelta_range(0, 4, freq='ns'))

In [3]: s.loc[True]
Out[3]: 'b'

In [4]: s.loc[False:True]
Out[4]:
00:00:00           a
00:00:00.000000    b
Freq: N, dtype: object

Indexing with both boolean labels and slices was successful, which doesn't seem right.

I investigated this same behavior across various index types for both Series and DataFrame, and produced the summary below.

Summary

  • 'raises' column indicates if the indexing operation raised an exception
  • 'exception' column indicates the type of exception raised
                                  raises  exception
CategoricalIndex DataFrame label    True   KeyError
                           slice    True   KeyError
                 Series    label   False        NaN
                           slice    True   KeyError
DatetimeIndex    DataFrame label   False        NaN
                           slice   False        NaN
                 Series    label   False        NaN
                           slice   False        NaN
Float64Index     DataFrame label   False        NaN
                           slice   False        NaN
                 Series    label   False        NaN
                           slice   False        NaN
Index            DataFrame label   False        NaN
                           slice   False        NaN
                 Series    label   False        NaN
                           slice   False        NaN
Int64Index       DataFrame label    True   KeyError
                           slice   False        NaN
                 Series    label    True   KeyError
                           slice   False        NaN
IntervalIndex    DataFrame label    True  TypeError
                           slice    True  TypeError
                 Series    label    True  TypeError
                           slice    True  TypeError
MultiIndex       DataFrame label    True   KeyError
                           slice    True   KeyError
                 Series    label    True   KeyError
                           slice    True   KeyError
PeriodIndex      DataFrame label    True   KeyError
                           slice   False        NaN
                 Series    label    True   KeyError
                           slice   False        NaN
RangeIndex       DataFrame label    True   KeyError
                           slice   False        NaN
                 Series    label    True   KeyError
                           slice   False        NaN
TimedeltaIndex   DataFrame label   False        NaN
                           slice   False        NaN
                 Series    label   False        NaN
                           slice   False        NaN
UInt64Index      DataFrame label    True   KeyError
                           slice   False        NaN
                 Series    label    True   KeyError
                           slice   False        NaN

Code to produce summary

indexes = [
    pd.RangeIndex(4),
    pd.Int64Index(range(4)),
    pd.UInt64Index(range(4)),
    pd.Float64Index(range(4)),
    pd.CategoricalIndex(range(4)),
    pd.date_range(0, periods=4, freq='ns'),
    pd.timedelta_range(0, periods=4, freq='ns'),
    pd.interval_range(0, periods=4),
    pd.Index([0, 1, 2, 3], dtype=object),
    pd.MultiIndex.from_product([[0, 1], [0, 1]]),
    pd.period_range('2018Q1', freq='Q', periods=4),  # need better example here
]

result = {}
for index in indexes:
    index_name = type(index).__name__
    s = pd.Series(list('abcd'), index=index)
    for obj in (s, s.to_frame()):
        obj_name = type(obj).__name__

        # check single label
        key = (index_name, obj_name, 'label')
        try:
            obj.loc[True]
            result[key] = {'raises': False}
        except Exception as e:
            result[key] = {'raises': True, 'exception': type(e).__name__}

        # check slice
        key = (index_name, obj_name, 'slice')
        try:
            obj.loc[False:True]
            result[key] = {'raises': False}
        except Exception as e:
            result[key] = {'raises': True, 'exception': type(e).__name__}

result = pd.DataFrame.from_dict(result, orient='index')

Expected Output

I'd generally expect all of these operations to raise a KeyError, which a couple potential exceptions:

  • I'd be open to an argument for numeric indexes casting to integer equivalent. Seems like this should at least be consistent for labels vs slices, which it is not right now.
  • Maybe we should allow conversion for the object dtype Index?
@jreback
Copy link
Contributor

jreback commented Mar 22, 2018

yeah I agree slice should certainly raise, prob TypeError. There is some small precedent for a label (or even list-indexer) of booleans to work on a boolean inferred index type (so very small exception here).

But I think making all of these TypeError is prob reasonable. I wouldn't cast.

@jreback jreback added Indexing Related to indexing on series/frames, not to indexes themselves Compat pandas objects compatability with Numpy or Python functions Difficulty Advanced labels Mar 22, 2018
@jreback jreback added this to the Next Major Release milestone Mar 22, 2018
@mroeschke
Copy link
Member

Looks like these raise sensible errors on master now. Could use a test

In [39]: pd.__version__
Out[39]: '1.1.0.dev0+1216.gd4d58f960'

In [40]: In [2]: s = pd.Series(list('abcde'), pd.timedelta_range(0, 4, freq='ns'))
    ...:
    ...: In [3]: s.loc[True]
KeyError: True

In [41]: In [4]: s.loc[False:True]
TypeError: Unexpected type for 'value': <class 'bool'>

@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Compat pandas objects compatability with Numpy or Python functions Indexing Related to indexing on series/frames, not to indexes themselves labels Apr 10, 2020
@karuppiah7890
Copy link

I'm going to give this a shot! 😄

@karuppiah7890
Copy link

take

@jbrockmendel jbrockmendel added the Indexing Related to indexing on series/frames, not to indexes themselves label Sep 21, 2020
@karuppiah7890
Copy link

karuppiah7890 commented Oct 6, 2020

Any suggestions on how to go about this? @mroeschke @MarcoGorelli

I noticed some files related to loc method in these places

code - pandas/core/indexing.py

tests:
pandas/tests/indexing/test_loc.py
pandas/tests/indexing/multiindex/test_loc.py
pandas/tests/series/indexing/test_loc.py

I still have to understand the existing tests and then write the new ones. I'm pretty new to Python 😅 Let me know if this is a big issue to pickup. I can pickup something smaller in that case.

PS: I'm still stuck at my machine setup - I tried VS Code Remote Containers setup but it didn't work out, I think it's because it took up lots of RAM and was killed because of Out Of Memory, as I have set some limits in my Docker Desktop, like 1GB RAM. I plan to setup the whole thing in my local. Is that okay? I hope it doesn't affect any of my other python related software / tools I use, given it is more sandboxed with virtualenv (I'm assuming. Python newbie here)

@MarcoGorelli
Copy link
Member

Hi @karuppiah7890

I plan to setup the whole thing in my local. Is that okay?

Yup, that's what I've done, I have a pandas-dev conda environment, feel free to ask if you have trouble setting up

@karuppiah7890
Copy link

I have setup the environment. I tried to run the tests, but there were tons, so stopped it in between. Can someone help me understand what kind of test is needed here?

Are we looking for tests that assert for KeyError and TypeError ?

In [40]: In [2]: s = pd.Series(list('abcde'), pd.timedelta_range(0, 4, freq='ns'))
    ...:
    ...: In [3]: s.loc[True]
KeyError: True

In [41]: In [4]: s.loc[False:True]
TypeError: Unexpected type for 'value': <class 'bool'>

@karuppiah7890
Copy link

I guess I'll have to first learn a bit more about the existing tests. I'll do that and get back here. I'll look for any suggestions over here, please do post if you have any

@karuppiah7890 karuppiah7890 removed their assignment Dec 29, 2020
@karuppiah7890
Copy link

Since I'm not working on this anymore, I'll leave it for others to pick it up. 🙈

@DriesSchaumont
Copy link
Member

take

@DriesSchaumont
Copy link
Member

Seems like this is still not working for IntervalIndex:

>>> import pandas as pd
>>> df = pd.DataFrame(range(4))
>>> df.index = pd.interval_range(0, periods=4)
>>> df.loc[True]
0    0
Name: (0, 1], dtype: int64

Same as for the regular Index:

df = pd.DataFrame(range(4))
df.index = pd.Index([0, 1, 2, 3], dtype=object)
df.loc[True]
0    1
Name: 1, dtype: int64

Do these cases need seperate issues?

@mzeitlin11
Copy link
Member

I don't think so, provided the same fix can handle both cases. Could open another issue if you find out one or the other needs some special attention for some reason.

@jreback jreback removed this from the Contributions Welcome milestone Apr 2, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Indexing Related to indexing on series/frames, not to indexes themselves Needs Tests Unit test(s) needed to prevent regressions
Projects
None yet
8 participants