BUG: loc behaves incorrectly on PeriodIndex within MultiIndex with 3 levels #24091

PAeberhard · 2018-12-04T14:41:43Z

Code Sample

import pandas as pd
pd.show_versions()

p_index = pd.PeriodIndex(["20181101 1100", "20181101 1200", "20181102 1300", "20181102 1400"], name='datetime', freq="B")
mi_series = pd.DataFrame([["A", "B", 1.0],
                          ["A", "C", 2.0],
                          ["Z", "Q", 3.0],
                          ["W", "F", 4.0]], index=p_index, columns=["ONE", "TWO", "VALUES"])
mi_series = mi_series.set_index(["ONE", "TWO"], append=True)["VALUES"]

# mi_series:
# datetime    ONE  TWO
# 2018-11-01  A    B      1.0
#                  C      2.0
# 2018-11-02  Z    Q      3.0
#             W    F      4.0
# Name: VALUES, dtype: float64

mi_series_2levels = mi_series.reset_index("ONE", drop=True)

# Should return a single value:
mi_series.loc[(p_index[0], "A", "B")]
# Expected: 1.0
# Actual:
# ONE  TWO
# A    B      1.0
#      C      2.0
# Name: VALUES, dtype: float64

# Workaround:
assert mi_series.loc[p_index[0]].loc[("A", "B")] == 1.0

# If the MultiIndex has only two levels, it works as expected:
assert mi_series_2levels.loc[(p_index[0], "B")] == 1.0

# Here is probably why:
from pandas.core.tools.datetimes import parse_time_string
mi_key = (pd.Period("2018-11-01", freq="B"), "1", "1")

# This returns a nonsensical tuple:
parse_time_string(mi_key)

# And this returns True when it probably shouldn't:
mi_key in mi_series.index.levels[0]

Problem description

For a Series with a MultiIndex that 1) has exactly 3 levels, and 2) the first level is a PeriodIndex (see examples above), indexing by a triple (series.loc[(first, second, third)]) behaves the same as series.loc[first], i.e. the return value is another series with a 2-level MultiIndex, whereas I expect a scalar value from the series.

If the first level is not a PeriodIndex, or the MultiIndex has any number of levels other than 3, the problem does not occur.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 2.7.13.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-514.10.2.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8

pandas: 0.23.4
pytest: 3.1.0
pip: 8.1.1
setuptools: 28.8.0
Cython: None
numpy: 1.15.4
scipy: None
pyarrow: None
xarray: None
IPython: 5.3.0
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

The text was updated successfully, but these errors were encountered:

gfyoung · 2018-12-06T00:58:48Z

Weird! Investigation and PR are welcome!

cc @jreback @toobaz

simonariddell · 2018-12-13T06:48:44Z

Following the data initialized in the original post, I've expanded some further findings:

good_key = mi_key = (pd.Period("2018-11-01", freq="B"), "1")
bad_key = (pd.Period("2018-11-01", freq="B"), "1", "1")

lvls = mi_series.index.levels[0]
good_key in lvls
# > False
bad_key in lvls
# > True

In both these cases, due to it being a period index, it hits this dunder method:

pandas/pandas/core/indexes/period.py

Lines 404 to 415 in b7ee829

    
           def __contains__(self, key): 
        
               if isinstance(key, Period): 
        
                   if key.freq != self.freq: 
        
                       return False 
        
                   else: 
        
                       return key.ordinal in self._engine 
        
               else: 
        
                   try: 
        
                       self.get_loc(key) 
        
                       return True 
        
                   except Exception: 
        
                       return False

Then it hits get_loc.

Here we see (similar to the original post) that the 'bad key', which has three levels, does exist in the lvls, when it shouldn't. Looking into it a little deeper the reasoning seems to be due to this block (specifically line 681):

pandas/pandas/core/indexes/period.py

Lines 680 to 690 in b7ee829

    
           try: 
        
               asdt, parsed, reso = parse_time_string(key, self.freq) 
        
               key = asdt 
        
           except TypeError: 
        
               pass 
        
           except DateParseError: 
        
               # A string with invalid format 
        
               raise KeyError("Cannot interpret '{}' as period".format(key)) 
        
           try: 
        
               key = Period(key, freq=self.freq)

For a key that has length != 3, this fails. For example, with good_key in my example above, it raises a value error in the try block, and as a result doesn't run the following line and assign a value to key.

ValueError: not enough values to unpack (expected 3, got 2)

Whereas for bad_key, it succeeds 'unexpectedly'.
If I break into this method and introspect the variables, we can see what's going on

print(asdt)
> Period('2018-11-01', 'B')
print(parsed
> '1'
print(reso)
> '1'

As a result, the key is now set to only Period('2018-11-01', 'B'). In fact, it seems in this case this try block should never be reached, and the fact that it fails on all levels that are not of length three is the correct outcome, but not deliberately expected.

I'd love to hear from someone more familiar with the code base, but it seems the correct first step would be to add a test for this here

pandas/pandas/tests/indexes/period/test_indexing.py

Line 347 in b7ee829

def test_get_loc(self):

And next would be to (potentially) add a conditional on line 680 of periods.py that ensures we aren't passing the wrong type of tuple to parse_time_string. If this seems sensible I can work on it and submit a PR.

toobaz · 2018-12-13T13:56:15Z

@SimonAlecks thanks for your analysis! I don't have much time right now, but tomorrow I will give a more detailed look and report back.

toobaz · 2018-12-14T17:36:24Z

And next would be to (potentially) add a conditional on line 680 of periods.py that ensures we aren't passing the wrong type of tuple to parse_time_string. If this seems sensible I can work on it and submit a PR.

I think your diagnosis is correct; concerning the cure, I think that parse_time_string should definitely raise TypeError if passed a non-string, and after all PeriodIndex.get_loc seems to expect precisely that in the try... except block you isolated. So I would say that

pandas/pandas/_libs/tslibs/parsing.pyx

Line 116 in 040f06f

return arg

should raise a TypeError rather than returning its argument. @SimonAlecks does this make sense to you?

toobaz · 2018-12-14T17:39:55Z

And next would be to (potentially) add a conditional on line 680 of periods.py that ensures we aren't passing the wrong type of tuple to parse_time_string. If this seems sensible I can work on it and submit a PR.

I think your diagnosis is correct; concerning the cure, I think that parse_time_string should definitely raise TypeError if passed a non-string, and after all PeriodIndex.get_loc seems to expect precisely that in the try... except block you isolated. So I would say that

pandas/pandas/_libs/tslibs/parsing.pyx

Lines 113 to 116 in 040f06f

    
           if not isinstance(arg, (str, unicode)): 
        
               # Note: cython recognizes `unicode` in both py2/py3, optimizes 
        
               # this check into a C call. 
        
               return arg

should raise a TypeError rather than returning its argument. @SimonAlecks does this make sense to you?

simonariddell · 2018-12-15T03:59:03Z

Makes perfect sense, thanks @toobaz I'll work on it and submit a PR.

mroeschke · 2021-06-23T05:09:22Z

This looks to work on master now. Could use a test

In [9]: mi_series
Out[9]:
datetime    ONE  TWO
2018-11-01  A    B      1.0
                 C      2.0
2018-11-02  Z    Q      3.0
            W    F      4.0
Name: VALUES, dtype: float64

In [10]: mi_series_2levels = mi_series.reset_index("ONE", drop=True)

In [11]: mi_series.loc[(p_index[0], "A", "B")]
Out[11]: 1.0

jackgoldsmith4 · 2022-06-13T00:18:09Z

take

gfyoung added Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex Bug labels Dec 6, 2018

simonariddell mentioned this issue Dec 15, 2018

BUG: Fix index bug due to parse_time_string GH24091 #24294

Closed

4 tasks

mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Bug Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex labels Jun 23, 2021

simonjayhawkins added this to the Contributions Welcome milestone Jun 5, 2022

github-actions bot assigned jackgoldsmith4 Jun 13, 2022

jackgoldsmith4 mentioned this issue Jun 19, 2022

Add tests #47420

Merged

3 tasks

phofl closed this as completed in #47420 Jun 21, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: loc behaves incorrectly on PeriodIndex within MultiIndex with 3 levels #24091

BUG: loc behaves incorrectly on PeriodIndex within MultiIndex with 3 levels #24091

PAeberhard commented Dec 4, 2018

INSTALLED VERSIONS

gfyoung commented Dec 6, 2018 •

edited

Loading

simonariddell commented Dec 13, 2018

toobaz commented Dec 13, 2018

toobaz commented Dec 14, 2018

toobaz commented Dec 14, 2018

simonariddell commented Dec 15, 2018

mroeschke commented Jun 23, 2021

jackgoldsmith4 commented Jun 13, 2022

BUG: loc behaves incorrectly on PeriodIndex within MultiIndex with 3 levels #24091

BUG: loc behaves incorrectly on PeriodIndex within MultiIndex with 3 levels #24091

Comments

PAeberhard commented Dec 4, 2018

Code Sample

Problem description

Output of pd.show_versions()

INSTALLED VERSIONS

gfyoung commented Dec 6, 2018 • edited Loading

simonariddell commented Dec 13, 2018

toobaz commented Dec 13, 2018

toobaz commented Dec 14, 2018

toobaz commented Dec 14, 2018

simonariddell commented Dec 15, 2018

mroeschke commented Jun 23, 2021

jackgoldsmith4 commented Jun 13, 2022

Output of `pd.show_versions()`

gfyoung commented Dec 6, 2018 •

edited

Loading