Skip to content

BUG: loc behaves incorrectly on PeriodIndex within MultiIndex with 3 levels #24091

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
PAeberhard opened this issue Dec 4, 2018 · 8 comments · Fixed by #47420
Closed

BUG: loc behaves incorrectly on PeriodIndex within MultiIndex with 3 levels #24091

PAeberhard opened this issue Dec 4, 2018 · 8 comments · Fixed by #47420
Assignees
Labels
good first issue Needs Tests Unit test(s) needed to prevent regressions

Comments

@PAeberhard
Copy link

Code Sample

import pandas as pd
pd.show_versions()

p_index = pd.PeriodIndex(["20181101 1100", "20181101 1200", "20181102 1300", "20181102 1400"], name='datetime', freq="B")
mi_series = pd.DataFrame([["A", "B", 1.0],
                          ["A", "C", 2.0],
                          ["Z", "Q", 3.0],
                          ["W", "F", 4.0]], index=p_index, columns=["ONE", "TWO", "VALUES"])
mi_series = mi_series.set_index(["ONE", "TWO"], append=True)["VALUES"]

# mi_series:
# datetime    ONE  TWO
# 2018-11-01  A    B      1.0
#                  C      2.0
# 2018-11-02  Z    Q      3.0
#             W    F      4.0
# Name: VALUES, dtype: float64

mi_series_2levels = mi_series.reset_index("ONE", drop=True)

# Should return a single value:
mi_series.loc[(p_index[0], "A", "B")]
# Expected: 1.0
# Actual:
# ONE  TWO
# A    B      1.0
#      C      2.0
# Name: VALUES, dtype: float64

# Workaround:
assert mi_series.loc[p_index[0]].loc[("A", "B")] == 1.0

# If the MultiIndex has only two levels, it works as expected:
assert mi_series_2levels.loc[(p_index[0], "B")] == 1.0

# Here is probably why:
from pandas.core.tools.datetimes import parse_time_string
mi_key = (pd.Period("2018-11-01", freq="B"), "1", "1")

# This returns a nonsensical tuple:
parse_time_string(mi_key)

# And this returns True when it probably shouldn't:
mi_key in mi_series.index.levels[0]

Problem description

For a Series with a MultiIndex that 1) has exactly 3 levels, and 2) the first level is a PeriodIndex (see examples above), indexing by a triple (series.loc[(first, second, third)]) behaves the same as series.loc[first], i.e. the return value is another series with a 2-level MultiIndex, whereas I expect a scalar value from the series.

If the first level is not a PeriodIndex, or the MultiIndex has any number of levels other than 3, the problem does not occur.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.13.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-514.10.2.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8

pandas: 0.23.4
pytest: 3.1.0
pip: 8.1.1
setuptools: 28.8.0
Cython: None
numpy: 1.15.4
scipy: None
pyarrow: None
xarray: None
IPython: 5.3.0
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@gfyoung gfyoung added Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex Bug labels Dec 6, 2018
@gfyoung
Copy link
Member

gfyoung commented Dec 6, 2018

Weird! Investigation and PR are welcome!

cc @jreback @toobaz

@simonariddell
Copy link
Contributor

Following the data initialized in the original post, I've expanded some further findings:

good_key = mi_key = (pd.Period("2018-11-01", freq="B"), "1")
bad_key = (pd.Period("2018-11-01", freq="B"), "1", "1")

lvls = mi_series.index.levels[0]
good_key in lvls
# > False
bad_key in lvls
# > True

In both these cases, due to it being a period index, it hits this dunder method:

def __contains__(self, key):
if isinstance(key, Period):
if key.freq != self.freq:
return False
else:
return key.ordinal in self._engine
else:
try:
self.get_loc(key)
return True
except Exception:
return False

Then it hits get_loc.

Here we see (similar to the original post) that the 'bad key', which has three levels, does exist in the lvls, when it shouldn't. Looking into it a little deeper the reasoning seems to be due to this block (specifically line 681):

try:
asdt, parsed, reso = parse_time_string(key, self.freq)
key = asdt
except TypeError:
pass
except DateParseError:
# A string with invalid format
raise KeyError("Cannot interpret '{}' as period".format(key))
try:
key = Period(key, freq=self.freq)

For a key that has length != 3, this fails. For example, with good_key in my example above, it raises a value error in the try block, and as a result doesn't run the following line and assign a value to key.

ValueError: not enough values to unpack (expected 3, got 2)

Whereas for bad_key, it succeeds 'unexpectedly'.
If I break into this method and introspect the variables, we can see what's going on

print(asdt)
> Period('2018-11-01', 'B')
print(parsed
> '1'
print(reso)
> '1'

As a result, the key is now set to only Period('2018-11-01', 'B'). In fact, it seems in this case this try block should never be reached, and the fact that it fails on all levels that are not of length three is the correct outcome, but not deliberately expected.

I'd love to hear from someone more familiar with the code base, but it seems the correct first step would be to add a test for this here

def test_get_loc(self):

And next would be to (potentially) add a conditional on line 680 of periods.py that ensures we aren't passing the wrong type of tuple to parse_time_string. If this seems sensible I can work on it and submit a PR.

@toobaz
Copy link
Member

toobaz commented Dec 13, 2018

@SimonAlecks thanks for your analysis! I don't have much time right now, but tomorrow I will give a more detailed look and report back.

@toobaz
Copy link
Member

toobaz commented Dec 14, 2018

And next would be to (potentially) add a conditional on line 680 of periods.py that ensures we aren't passing the wrong type of tuple to parse_time_string. If this seems sensible I can work on it and submit a PR.

I think your diagnosis is correct; concerning the cure, I think that parse_time_string should definitely raise TypeError if passed a non-string, and after all PeriodIndex.get_loc seems to expect precisely that in the try... except block you isolated. So I would say that

should raise a TypeError rather than returning its argument. @SimonAlecks does this make sense to you?

@toobaz
Copy link
Member

toobaz commented Dec 14, 2018

And next would be to (potentially) add a conditional on line 680 of periods.py that ensures we aren't passing the wrong type of tuple to parse_time_string. If this seems sensible I can work on it and submit a PR.

I think your diagnosis is correct; concerning the cure, I think that parse_time_string should definitely raise TypeError if passed a non-string, and after all PeriodIndex.get_loc seems to expect precisely that in the try... except block you isolated. So I would say that

if not isinstance(arg, (str, unicode)):
# Note: cython recognizes `unicode` in both py2/py3, optimizes
# this check into a C call.
return arg

should raise a TypeError rather than returning its argument. @SimonAlecks does this make sense to you?

@simonariddell
Copy link
Contributor

Makes perfect sense, thanks @toobaz I'll work on it and submit a PR.

@mroeschke
Copy link
Member

This looks to work on master now. Could use a test

In [9]: mi_series
Out[9]:
datetime    ONE  TWO
2018-11-01  A    B      1.0
                 C      2.0
2018-11-02  Z    Q      3.0
            W    F      4.0
Name: VALUES, dtype: float64

In [10]: mi_series_2levels = mi_series.reset_index("ONE", drop=True)

In [11]: mi_series.loc[(p_index[0], "A", "B")]
Out[11]: 1.0

@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Bug Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex labels Jun 23, 2021
@simonjayhawkins simonjayhawkins added this to the Contributions Welcome milestone Jun 5, 2022
@jackgoldsmith4
Copy link
Contributor

take

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Needs Tests Unit test(s) needed to prevent regressions
Projects
None yet
7 participants