API: unclear what integer level name references: name or position? #21677

Open
jorisvandenbossche opened this issue Jun 29, 2018 · 8 comments
Labels
API Design MultiIndex Needs Discussion Requires discussion from core team before further action

Comments

@jorisvandenbossche
Member

I assumed that in e.g. get_level_values, stack, unstack, ... (which all boil down to the behaviour of _get_level_number), using an integer to specify the level would always mean positional, because we have no way to disambiguate between position and name in those cases.

But, this seems very inconsistent:

In [40]: mi = pd.MultiIndex.from_product([[0, 1], [2, 3], [4, 5]], names=[1, 1, 2])

In [41]: mi._get_level_number(0)     # <--- positional
Out[41]: 0

In [43]: mi._get_level_number(1)     # <--- positional (as name should raise an error given duplicates)
Out[43]: 1

In [44]: mi._get_level_number(2)     # <--- positional / name is the same
Out[44]: 2

In [45]: mi = pd.MultiIndex.from_product([[0, 1], [2, 3], [4, 5]], names=[2, 1, 0])

In [46]: mi._get_level_number(0)     # <--- name
Out[46]: 2

In [47]: mi._get_level_number(2)     # <--- name
Out[47]: 0

In [48]: mi = pd.MultiIndex.from_product([[0, 1], [2, 3], [4, 5]], names=[2, 1, 1])

In [49]: mi._get_level_number(2)     # <--- name
Out[49]: 0

In [50]: mi._get_level_number(1)     # <--- positional (as name should raise an error given duplicates)
Out[50]: 1

In [51]: mi._get_level_number(0)     # <--- positional
Out[51]: 0

In [52]: mi = pd.MultiIndex.from_product([[0, 1], [2, 3], [4, 5]], names=[1, 0, 1])

In [53]: mi._get_level_number(1)     # <--- positional (as name should raise an error given duplicates)
Out[53]: 1

In [54]: mi._get_level_number(0)     # <--- name
Out[54]: 1

In [55]: mi = pd.MultiIndex.from_product([[0, 1], [2, 3], [4, 5]], names=[0, 0, 1])

In [56]: mi._get_level_number(0)     # <--- positional (as name should raise an error given duplicates)
Out[56]: 0

In [57]: mi._get_level_number(1)     # <--- name
Out[57]: 2

In [58]: mi._get_level_number(2)     # <--- positional

Am I missing something? Was this discussed before?

cc @toobaz

@jreback
Contributor

jreback commented Jun 29, 2018

I believe this was discussed in some earlier issues.

The solution, of course, is to internally use only a positional indexer for most ops, and to convert a name to a position (or raise / warn if it is duplicated or ambiguous in user code).
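A minimal sketch of that idea, assuming "ambiguous" means either a duplicated name or an integer that is both a valid position and the name of a different level. The function `resolve_level_strict` and its error messages are hypothetical, not pandas API:

```python
def resolve_level_strict(names, level):
    """Return the position for `level`, raising when its meaning is unclear.

    Hypothetical helper for illustration only; not part of pandas.
    """
    names = list(names)
    matches = [pos for pos, name in enumerate(names) if name == level]
    if len(matches) > 1:
        raise ValueError(f"level name {level!r} is duplicated at positions {matches}")

    is_position = isinstance(level, int) and -len(names) <= level < len(names)
    if matches and is_position and matches[0] != level % len(names):
        # The integer is a name of one level and the position of another.
        raise ValueError(f"level {level!r} is ambiguous: name or position?")

    if matches:
        return matches[0]
    if is_position:
        return level % len(names)
    raise KeyError(f"Level {level!r} not found")


# Unambiguous cases resolve to a position; ambiguous ones raise.
print(resolve_level_strict(["a", "b"], "b"))
try:
    resolve_level_strict([2, 1, 0], 0)
except ValueError as err:
    print(err)
```

Once the level is resolved this way at the API boundary, everything downstream can work with positions only.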

@jorisvandenbossche
Member Author

OK, there is some "logic" in the above of course, which is: try "name", and if that errors, fall back to "positional" (similar to the logic in ix).
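That "name first, then position" rule can be sketched in plain Python. This is an approximation written to match the outputs shown above, not the actual pandas source; the standalone function `get_level_number` and its error messages are assumptions for illustration:

```python
def get_level_number(names, level):
    """Resolve `level` to a position: try it as a name, then as a position.

    Illustrative sketch only; not the pandas implementation.
    """
    names = list(names)
    try:
        if names.count(level) > 1:
            # A duplicated name is handled like "name not found" below.
            raise ValueError(f"The name {level} occurs multiple times")
        return names.index(level)  # raises ValueError if the name is absent
    except ValueError:
        if not isinstance(level, int):
            raise KeyError(f"Level {level} not found")
        if not -len(names) <= level < len(names):
            raise IndexError(f"Too many levels: index has only {len(names)} levels")
        return level % len(names)


# Same names as the first two examples in the issue.
print(get_level_number([1, 1, 2], 1))  # duplicated name, falls back to position: 1
print(get_level_number([2, 1, 0], 0))  # unique name 0 sits at position: 2
```

Under this rule the same integer is a name or a position depending on which names happen to exist and whether they are unique.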

In any case, this is not clearly documented in e.g. get_level_values, whose docstring seems to indicate that an integer is always positional:

level : int or str
    ``level`` is either the integer position of the level in the
    MultiIndex, or the name of the level.

We actually have an explicit test to cover the "name" case:

def test_get_level_number_integer(self):
    self.index.names = [1, 0]
    assert self.index._get_level_number(1) == 0
    assert self.index._get_level_number(0) == 1

@jorisvandenbossche
Member Author

I believe this was discussed in some earlier issues.

Yes, I also have that feeling, but I couldn't immediately find anything.

The solution, of course, is to internally use only a positional indexer for most ops, and to convert a name to a position (or raise / warn if it is duplicated or ambiguous in user code).

That is already what happens with _get_level_number (the method I show above). E.g. in get_level_values, the passed level (position or name) is converted to a position with _get_level_number, and then the code only assumes positional.
But it is this conversion to an integer in _get_level_number that has some dubious (or not properly documented) logic.

@jorisvandenbossche
Member Author

It is touched on a bit in the discussions in #7770, #8584 and #8809 (but I haven't read them through yet).

@toobaz
Member

toobaz commented Jun 29, 2018

I think #18872 (comment) is related

@toobaz
Member

toobaz commented Jun 29, 2018

Ideally, referring to a level with an integer should always give preference to positional interpretation, because it is unambiguous (e.g. duplicated names are not a problem). And I would avoid (at least in the long term) any kind of fallback. But doing this now will at least require a deprecation cycle.
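One way such a transition could look, as a rough sketch: integers are always read as positions, and a warning is emitted when the integer is also the name of some other level. The function `resolve_level_positional_first`, the choice of `FutureWarning` and the message text are assumptions, not an agreed pandas design:

```python
import warnings


def resolve_level_positional_first(names, level):
    """Read integers as positions; look up anything else by name.

    Hypothetical helper for illustration only; not part of pandas.
    """
    names = list(names)
    if isinstance(level, int):
        if not -len(names) <= level < len(names):
            raise IndexError(f"Too many levels: index has only {len(names)} levels")
        position = level % len(names)
        # Warn when the same integer also names a level at another position.
        if any(name == level for pos, name in enumerate(names) if pos != position):
            warnings.warn(
                f"{level!r} is also a level name; interpreting it as a position",
                FutureWarning,
                stacklevel=2,
            )
        return position

    if names.count(level) > 1:
        raise ValueError(f"The name {level!r} occurs multiple times, use a level number")
    if level not in names:
        raise KeyError(f"Level {level!r} not found")
    return names.index(level)


# With names [2, 1, 0], the integer 0 is read as position 0 (and warns),
# while the name-first rule shown earlier in the thread returns 2.
print(resolve_level_positional_first([2, 1, 0], 0))
```

Duplicated names are not a problem for integers here because no name lookup happens for them.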

@mroeschke
Member

I discussed this a bit in #17123.

@riedgar-ms

I'd be interested in a resolution to this as well. In the library I'm working on, I've fallen back on refusing to accept DataFrames with integer names, to avoid the ambiguity, but that's not a great approach.
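A guard of that kind might look roughly like the following, assuming "integer names" refers to index level names. The helper `reject_integer_level_names` is hypothetical and works on any sequence of names:

```python
import numbers


def reject_integer_level_names(names):
    """Raise if any level name is an integer, since it could be read as a position.

    Hypothetical helper for illustration only.
    """
    bad = [name for name in names if isinstance(name, numbers.Integral)]
    if bad:
        raise ValueError(f"integer level names are not supported: {bad}")


# With pandas this could be called as, e.g.:
#   reject_integer_level_names(df.index.names)
reject_integer_level_names(["year", "month", None])  # passes silently
```

Unnamed levels (`None`) pass the check, because they cannot be confused with a position.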

5 participants