Skip to content

Rethinking to_flat_index on flat Index #23670

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
toobaz opened this issue Nov 13, 2018 · 11 comments
Open

Rethinking to_flat_index on flat Index #23670

toobaz opened this issue Nov 13, 2018 · 11 comments
Labels
Bug MultiIndex Needs Discussion Requires discussion from core team before further action

Comments

@toobaz
Copy link
Member

toobaz commented Nov 13, 2018

With reference to #22866, and to a comment I made ( #22866 (comment) ) which I thought was paranoid and maybe is not.

Three considerations:

  1. to_flat_index is the only method (the only operation in pandas, actually) I can think of which gives a different result when called on a flat Index and when called on an equivalent 1-level MultiIndex: you obtain the Index itself (each item being assumingly a scalar) in the first case and an Index of length 1 tuples in the second case
  2. if one needs, for any reason, an Index of length-one scalar, there is no simple way to obtain it. Or if you prefer: if we add some sep or analogous arguments (e.g. fmt) to to_flat_index, then it is not going to work on a flat Index (at least with the current implementation). In general, to_flat_index is a method which makes sense also because it can be combined with, for instance .map. When doing so, it is good to know for sure that the arguments received by the callable are always tuples.
  3. Index.to_flat_index() as it is now is idempotent and pretty useless: it makes sense for compatibility... but then compatibility is a priority!

These all lead me to conclude: shouldn't pd.Index.to_flat_index() return an Index of length-1 tuples?

@WillAyd

Code Sample, a copy-pastable example if possible

In [2]: pd.Index([1, 2, 3,]).to_flat_index()
Out[2]: Int64Index([1, 2, 3], dtype='int64')

In [3]: pd.MultiIndex.from_arrays([pd.Index([1, 2, 3,])]).to_flat_index()
Out[3]: Index([(1,), (2,), (3,)], dtype='object')

Expected Output

Out[2] should be equal to Out[3]

Output of pd.show_versions()

INSTALLED VERSIONS

commit: 454ecfc
python: 3.5.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.0-8-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: it_IT.UTF-8
LOCALE: it_IT.UTF-8

pandas: 0.24.0.dev0+995.g454ecfc61
pytest: 3.5.0
pip: 9.0.1
setuptools: 39.2.0
Cython: 0.28.4
numpy: 1.14.3
scipy: 0.19.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.5.6
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: 1.2.0dev
tables: 3.3.0
numexpr: 2.6.1
feather: 0.3.1
matplotlib: 2.2.2.post1634.dev0+ge8120cf6d
openpyxl: 2.3.0
xlrd: 1.0.0
xlwt: 1.3.0
xlsxwriter: 0.9.6
lxml: 4.1.1
bs4: 4.5.3
html5lib: 0.999999999
sqlalchemy: 1.0.15
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: 0.2.1
gcsfs: None

@toobaz toobaz added MultiIndex Needs Discussion Requires discussion from core team before further action labels Nov 13, 2018
@WillAyd
Copy link
Member

WillAyd commented Nov 13, 2018

Hmm not sure I agree. Might just be my personal usage bias but why would someone opt for a 1-D MultiIndex in the first place?

I ultimately foresee most people using sep or the fmt argument you proposed with this method rather than creating an Index of tuples at the end of the day

@toobaz
Copy link
Member Author

toobaz commented Nov 13, 2018

Hmm not sure I agree. Might just be my personal usage bias but why would someone opt for a 1-D MultiIndex in the first place?

I don't have in mind a killer application, but my answer to your question is "for the same reason why people would want to call pd.Index.to_flat_index(): compatibility"...

@toobaz
Copy link
Member Author

toobaz commented Nov 13, 2018

I ultimately foresee most people using sep or the fmt argument you proposed with this method rather than creating an Index of tuples at the end of the day

... and explaining that .to_flat_index(fmt=callable) is just syntactic sugar for .to_flat_index().map(lambda x : callable(*x)) is I think a nice and concise description - which is currently false for pd.Index.

@h-vetinari
Copy link
Contributor

[...] only method [...] I can think of which gives a different result when called on a flat Index and when called on an equivalent 1-level MultiIndex

FWIW, .str.cat works for Index, but not a 1-level MultiIndex, which I guess qualifies as a different result. ;-)

There's probably many more broken methods for 1-level MultiIndex in the .str accessor (the .str-constructor passes for 1-level MultiIndex, but the methods behind it are completely untested for this case). My current idea/plan is to disable .str for MultiIndex in #23167 (which is in the process of being split into several PRs).

>>> pd.Index(['a', 'b', 'c']).str.cat()
'abc'
>>> pd.MultiIndex.from_arrays([pd.Index(['a', 'b', 'c'])]).str.cat()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\ProgramData\Miniconda3\envs\pf-dev\lib\site-packages\pandas\core\strings.py", line 2275, in cat
    data = Series(self._orig, index=self._orig)
  File "C:\ProgramData\Miniconda3\envs\pf-dev\lib\site-packages\pandas\core\series.py", line 191, in __init__
    raise NotImplementedError("initializing a Series from a "
NotImplementedError: initializing a Series from a MultiIndex is not supported

@toobaz
Copy link
Member Author

toobaz commented Nov 13, 2018

FWIW, .str.cat works for Index, but not a 1-level MultiIndex, which I guess qualifies as a different result. ;-)

Right, thanks! But this is clearly a bug... (because at least some other .str methods work...)

My current idea/plan is to disable .str for MultiIndex in #23167

Sorry, naive question, but what is the problem with just running .str on the result of self.get_level_values(0)?

More in general, differences between pd.Index and 1-level pd.MultiIndex could be of two kinds:

  • not implemented features of one or the other
  • consequences of design decisions

I think there are no differences of the second kind. If we have, and plan to keep, differences of the first kind (and not considering them as "bug"), then I wonder whether we should ban 1-level pd.MultiIndex altogether.

(Notice that if we did that, I would even more convinced of the current issue, where we pretend that pd.Index behaves like a 1-level pd.MultiIndex)

@h-vetinari
Copy link
Contributor

Sorry, naive question, but what is the problem with just running .str on the result of self.get_level_values(0)?

Fair point, will most likely use that. Just didn't bother investigating further when I saw those were mostly broken. Only question then would be the following: Currently, Index.str.<method> always returns an Index. It'd be easy to return the same for 1-level MultiIndex, but would that change (from MI->Index) be confusing/surprising from a user-POV? I guess this goes towards your second point.

[...] then I wonder whether we should ban 1-level pd.MultiIndex altogether.

I admit I don't have the full implications in view, but would be sympathetic (+0.25) to always turn 1-level MI to regular Index where necessary.

(Notice that if we did that, I would even more convinced of the current issue, where we pretend that pd.Index behaves like a 1-level pd.MultiIndex)

I may be missing something, but if that ban should be enforced, then this issue goes away, no?
It would be much more strange (IMO) to have:

In [2]: pd.Index([1, 2, 3,]).to_flat_index()
Out[2]: Index([(1,), (2,), (3,)], dtype='object')

@toobaz
Copy link
Member Author

toobaz commented Nov 13, 2018

I may be missing something, but if that ban should be enforced, then this issue goes away, no?
It would be much more strange (IMO) to have:

On the contrary, I would think it is even more important. to_flat_index always produces an index of tuples, except on Index. As of now, we can say (e.g. to a user whose callable expects an iterable argument) "there is the 1-level MultiIndex for that". If we decide to replace all 1-level MultiIndexes with flat Indexes, then we really want Index to behave, on any "assumingly multilevel operation", like a 1-level MultiIndex does now.

@h-vetinari
Copy link
Contributor

h-vetinari commented Nov 14, 2018

then we really want Index to behave, on any "assumingly multilevel operation", like a 1-level MultiIndex does now.

This sounds like an argument against abolishing 1-level MultiIndex to me -- what I mean is that IMO, "assumingly multilevel operation" do not have to work on an Index at all (per se).

@toobaz
Copy link
Member Author

toobaz commented Nov 14, 2018

This sounds like an argument against abolishing 1-level MultiIndex to me -- what I mean is that IMO, "assumingly _multi_level operation" do not have to work on an Index at all (per se).

That we do tend to "backport" (all?) MultiIndex methods to flat Index is a fact, right? And I think that's a good thing. That's what I had in mind with "assumingly multilevel"...

@WillAyd
Copy link
Member

WillAyd commented Feb 2, 2019

Giving this some more thought after the PR you posted I am +/- 0 here. Part of me thinks consistency is great in that regardless of whether an Index or MultiIndex calls this that you'd get a tuplized version of the index, but from a practicality perspective why do you think that is beneficial?

My assumption (potentially wrong) is that users would still have to be aware of the labels contained within the calling object since they'd be indexing afterwards, and moving from a scalar to a tuple there feels weird, though maybe it's weird for them to ever call this in the first place then...

@toobaz
Copy link
Member Author

toobaz commented Feb 3, 2019

from a practicality perspective why do you think that is beneficial?

I think it is close to irrelevant (I'm insisting on this only i) for consistency/robust code ii) because the thing is brand new so I want to solve this soon), but if I do try to think to some real world case, it's going to be something like:

def anonymize(idx, hashfunc):
    """
    Replace each label in a (multi-)index with its hashfunc-ed version.
    """
    new_cont = [tuple([hashfunc(l) for l in t]) for t in idx.to_flat_index()]
    return pd.Index(new_cont, tupleize_cols=True)

... which currently will break on flat indexes.

(Not claiming my example is particularly well coded, there might be better ones)

On the other hand, I'm pretty sure we don't loose anything from subtracting the user another alias for lambda self : self (which is the current behavior on flat indexes)!

toobaz added a commit to toobaz/pandas that referenced this issue Feb 7, 2019
@mroeschke mroeschke added the Bug label Jun 23, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug MultiIndex Needs Discussion Requires discussion from core team before further action
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants