Rethinking `to_flat_index` on flat `Index` #23670

toobaz · 2018-11-13T17:29:10Z

With reference to #22866, and to a comment I made ( #22866 (comment) ) which I thought was paranoid and maybe is not.

Three considerations:

to_flat_index is the only method (the only operation in pandas, actually) I can think of which gives a different result when called on a flat Index and when called on an equivalent 1-level MultiIndex: you obtain the Index itself (each item being assumingly a scalar) in the first case and an Index of length 1 tuples in the second case
if one needs, for any reason, an Index of length-one scalar, there is no simple way to obtain it. Or if you prefer: if we add some sep or analogous arguments (e.g. fmt) to to_flat_index, then it is not going to work on a flat Index (at least with the current implementation). In general, to_flat_index is a method which makes sense also because it can be combined with, for instance .map. When doing so, it is good to know for sure that the arguments received by the callable are always tuples.
Index.to_flat_index() as it is now is idempotent and pretty useless: it makes sense for compatibility... but then compatibility is a priority!

These all lead me to conclude: shouldn't pd.Index.to_flat_index() return an Index of length-1 tuples?

Code Sample, a copy-pastable example if possible

In [2]: pd.Index([1, 2, 3,]).to_flat_index()
Out[2]: Int64Index([1, 2, 3], dtype='int64')

In [3]: pd.MultiIndex.from_arrays([pd.Index([1, 2, 3,])]).to_flat_index()
Out[3]: Index([(1,), (2,), (3,)], dtype='object')

Expected Output

Out[2] should be equal to Out[3]

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: 454ecfc
python: 3.5.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.0-8-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: it_IT.UTF-8
LOCALE: it_IT.UTF-8

pandas: 0.24.0.dev0+995.g454ecfc61
pytest: 3.5.0
pip: 9.0.1
setuptools: 39.2.0
Cython: 0.28.4
numpy: 1.14.3
scipy: 0.19.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.5.6
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: 1.2.0dev
tables: 3.3.0
numexpr: 2.6.1
feather: 0.3.1
matplotlib: 2.2.2.post1634.dev0+ge8120cf6d
openpyxl: 2.3.0
xlrd: 1.0.0
xlwt: 1.3.0
xlsxwriter: 0.9.6
lxml: 4.1.1
bs4: 4.5.3
html5lib: 0.999999999
sqlalchemy: 1.0.15
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: 0.2.1
gcsfs: None

The text was updated successfully, but these errors were encountered:

WillAyd · 2018-11-13T17:33:28Z

Hmm not sure I agree. Might just be my personal usage bias but why would someone opt for a 1-D MultiIndex in the first place?

I ultimately foresee most people using sep or the fmt argument you proposed with this method rather than creating an Index of tuples at the end of the day

toobaz · 2018-11-13T17:57:18Z

Hmm not sure I agree. Might just be my personal usage bias but why would someone opt for a 1-D MultiIndex in the first place?

I don't have in mind a killer application, but my answer to your question is "for the same reason why people would want to call pd.Index.to_flat_index(): compatibility"...

toobaz · 2018-11-13T17:59:54Z

I ultimately foresee most people using sep or the fmt argument you proposed with this method rather than creating an Index of tuples at the end of the day

... and explaining that .to_flat_index(fmt=callable) is just syntactic sugar for .to_flat_index().map(lambda x : callable(*x)) is I think a nice and concise description - which is currently false for pd.Index.

h-vetinari · 2018-11-13T18:06:30Z

[...] only method [...] I can think of which gives a different result when called on a flat Index and when called on an equivalent 1-level MultiIndex

FWIW, .str.cat works for Index, but not a 1-level MultiIndex, which I guess qualifies as a different result. ;-)

There's probably many more broken methods for 1-level MultiIndex in the .str accessor (the .str-constructor passes for 1-level MultiIndex, but the methods behind it are completely untested for this case). My current idea/plan is to disable .str for MultiIndex in #23167 (which is in the process of being split into several PRs).

>>> pd.Index(['a', 'b', 'c']).str.cat()
'abc'
>>> pd.MultiIndex.from_arrays([pd.Index(['a', 'b', 'c'])]).str.cat()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\ProgramData\Miniconda3\envs\pf-dev\lib\site-packages\pandas\core\strings.py", line 2275, in cat
    data = Series(self._orig, index=self._orig)
  File "C:\ProgramData\Miniconda3\envs\pf-dev\lib\site-packages\pandas\core\series.py", line 191, in __init__
    raise NotImplementedError("initializing a Series from a "
NotImplementedError: initializing a Series from a MultiIndex is not supported

toobaz · 2018-11-13T18:36:57Z

FWIW, .str.cat works for Index, but not a 1-level MultiIndex, which I guess qualifies as a different result. ;-)

Right, thanks! But this is clearly a bug... (because at least some other .str methods work...)

My current idea/plan is to disable .str for MultiIndex in #23167

Sorry, naive question, but what is the problem with just running .str on the result of self.get_level_values(0)?

More in general, differences between pd.Index and 1-level pd.MultiIndex could be of two kinds:

not implemented features of one or the other
consequences of design decisions

I think there are no differences of the second kind. If we have, and plan to keep, differences of the first kind (and not considering them as "bug"), then I wonder whether we should ban 1-level pd.MultiIndex altogether.

(Notice that if we did that, I would even more convinced of the current issue, where we pretend that pd.Index behaves like a 1-level pd.MultiIndex)

h-vetinari · 2018-11-13T19:17:38Z

Sorry, naive question, but what is the problem with just running .str on the result of self.get_level_values(0)?

Fair point, will most likely use that. Just didn't bother investigating further when I saw those were mostly broken. Only question then would be the following: Currently, Index.str.<method> always returns an Index. It'd be easy to return the same for 1-level MultiIndex, but would that change (from MI->Index) be confusing/surprising from a user-POV? I guess this goes towards your second point.

[...] then I wonder whether we should ban 1-level pd.MultiIndex altogether.

I admit I don't have the full implications in view, but would be sympathetic (+0.25) to always turn 1-level MI to regular Index where necessary.

(Notice that if we did that, I would even more convinced of the current issue, where we pretend that pd.Index behaves like a 1-level pd.MultiIndex)

I may be missing something, but if that ban should be enforced, then this issue goes away, no?
It would be much more strange (IMO) to have:

In [2]: pd.Index([1, 2, 3,]).to_flat_index()
Out[2]: Index([(1,), (2,), (3,)], dtype='object')

toobaz · 2018-11-13T21:29:06Z

I may be missing something, but if that ban should be enforced, then this issue goes away, no?
It would be much more strange (IMO) to have:

On the contrary, I would think it is even more important. to_flat_index always produces an index of tuples, except on Index. As of now, we can say (e.g. to a user whose callable expects an iterable argument) "there is the 1-level MultiIndex for that". If we decide to replace all 1-level MultiIndexes with flat Indexes, then we really want Index to behave, on any "assumingly multilevel operation", like a 1-level MultiIndex does now.

h-vetinari · 2018-11-14T00:27:38Z

then we really want Index to behave, on any "assumingly multilevel operation", like a 1-level MultiIndex does now.

This sounds like an argument against abolishing 1-level MultiIndex to me -- what I mean is that IMO, "assumingly multilevel operation" do not have to work on an Index at all (per se).

toobaz · 2018-11-14T08:13:27Z

This sounds like an argument against abolishing 1-level MultiIndex to me -- what I mean is that IMO, "assumingly _multi_level operation" do not have to work on an Index at all (per se).

That we do tend to "backport" (all?) MultiIndex methods to flat Index is a fact, right? And I think that's a good thing. That's what I had in mind with "assumingly multilevel"...

closes pandas-dev#23670

WillAyd · 2019-02-02T23:58:18Z

Giving this some more thought after the PR you posted I am +/- 0 here. Part of me thinks consistency is great in that regardless of whether an Index or MultiIndex calls this that you'd get a tuplized version of the index, but from a practicality perspective why do you think that is beneficial?

My assumption (potentially wrong) is that users would still have to be aware of the labels contained within the calling object since they'd be indexing afterwards, and moving from a scalar to a tuple there feels weird, though maybe it's weird for them to ever call this in the first place then...

toobaz · 2019-02-03T10:05:49Z

from a practicality perspective why do you think that is beneficial?

I think it is close to irrelevant (I'm insisting on this only i) for consistency/robust code ii) because the thing is brand new so I want to solve this soon), but if I do try to think to some real world case, it's going to be something like:

def anonymize(idx, hashfunc):
    """
    Replace each label in a (multi-)index with its hashfunc-ed version.
    """
    new_cont = [tuple([hashfunc(l) for l in t]) for t in idx.to_flat_index()]
    return pd.Index(new_cont, tupleize_cols=True)

... which currently will break on flat indexes.

(Not claiming my example is particularly well coded, there might be better ones)

On the other hand, I'm pretty sure we don't loose anything from subtracting the user another alias for lambda self : self (which is the current behavior on flat indexes)!

closes pandas-dev#23670

toobaz added MultiIndex Needs Discussion Requires discussion from core team before further action labels Nov 13, 2018

h-vetinari mentioned this issue Nov 13, 2018

API: Series.str-accessor infers dtype (and Index.str does not raise on all-NA) #23167

Merged

3 tasks

h-vetinari mentioned this issue Nov 14, 2018

API/BUG: allow .str-accessor for 1-level MultiIndex? Return what? #23679

Closed

toobaz added a commit to toobaz/pandas that referenced this issue Feb 2, 2019

API: harmonize Index.to_flat_index to MultiIndex.to_flat_index

9070163

closes pandas-dev#23670

toobaz mentioned this issue Feb 2, 2019

API: harmonize Index.to_flat_index to MultiIndex.to_flat_index #25094

Closed

4 tasks

toobaz added a commit to toobaz/pandas that referenced this issue Feb 7, 2019

API: harmonize Index.to_flat_index to MultiIndex.to_flat_index

f7935c2

closes pandas-dev#23670

mroeschke added the Bug label Jun 23, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rethinking `to_flat_index` on flat `Index` #23670

Rethinking `to_flat_index` on flat `Index` #23670

toobaz commented Nov 13, 2018 •

edited

Loading

INSTALLED VERSIONS

WillAyd commented Nov 13, 2018

toobaz commented Nov 13, 2018

toobaz commented Nov 13, 2018

h-vetinari commented Nov 13, 2018

toobaz commented Nov 13, 2018 •

edited

Loading

h-vetinari commented Nov 13, 2018

toobaz commented Nov 13, 2018

h-vetinari commented Nov 14, 2018 •

edited

Loading

toobaz commented Nov 14, 2018 •

edited

Loading

WillAyd commented Feb 2, 2019

toobaz commented Feb 3, 2019 •

edited

Loading

Rethinking to_flat_index on flat Index #23670

Rethinking to_flat_index on flat Index #23670

Comments

toobaz commented Nov 13, 2018 • edited Loading

Code Sample, a copy-pastable example if possible

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

WillAyd commented Nov 13, 2018

toobaz commented Nov 13, 2018

toobaz commented Nov 13, 2018

h-vetinari commented Nov 13, 2018

toobaz commented Nov 13, 2018 • edited Loading

h-vetinari commented Nov 13, 2018

toobaz commented Nov 13, 2018

h-vetinari commented Nov 14, 2018 • edited Loading

toobaz commented Nov 14, 2018 • edited Loading

WillAyd commented Feb 2, 2019

toobaz commented Feb 3, 2019 • edited Loading

Rethinking `to_flat_index` on flat `Index` #23670

Rethinking `to_flat_index` on flat `Index` #23670

toobaz commented Nov 13, 2018 •

edited

Loading

Output of `pd.show_versions()`

toobaz commented Nov 13, 2018 •

edited

Loading

h-vetinari commented Nov 14, 2018 •

edited

Loading

toobaz commented Nov 14, 2018 •

edited

Loading

toobaz commented Feb 3, 2019 •

edited

Loading