ENH: allow index of col names in set_index GH10797 #11944

StephenKappel · 2016-01-02T22:48:53Z

The desired result is to allow slices of the column index to be used in set_index. Like this:

>>> import pandas as pd
>>> data1 = 'x'*5 + 'y'*5
>>> data2 = ['a','b','c','d','e']*2
>>> data3 = range(10)
>>> data_dict = {'Cat': list(data1), 'SubCat': list(data2), 'Vals':data3}
>>> df = pd.DataFrame(data_dict)
>>> ordered_df = df[['Cat', 'SubCat', 'Vals']]
>>> ordered_df.set_index(ordered_df.columns[:2])
            Vals
Cat SubCat      
x   a          0
    b          1
    c          2
    d          3
    e          4
y   a          5
    b          6
    c          7
    d          8
    e          9

However, passing an Index of values to set_index currently has similar behavior to passing an array of values; the index will be used as the values of the new index. As to not break this behavior, this PR only treats the index like a list of column labels when the length of the index is not the same as the length of the DataFrame.

jreback · 2016-01-02T22:49:56Z

you don't need to open a new PR when changing things , just force push to the old one.

jreback · 2016-01-02T22:51:05Z

doc/source/whatsnew/v0.18.0.txt

@@ -108,6 +108,7 @@ Other enhancements
 - A simple version of ``Panel.round()`` is now implemented (:issue:`11763`)
 - For Python 3.x, ``round(DataFrame)``, ``round(Series)``, ``round(Panel)`` will work (:issue:`11763`)
 - ``Dataframe`` has gained a ``_repr_latex_`` method in order to allow for automatic conversion to latex in a ipython/jupyter notebook using nbconvert. Options ``display.latex.escape`` and ``display.latex.longtable`` have been added to the configuration and are used automatically by the ``to_latex`` method.(:issue:`11778`)
+- ``set_index`` now accepts indexes of column labels in the keys parameter (:issue:`10797`)


.set_index now accepts list-likes .....

…H10797

StephenKappel · 2016-01-04T01:15:21Z

@jreback - not sure if this approach is better or worse, but I gave it a shot. Instead of checking for specific values in the index, I've written it to check that the array underlying the Index passed as keys is the same array underlying the frame's columns. My thinking is that if the user is passing a slice of the columns index, they almost surely intend to use the Index as a list of existing columns to move to the row index. However, if the Index passed to set_index is constructed independently of columns, the user's intent is almost surely to set that as the index.

I feel like this will give the most intuitive behavior, but we do have the case where two Index objects that are identical (in terms of their values) yield different behavior.

What do you think?

jreback · 2016-01-04T01:28:06Z

pandas/core/frame.py

+            if isinstance(keys, Index):
+                # if the index is a slice of the column index, treat it like
+                # a list of column labels; otherwise, treat it like a new index
+                keys_base = keys.base


this is not a reliable approach, nor should you be using the private .base.

I am not convinced we can do this in a reliable way.

jreback · 2016-01-04T01:30:38Z

I think that we could allow a passed actual Index to simply set. That is unambiguous and clear. While other list-likes will work as they do now.

Further I think it makes sense to accept (only an Index) on Series. So this will be de-facto:

df.set_index(i) <-> df.index = i

jreback · 2016-01-11T13:46:48Z

@StephenKappel can you update

StephenKappel · 2016-01-11T13:50:04Z

Yes; I will update. Just been tied up for the last week with some other stuff...

jreback · 2016-01-11T13:55:46Z

@StephenKappel np thanks

StephenKappel · 2016-01-17T23:48:48Z

I don't feel like any solution here is a net improvement. Currently, when passed an Index, set_index() uses the passed index as the new index (ie. df.set_index(i) <-> df.index = i). Changing this behavior seems like a bad idea. The current behavior is used in the Pandas code base and tests (ex. https://github.com/pydata/pandas/blob/master/pandas/io/sas.py#L468 and https://github.com/pydata/pandas/blob/master/pandas/tests/test_multilevel.py#L2242-L2255) and presumably other projects. The only place that this current behavior is undesirable is when a slice of the columns index is being used with the intent of using the Index like a list of column labels.

ordered_df.set_index(ordered_df.columns[:2])

My initial attempts were kludgy hacks that attempted to guess the user's intention based on the characteristics of the passed Index. To avoid any ambiguity, I don't see any good solution other than a keyword argument. But, this doesn't seem like any great improvement over the current solution of having the user coerce the Index to a list before passing it to set_index().

ordered_df.set_index(list(ordered_df.columns[:2]))

vs.

ordered_df.set_index(ordered_df.columns[:2], keys_as_col_labels=True)

Thoughts?

jreback · 2016-01-19T19:57:57Z

@StephenKappel

re-reading, I think this would be extra confusing w/o a different keyword.

Ok, let's instead repurpose this to a doc-example in the set_index section here

and put an example for .set_index(an_index) (you can in fact use the example in the issue).

If its more clear then use a separate sub-section (or maybe put it all in a note).

Further can you enhance the doc-strings with examples like this?

thanks

jreback · 2016-03-12T18:01:58Z

pls rebase/update.

StephenKappel · 2016-05-09T04:04:38Z

Sorry for the delay; I'll update the documentation this week.

jreback · 2016-05-09T13:27:55Z

@StephenKappel I am a bit concerned that this is going to be confusing. .set_index by definition has to take names from self and use them to set, rather than taking in a completly new object.

yes it would be convenient to do:

df.set_index(index) rather than df.index = index. But can this be done completely unambiguously?

StephenKappel · 2016-05-09T13:34:59Z

Yes - my understanding was that the current code shouldn't change, but I was going to update the docs to include examples of how the desired column subsetting could be done in the current code (i.e. converting index to a list). I'll remove my code changes and add the doc changes.

jreback · 2016-05-09T13:38:08Z

@StephenKappel the problem with docing this is that we have very different behavior for a list and an Index.

jreback · 2016-09-09T22:45:16Z

I think this needs a re-think on the API.

jreback added Indexing Related to indexing on series/frames, not to indexes themselves API Design MultiIndex labels Jan 2, 2016

jreback added this to the 0.18.0 milestone Jan 2, 2016

jreback reviewed Jan 2, 2016
View reviewed changes

ENH: allow index of col names in set_index GH10797

e98e53d

StephenKappel force-pushed the 10797-col-name-index branch 2 times, most recently from dd1cb29 to 6cac509 Compare January 4, 2016 00:59

ENH: only treat index like col name list when slice of column index G…

557c2ad

…H10797

StephenKappel force-pushed the 10797-col-name-index branch from 6cac509 to 557c2ad Compare January 4, 2016 01:01

jreback reviewed Jan 4, 2016
View reviewed changes

jreback added the Docs label Jan 19, 2016

jreback modified the milestones: Next Major Release, 0.18.0 Jan 30, 2016

jreback removed this from the Next Major Release milestone Mar 13, 2016

jreback closed this Sep 9, 2016

jreback mentioned this pull request Jan 12, 2017

API/ENH: relabel method #15104

Closed

4 tasks

Uh oh!

ENH: allow index of col names in set_index GH10797 #11944

ENH: allow index of col names in set_index GH10797 #11944

Uh oh!

Conversation

StephenKappel commented Jan 2, 2016

Uh oh!

jreback commented Jan 2, 2016

Uh oh!

jreback Jan 2, 2016

Choose a reason for hiding this comment

Uh oh!

StephenKappel commented Jan 4, 2016

Uh oh!

jreback Jan 4, 2016

Choose a reason for hiding this comment

Uh oh!

jreback commented Jan 4, 2016

Uh oh!

jreback commented Jan 11, 2016

Uh oh!

StephenKappel commented Jan 11, 2016

Uh oh!

jreback commented Jan 11, 2016

Uh oh!

StephenKappel commented Jan 17, 2016

Uh oh!

jreback commented Jan 19, 2016

Uh oh!

jreback commented Mar 12, 2016

Uh oh!

StephenKappel commented May 9, 2016

Uh oh!

jreback commented May 9, 2016

Uh oh!

StephenKappel commented May 9, 2016

Uh oh!

jreback commented May 9, 2016

Uh oh!

jreback commented Sep 9, 2016

Uh oh!

Uh oh!