Skip to content

ENH: allow index of col names in set_index GH10797 #11944

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

StephenKappel
Copy link
Contributor

See #10797

The desired result is to allow slices of the column index to be used in set_index. Like this:

>>> import pandas as pd
>>> data1 = 'x'*5 + 'y'*5
>>> data2 = ['a','b','c','d','e']*2
>>> data3 = range(10)
>>> data_dict = {'Cat': list(data1), 'SubCat': list(data2), 'Vals':data3}
>>> df = pd.DataFrame(data_dict)
>>> ordered_df = df[['Cat', 'SubCat', 'Vals']]
>>> ordered_df.set_index(ordered_df.columns[:2])
            Vals
Cat SubCat      
x   a          0
    b          1
    c          2
    d          3
    e          4
y   a          5
    b          6
    c          7
    d          8
    e          9

However, passing an Index of values to set_index currently has similar behavior to passing an array of values; the index will be used as the values of the new index. As to not break this behavior, this PR only treats the index like a list of column labels when the length of the index is not the same as the length of the DataFrame.

@jreback
Copy link
Contributor

jreback commented Jan 2, 2016

you don't need to open a new PR when changing things , just force push to the old one.

@jreback jreback added Indexing Related to indexing on series/frames, not to indexes themselves API Design MultiIndex labels Jan 2, 2016
@jreback jreback added this to the 0.18.0 milestone Jan 2, 2016
@@ -108,6 +108,7 @@ Other enhancements
- A simple version of ``Panel.round()`` is now implemented (:issue:`11763`)
- For Python 3.x, ``round(DataFrame)``, ``round(Series)``, ``round(Panel)`` will work (:issue:`11763`)
- ``Dataframe`` has gained a ``_repr_latex_`` method in order to allow for automatic conversion to latex in a ipython/jupyter notebook using nbconvert. Options ``display.latex.escape`` and ``display.latex.longtable`` have been added to the configuration and are used automatically by the ``to_latex`` method.(:issue:`11778`)
- ``set_index`` now accepts indexes of column labels in the keys parameter (:issue:`10797`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.set_index now accepts list-likes .....

@StephenKappel StephenKappel force-pushed the 10797-col-name-index branch 2 times, most recently from dd1cb29 to 6cac509 Compare January 4, 2016 00:59
@StephenKappel
Copy link
Contributor Author

@jreback - not sure if this approach is better or worse, but I gave it a shot. Instead of checking for specific values in the index, I've written it to check that the array underlying the Index passed as keys is the same array underlying the frame's columns. My thinking is that if the user is passing a slice of the columns index, they almost surely intend to use the Index as a list of existing columns to move to the row index. However, if the Index passed to set_index is constructed independently of columns, the user's intent is almost surely to set that as the index.

I feel like this will give the most intuitive behavior, but we do have the case where two Index objects that are identical (in terms of their values) yield different behavior.

What do you think?

if isinstance(keys, Index):
# if the index is a slice of the column index, treat it like
# a list of column labels; otherwise, treat it like a new index
keys_base = keys.base
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not a reliable approach, nor should you be using the private .base.

I am not convinced we can do this in a reliable way.

@jreback
Copy link
Contributor

jreback commented Jan 4, 2016

I think that we could allow a passed actual Index to simply set. That is unambiguous and clear. While other list-likes will work as they do now.

Further I think it makes sense to accept (only an Index) on Series. So this will be de-facto:

df.set_index(i) <-> df.index = i

@jreback
Copy link
Contributor

jreback commented Jan 11, 2016

@StephenKappel can you update

@StephenKappel
Copy link
Contributor Author

Yes; I will update. Just been tied up for the last week with some other stuff...

@jreback
Copy link
Contributor

jreback commented Jan 11, 2016

@StephenKappel np thanks

@StephenKappel
Copy link
Contributor Author

I don't feel like any solution here is a net improvement. Currently, when passed an Index, set_index() uses the passed index as the new index (ie. df.set_index(i) <-> df.index = i). Changing this behavior seems like a bad idea. The current behavior is used in the Pandas code base and tests (ex. https://github.com/pydata/pandas/blob/master/pandas/io/sas.py#L468 and https://github.com/pydata/pandas/blob/master/pandas/tests/test_multilevel.py#L2242-L2255) and presumably other projects. The only place that this current behavior is undesirable is when a slice of the columns index is being used with the intent of using the Index like a list of column labels.

ordered_df.set_index(ordered_df.columns[:2])

My initial attempts were kludgy hacks that attempted to guess the user's intention based on the characteristics of the passed Index. To avoid any ambiguity, I don't see any good solution other than a keyword argument. But, this doesn't seem like any great improvement over the current solution of having the user coerce the Index to a list before passing it to set_index().

ordered_df.set_index(list(ordered_df.columns[:2]))

vs.

ordered_df.set_index(ordered_df.columns[:2], keys_as_col_labels=True)

Thoughts?

@jreback
Copy link
Contributor

jreback commented Jan 19, 2016

@StephenKappel

re-reading, I think this would be extra confusing w/o a different keyword.

Ok, let's instead repurpose this to a doc-example in the set_index section here

and put an example for .set_index(an_index) (you can in fact use the example in the issue).

If its more clear then use a separate sub-section (or maybe put it all in a note).

Further can you enhance the doc-strings with examples like this?

thanks

@jreback jreback added the Docs label Jan 19, 2016
@jreback jreback modified the milestones: Next Major Release, 0.18.0 Jan 30, 2016
@jreback
Copy link
Contributor

jreback commented Mar 12, 2016

pls rebase/update.

@jreback jreback removed this from the Next Major Release milestone Mar 13, 2016
@StephenKappel
Copy link
Contributor Author

Sorry for the delay; I'll update the documentation this week.

@jreback
Copy link
Contributor

jreback commented May 9, 2016

@StephenKappel I am a bit concerned that this is going to be confusing. .set_index by definition has to take names from self and use them to set, rather than taking in a completly new object.

yes it would be convenient to do:

df.set_index(index) rather than df.index = index. But can this be done completely unambiguously?

@StephenKappel
Copy link
Contributor Author

Yes - my understanding was that the current code shouldn't change, but I was going to update the docs to include examples of how the desired column subsetting could be done in the current code (i.e. converting index to a list). I'll remove my code changes and add the doc changes.

@jreback
Copy link
Contributor

jreback commented May 9, 2016

@StephenKappel the problem with docing this is that we have very different behavior for a list and an Index.

@jreback
Copy link
Contributor

jreback commented Sep 9, 2016

I think this needs a re-think on the API.

@jreback jreback closed this Sep 9, 2016
@jreback jreback mentioned this pull request Jan 12, 2017
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Docs Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants