Skip to content

ENH: Add optional argument keep_index to dataframe melt method #17459

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

NiklasKeck
Copy link

@NiklasKeck NiklasKeck commented Sep 7, 2017

Setting keep_index to True will reuse the original DataFrame index +
names of melted columns as additional level. closes issue #17440

I appreciate any corrections, comments and/or help very much, as this is my first pull request on such a big project. Thank you.

Setting keep_index to True will reuse the original DataFrame index +
names of melted columns as additional level. closes issue pandas-dev#17440
@codecov
Copy link

codecov bot commented Sep 7, 2017

Codecov Report

Merging #17459 into master will decrease coverage by 0.02%.
The diff coverage is 33.33%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #17459      +/-   ##
==========================================
- Coverage   91.15%   91.13%   -0.03%     
==========================================
  Files         163      163              
  Lines       49591    49599       +8     
==========================================
- Hits        45207    45200       -7     
- Misses       4384     4399      +15
Flag Coverage Δ
#multiple 88.91% <33.33%> (-0.02%) ⬇️
#single 40.24% <0%> (-0.07%) ⬇️
Impacted Files Coverage Δ
pandas/core/frame.py 97.72% <ø> (-0.1%) ⬇️
pandas/core/reshape/reshape.py 98.25% <33.33%> (-1.04%) ⬇️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 20fee85...0c64bf0. Read the comment docs.

@TomAugspurger
Copy link
Contributor

Thanks.

We'll need tests and docs as well. Test can go in tests/reshape. For the docs, it'll need a whatsnew note in doc/source/whatsnew/v0.21.0.txt, prose docs in doc/source/reshaping.rst. It'd also be nice to have an example in the docstring.

About the implementation, the keep_index keyword doesn't fully describe what we're, since it's "keep the index, and append var_name. I wonder it makes more sense to have a keyword like index={None, 'full', 'original', 'var} (names TBD).

  • None: the current behavior, discard the original index and end with a RangeIndex
  • 'full': original index + the metadata from var_name
  • 'original': the original index
  • 'var': The newly created var

But as I write this, I wonder if the last two would ever be useful? Do we just need a better name than keep_index?

@gfyoung gfyoung added API Design Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Sep 7, 2017
@NiklasKeck
Copy link
Author

Thank you for the comments!

I agree that keep_index is not descriptive enough. I find it hard to come up with something short that describes the whole idea within a boolean argument. Using your idea of an index keyword with multiple options looks good.

Maybe rename 'full' to 'append_variables' instead? So the whole options would be:

  • index = {None, ‘original‘, ‘append_variables‘}

index = ‘append_variables‘ would probably be intuitive to understand as index = index + variables

I cannot think of a good usecase for the option 'var', but I started using pandas not long ago, so there might be plenty.

Another idea:

keep_index (boolean) if True:

  • Just keep the original index (append nothing) and let the user decide what to append in a next step to make the index unique.

  • Just keep the original index and append an additional RangeIndex level (the melt_id from issue Index gets lost when DataFrame melt method is used #17440) to ensure uniqueness.

Anyway I would go for @TomAugspurger‘s idea to use a keyword with multiple options.

When we have decided what's best, I will challenge myself with writing tests and documentation :).

@@ -4367,6 +4367,10 @@ def unstack(self, level=-1, fill_value=None):
Name to use for the 'value' column.
col_level : int or string, optional
If columns are a MultiIndex then use this level to melt.
keep_index : boolean, optional, default False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is commonly called index=False everywhere else.

add a versionadded

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So better to just name it index and if True resulting in the original index with duplicate entries? What about the option @TomAugspurger proposed?


if keep_index:
orig_index_values = list(np.tile(frame.index.get_values(), K))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is quite awkward, you have several cases which you need to disambiguate. e.g. if the original is a MI or not.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jreback for looking over my code and the comment.

I think what I wrote should work with any number of levels.

E. g.

arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
idx_multi = pd.MultiIndex.from_tuples(tuples)
idx_single = pd.Index(arrays[0])

# Index
print(list(np.tile(idx_single, 1)))
print(list(np.tile(idx_single, 2)))

# MultiIndex
print(list(np.tile(idx_multi, 1)))
print(list(np.tile(idx_multi, 2)))

But do I have to make it more explicit (= Pythonic)? Or did I miss something else?

@jreback
Copy link
Contributor

jreback commented Oct 28, 2017

pls rebase.

@TomAugspurger TomAugspurger mentioned this pull request Oct 30, 2017
2 tasks
@jreback
Copy link
Contributor

jreback commented Nov 25, 2017

closing as stale

@gitgithan
Copy link

@NiklasKeck @TomAugspurger What happened to this pull request? I came from #17440 and wish to contribute.
1st time contributor here, what should i know?
Below is what i currently think i should do

  1. read https://pandas.pydata.org/pandas-docs/stable/contributing.html#committing-your-code,
  2. find out which files/functions need to be changed (how do i find out all the paths a function call can take?)
  3. find out which files/functions the changes can affect
  4. identify their effects and ensure they do not damage existing usability (how is this done?)

Do i have to choose 1 of Travis-CI, Appveyor , or CircleCI to hook onto my github?

@TomAugspurger
Copy link
Contributor

We needed to merge master into this PR to see if the tests still passed.

You can see the changed fils in
https://github.com/pandas-dev/pandas/pull/17459/files

And then run the tests as described in the contributing docs.

You don't have to do anything with the CI services.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Index gets lost when DataFrame melt method is used
5 participants