CLN/API: wide_to_long or lreshape #15003

Closed
jreback opened this issue Dec 28, 2016 · 10 comments
Labels
API Design Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Comments

@jreback
Contributor

jreback commented Dec 28, 2016

xref #2567

In [27]: data = pd.DataFrame({'hr1': [514, 573], 'hr2': [545, 526],
    ...:                       'team': ['Red Sox', 'Yankees'],
    ...:                       'year1': [2007, 2008], 'year2': [2008, 2008]})
    ...: 

In [28]: data
Out[28]: 
   hr1  hr2     team  year1  year2
0  514  545  Red Sox   2007   2008
1  573  526  Yankees   2008   2008

In [29]: pd.lreshape(data, {'year': ['year1', 'year2'], 'hr': ['hr1', 'hr2']})
Out[29]: 
      team  year   hr
0  Red Sox  2007  514
1  Yankees  2008  573
2  Red Sox  2008  545
3  Yankees  2008  526

In [30]: pd.wide_to_long(data, ['hr', 'year'], 'team', 'index')
Out[30]: 
                hr  year
team    index           
Red Sox 1      514  2007
Yankees 1      573  2008
Red Sox 2      545  2008
Yankees 2      526  2008

So we should drop one of these.

@jreback jreback added API Design Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Dec 28, 2016
@jreback jreback added this to the Next Major Release milestone Dec 28, 2016
@jreback
Contributor Author

jreback commented Dec 28, 2016

cc @Nuffe

@erikcs
Contributor

erikcs commented Dec 28, 2016

Yes having both is redundant, but I think wide_to_long is more flexible?

lreshape does not handle group variables of different length

wide_to_long produces the correct result for lreshape's test case (but that is with dropna=False, which is also the output Stata gives)

I could not make lreshape produce the intended output for all of wide_to_long's test cases. This one, or this, for example.
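A minimal sketch of the differences described above, reusing the example frame from the top of the issue. The missing `hr2` value is an assumption added here to make the `dropna` behaviour visible, and reading "group variables of different length" as unequal column lists is also an interpretation, not something the comment spells out.

```python
import numpy as np
import pandas as pd

# Same frame as the opening example, with one hr2 value made missing (assumption).
data = pd.DataFrame({
    "hr1": [514, 573], "hr2": [545, np.nan],
    "team": ["Red Sox", "Yankees"],
    "year1": [2007, 2008], "year2": [2008, 2008],
})
groups = {"year": ["year1", "year2"], "hr": ["hr1", "hr2"]}

# lreshape drops rows with a missing value in any reshaped column by default.
dropped = pd.lreshape(data, groups)
kept = pd.lreshape(data, groups, dropna=False)

# wide_to_long keeps the row with the missing hr value.
long = pd.wide_to_long(data, stubnames=["hr", "year"], i="team", j="index")

# lreshape rejects column groups of unequal length.
try:
    pd.lreshape(data, {"year": ["year1", "year2"], "hr": ["hr1"]})
    uneven_ok = True
except ValueError:
    uneven_ok = False

print(len(dropped), len(kept), len(long), uneven_ok)
```

With `dropna=False`, lreshape and wide_to_long agree on the number of rows; only the index layout differs.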

@tdpetrou
Contributor

Is lreshape getting deprecated? There are some SO answers getting a decent amount of upvotes.

@tdpetrou
Contributor

@jreback I really like wide_to_long as it's the easiest way to 'simultaneously melt' different sets of columns. It would be nice if the identification variables, i, were optional, as lreshape is slightly easier to use when there are no identification variables. Also, it would be good if i were changed to id_vars and j changed to var_name. Maybe this can all be solved if melt were to take a list of lists of columns.
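To illustrate the point about required identification variables, here is a rough sketch of one possible workaround when the frame has none: borrow the row index as a throwaway id for wide_to_long. The column names `row` and `num` are made up for this example.

```python
import pandas as pd

# The issue's example frame without the 'team' column, so no id variable exists.
wide = pd.DataFrame({"hr1": [514, 573], "hr2": [545, 526],
                     "year1": [2007, 2008], "year2": [2008, 2008]})

# wide_to_long needs an `i` column, so expose the row index under a temporary name.
tmp = wide.reset_index().rename(columns={"index": "row"})
long = (pd.wide_to_long(tmp, stubnames=["hr", "year"], i="row", j="num")
          .reset_index(drop=True))

# lreshape needs no id column at all.
same = pd.lreshape(wide, {"hr": ["hr1", "hr2"], "year": ["year1", "year2"]})

print(long)
print(same)
```

Both calls produce the same (hr, year) pairs; lreshape just gets there without the extra bookkeeping.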

@jreback
Contributor Author

jreback commented Aug 24, 2017

@tdpetrou well reducing the API surface area is good. not averse to modifying .melt() to do this. if you have a proposal pls put it up.

@jreback
Contributor Author

jreback commented Aug 24, 2017

the issue is we have 3 functions to do somewhat similar things. happy to consolidate the API. (aside from which documentation on lreshape is nil and wide_to_long not much better)

@tdpetrou
Contributor

The simplest addition to melt would be to add functionality to do the simultaneous melting of different sets of columns. I think this would be achievable with the value_vars parameter accepting a list of lists or even a dictionary of lists (like lreshape). I think this would eliminate any use of lreshape.

To add the functionality of pd.wide_to_long, you might have to add three parameters, stubs, sep and suffix, where stubs would be a boolean indicating whether the value_vars are stubnames.
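As a rough illustration of the "dictionary of lists" idea, the sketch below melts several column groups at once using only `rename` and `pd.concat`. The helper `melt_groups` is hypothetical: it is not part of pandas and is not the implementation proposed for melt.

```python
import pandas as pd

def melt_groups(df, id_vars, groups):
    """Hypothetical helper (not a pandas API): melt several column groups together.

    `groups` maps each output column name to the list of wide columns feeding it.
    """
    lengths = {len(cols) for cols in groups.values()}
    if len(lengths) != 1:
        raise ValueError("all column groups must have the same length")
    pieces = []
    for pos in range(lengths.pop()):
        # Take the pos-th column of every group and rename it to the group name.
        rename = {cols[pos]: name for name, cols in groups.items()}
        pieces.append(df[list(id_vars) + list(rename)].rename(columns=rename))
    # Stack the per-position slices on top of each other.
    return pd.concat(pieces, ignore_index=True)

data = pd.DataFrame({"hr1": [514, 573], "hr2": [545, 526],
                     "team": ["Red Sox", "Yankees"],
                     "year1": [2007, 2008], "year2": [2008, 2008]})

result = melt_groups(data, ["team"],
                     {"year": ["year1", "year2"], "hr": ["hr1", "hr2"]})
print(result)
```

On the issue's example frame this yields the same team, year and hr columns as the lreshape call in the opening comment.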

@erikcs
Contributor

erikcs commented Oct 13, 2017

I agree the current configuration is not elegant. I made an earlier PR to wide_to_long to fix some edge cases that were wrong (which I discovered while cleaning a data set), but I don't think it fits nicely into a consistent "calculus of data manipulations".

Looking to R and the "tidyverse" they now and then change their API and introduce new "verbs" for existing concepts: before, long was melt, and wide was done with dcast. Now it's gather and spread. In econometrics and statistics, long and wide is the common nomenclature, and is what Stata adheres to. Stata may be a dinosaur, but they are extremely consistent in their API and naming scheme.

Pandas' melt is a copy of Hadley Wickham's melt, which is a modification of base R's reshape (same command name as Stata, by the way) with a new name, giving the API an impression of bits and pieces taken from here and there.

I don't really have a good and general proposal for a solution here, more than that IMHO a nomenclature should perhaps be chosen and stuck with.

@tdpetrou
Contributor

tdpetrou commented Oct 13, 2017

@erikcs I made a major enhancement to melt in #17677. With that, it can simultaneously melt any number of columns, and supports any kind of multiindex (it had very poor support before that) and handles duplicate column names as well. It also has wide_to_long functionality and with a little more tweaking it will exactly replicate it.

@mroeschke
Member

It appears #34314 and #34313 are the more current discussion issues for deprecating one of these functions, so closing in favor of those.
