Skip to content

QST:Why DataFrame.pivot() will change the order of the rows of DataFrame? #39143

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
2 tasks done
jiahe224 opened this issue Jan 13, 2021 · 11 comments · Fixed by #40954
Closed
2 tasks done

QST:Why DataFrame.pivot() will change the order of the rows of DataFrame? #39143

jiahe224 opened this issue Jan 13, 2021 · 11 comments · Fixed by #40954
Labels
Needs Discussion Requires discussion from core team before further action Reshaping Concat, Merge/Join, Stack/Unstack, Explode Usage Question
Milestone

Comments

@jiahe224
Copy link

jiahe224 commented Jan 13, 2021

  • I have searched the [pandas] tag on StackOverflow for similar questions.

  • I have asked my usage related question on StackOverflow.


Question about pandas

Note: If you'd still like to submit a question, please read this guide detailing how to provide the necessary information for us to reproduce your question.

# Your code here, if applicable
@jiahe224 jiahe224 added Needs Triage Issue that has not been reviewed by a pandas team member Usage Question labels Jan 13, 2021
@phofl
Copy link
Member

phofl commented Jan 13, 2021

Hi, thanks for your question. Could you provide something reproducible?

@jiahe224
Copy link
Author

jiahe224 commented Jan 14, 2021

test = pd.DataFrame({'indus':['12.文教、工美、体育','其中:氯碱','3.黑色金属矿采选业'],'地区':['a','b','c'],'num':[23,21,34],
'year':['2018','2018','2019']})
test_pivot = test.pivot_table(index=['indus','地区'],columns='year',values='num',aggfunc='sum').reset_index()

pivot is like to pivot_table

You will find that the order of the rows of test and test_pivot is not the same. @phofl thanks

@attack68
Copy link
Contributor

with a westernised version of your test I detected no change in rows for the following:

test = pd.DataFrame({'indus': ['d1', 'd2', 'd3'], 'col': ['a', 'b', 'c'], 'num': [23, 21, 34], 'year': ['2018', '2018', '2019']})
test_pivot = test.pivot_table(index=['indus', 'col'], columns='year', values='num', aggfunc='sum').reset_index()
print(test)
print(test_pivot)

  indus col  num  year
0    d1   a   23  2018
1    d2   b   21  2018
2    d3   c   34  2019

year indus col  2018  2019
0       d1   a  23.0   NaN
1       d2   b  21.0   NaN
2       d3   c   NaN  34.0

and a re-order of rows for:

test = pd.DataFrame({'indus': ['d1', 'd4', 'd3'], 'col': ['a', 'b', 'c'], 'num': [23, 21, 34], 'year': ['2018', '2018', '2019']})
test_pivot = test.pivot_table(index=['indus', 'col'], columns='year', values='num', aggfunc='sum').reset_index()
print(test)
print(test_pivot)

  indus col  num  year
0    d1   a   23  2018
1    d4   b   21  2018
2    d3   c   34  2019

year indus col  2018  2019
0       d1   a  23.0   NaN
1       d3   c   NaN  34.0
2       d4   b  21.0   NaN

I will guess that pivot_table does some lexicographical sort on the index?

@phofl
Copy link
Member

phofl commented Jan 17, 2021

This is quite easy to fix, we are currently not passing anything to a groupby which is called unter the hood, which defaults to sort=True then. But we have a lot of tests depending on this behavior, so since this is covered not sure if we can simply change this

@phofl phofl added Needs Discussion Requires discussion from core team before further action Reshaping Concat, Merge/Join, Stack/Unstack, Explode and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 17, 2021
@jiahe224
Copy link
Author

I think like groupby(), it would be better to add the sort parameter, the user can decide the boolean value of sort.

@jiahe224
Copy link
Author

It was definitely reordered, but I didn't see the sort parameter in pivot(), so I guessed it might rely on groupby(), but after I changed the default value of sort in groupby() to False, it still didn't work. @attack68

@attack68
Copy link
Contributor

The example I gave above can be corrected by going to pandas/core/reshape/pivot.py and changing:

115 grouped = data.groupby(keys, observed=observed)

to

115 grouped = data.groupby(keys, observed=observed, sort=False)

Note that changing the pandas/core/frame.py line 6780 to default False in groupby declaration also works.

@phofl
Copy link
Member

phofl commented Jan 22, 2021

Yes but this breaks a lot of tests, hence the question what to do exactly

@attack68
Copy link
Contributor

If you add a sort keyword to pivot table in pivot.py#pivot_table and frame.py#pivot_table defaulting to True it doesn't impact the tests since its backwards compatible no?

The the user can specify pivot_table(sort=False) directly? Or did I miss something - perhaps extension of the api is not ideal?

@rhshadrach
Copy link
Member

cc @jreback - any thoughts on adding an argument sort=True to pivot to control the groupby used?

@phofl
Copy link
Member

phofl commented Apr 6, 2021

Related to #15105

@jreback jreback added this to the 1.3 milestone Apr 20, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Needs Discussion Requires discussion from core team before further action Reshaping Concat, Merge/Join, Stack/Unstack, Explode Usage Question
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants