
BUG: sort_values in groupby makes some values lost #35137


Closed
leohazy opened this issue Jul 6, 2020 · 6 comments
Labels: Apply (Apply, Aggregate, Transform, Map), Bug, Duplicate Report (Duplicate issue or pull request), Groupby

Comments


leohazy commented Jul 6, 2020

  • [x] I have checked that this issue has not already been reported.

  • [x] I have confirmed this bug exists on the latest version of pandas.

  • [ ] (optional) I have confirmed this bug exists on the master branch of pandas.


Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

  • Source data: refer to b.csv.
import pandas as pd

a = pd.read_csv('https://raw.githubusercontent.com/leohazy/bug/master/b.csv')

def holding0(x):
    x["holding"] = x["shares"].sum()
    print(x[['close_max', 'close_min']])
    return x

def holding(x):
    x["holding"] = x["shares"].sum()
    x = x.sort_values(by=["hold_start"])
    print(x[['close_max', 'close_min']])
    return x

def holding1(x):
    x["holding"] = x["shares"].sum()
    x['hold_start'] = x['hold_start'].apply(lambda d: pd.to_datetime(d, format="%Y-%m-%d"))
    x = x.sort_values(by=["hold_start"])
    print(x[['close_max', 'close_min']])
    return x

def holding2(x):
    x["holding"] = int(x["shares"].sum())
    x = x.sort_values(by=["hold_start"])
    print(x[['close_max', 'close_min']])
    return x

def holding3(x):
    x["holding"] = float(x["shares"].sum())
    x = x.sort_values(by=["hold_start"])
    print(x[['close_max', 'close_min']])
    return x

def holding4(x):
    x["holding"] = x["shares"].sum()
    x.sort_values(by=["hold_start"], inplace=True)
    print(x[['close_max', 'close_min']])
    return x

def holding5(x):
    x["holding"] = x["shares"].sum()
    x = x.sort_values(by=["hold_start"]).copy()
    print(x[['close_max', 'close_min']])
    return x

b = a.groupby(by="companycode").apply(holding)

Problem description

  • After sorting within groupby, several columns contain only repeated values, e.g. ratio, shares, close_max, close_min (the columns after 'ratio').

  1. Apply with the holding(x) function; the 'close_max' and 'close_min' columns come back with repeated values:

result

close_max close_min

33 16.9286 14.5889
43 16.9286 14.5889
13 16.8893 11.9444
23 16.8893 11.9444
53 16.8893 11.9444
close_max close_min
36 16.9286 14.5889
46 16.9286 14.5889
16 16.8893 11.9444
26 16.8893 11.9444
56 16.8893 11.9444
close_max close_min
6 16.8893 11.9444
close_max close_min
4 16.8893 11.9444
close_max close_min
39 16.9286 14.5889
49 16.9286 14.5889
19 16.8893 11.9444
29 16.8893 11.9444
59 16.8893 11.9444
close_max close_min
38 16.9286 14.5889
48 16.9286 14.5889
17 16.8893 11.9444
27 16.8893 11.9444
58 16.8893 11.9444
close_max close_min
5 16.8893 11.9444
close_max close_min
34 16.9286 14.5889
44 16.9286 14.5889
14 16.8893 11.9444
24 16.8893 11.9444
54 16.8893 11.9444
close_max close_min
7 16.8893 11.9444
close_max close_min
32 16.9286 14.5889
42 16.9286 14.5889
12 16.8893 11.9444
22 16.8893 11.9444
52 16.8893 11.9444
close_max close_min
0 16.8893 11.9444
close_max close_min
30 16.9286 14.5889
40 16.9286 14.5889
11 16.8893 11.9444
21 16.8893 11.9444
50 16.8893 11.9444
close_max close_min
31 16.9286 14.5889
41 16.9286 14.5889
10 16.8893 11.9444
20 16.8893 11.9444
51 16.8893 11.9444
close_max close_min
35 16.9286 14.5889
45 16.9286 14.5889
15 16.8893 11.9444
25 16.8893 11.9444
55 16.8893 11.9444
close_max close_min
3 16.8893 11.9444
close_max close_min
8 16.8893 11.9444
close_max close_min
9 16.8893 11.9444
close_max close_min
1 16.8893 11.9444
close_max close_min
37 16.9286 14.5889
47 16.9286 14.5889
18 16.8893 11.9444
28 16.8893 11.9444
57 16.8893 11.9444
close_max close_min
2 16.8893 11.9444

  2. Change 'hold_start' into datetime (holding1); got the expected output:

result

close_max close_min
13 16.8893 11.9444
23 16.8893 11.9444
33 16.9286 14.5889
43 16.9286 14.5889
53 16.8893 11.9444
close_max close_min
16 32.56 24.55
26 32.56 24.55
36 30.45 25.90
46 30.45 25.90
56 32.56 24.55
close_max close_min
6 36.6211 24.5361
close_max close_min
4 76.3976 61.7438
close_max close_min
19 38.9398 28.2549
29 38.9398 28.2549
39 25.9433 16.4589
49 25.9433 16.4589
59 38.9398 28.2549
close_max close_min
17 22.4004 14.0188
27 22.4004 14.0188
38 14.8609 10.2045
48 14.8609 10.2045
58 22.4004 14.0188
close_max close_min
5 30.7038 23.6078
close_max close_min
14 120.50 80.3
24 120.50 80.3
34 93.12 70.0
44 93.12 70.0
54 120.50 80.3
close_max close_min
7 39.37 27.12
close_max close_min
12 77.2825 65.6768
22 77.2825 65.6768
32 79.5022 65.4191
42 79.5022 65.4191
52 77.2825 65.6768
close_max close_min
0 34.8514 25.0396
close_max close_min
11 135.105 108.1080
21 135.105 108.1080
30 113.939 98.7243
40 113.939 98.7243
50 135.105 108.1080
close_max close_min
10 1157.48 984.503
20 1157.48 984.503
31 1219.51 1105.090
41 1219.51 1105.090
51 1157.48 984.503
close_max close_min
15 58.1580 43.2304
25 58.1580 43.2304
35 47.4252 39.7685
45 47.4252 39.7685
55 58.1580 43.2304
close_max close_min
3 113.34 91.48
close_max close_min
8 56.0291 40.9621
close_max close_min
9 52.1465 42.1393
close_max close_min
1 31.8664 21.9908
close_max close_min
18 61.8531 48.5440
28 61.8531 48.5440
37 61.8132 47.8151
47 61.8132 47.8151
57 61.8531 48.5440
close_max close_min
2 66.2883 50.5529

  3. Change x["shares"].sum() into int (holding2); got the expected output.
  4. Change x["shares"].sum() into float (holding3); still got the erroneous output.
  5. Use inplace=True in the sort (holding4); got the expected output.
  6. Use .copy() after the sort (holding5); still got the erroneous output.
  7. Use parse_dates=['hold_start'] in read_csv; still got the erroneous output (in contrast to the holding1(x) case).

Expected Output

Correct values after sorting within the groupby apply.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.6.final.0
python-bits : 64
OS : Darwin
OS-release : 19.5.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8

pandas : 1.0.5
numpy : 1.18.4
pytz : 2020.1
dateutil : 2.8.1
pip : 20.1.1
setuptools : 46.4.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.5.1
html5lib : None
pymysql : None
psycopg2 : 2.8.5 (dt dec pq3 ext lo64)
jinja2 : 2.11.2
IPython : 7.14.0
pandas_datareader: 0.8.1
bs4 : None
bottleneck : 1.3.2
fastparquet : None
gcsfs : None
lxml.etree : 4.5.1
matplotlib : 3.2.1
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None

@leohazy leohazy added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 6, 2020
@rhshadrach (Member)

Thanks for reporting this. Could you please make your example self contained, i.e. not depend on an external file? Also, it would be very helpful to make the example as minimal as possible.


leohazy commented Jul 7, 2020

Thanks for reporting this. Could you please make your example self contained, i.e. not depend on an external file? Also, it would be very helpful to make the example as minimal as possible.

I sorted the values by the group key, selected two groups' data into JSON, then read the JSON back into pandas.

a = pd.read_csv('b.csv')
a = a.sort_values(by='companycode')
a = a[['companycode', 'close_max', 'close_min', 'hold_start', 'shares']]
a = a.iloc[:10, :]
b = a.to_json(orient='records')
a = pd.read_json(b, orient='records')

So you can run the holding code with the JSON object below:
b='[{"companycode":1,"close_max":16.8893,"close_min":11.9444,"hold_start":"2020-03-31","shares":8668200.0},{"companycode":1,"close_max":16.9286,"close_min":14.5889,"hold_start":"2019-12-18","shares":5865599.9999999991},{"companycode":1,"close_max":16.9286,"close_min":14.5889,"hold_start":"2019-12-18","shares":5865599.9999999991},{"companycode":1,"close_max":16.8893,"close_min":11.9444,"hold_start":"2020-03-31","shares":7962000.0},{"companycode":1,"close_max":16.8893,"close_min":11.9444,"hold_start":"2020-03-31","shares":862200.0},{"companycode":2,"close_max":32.56,"close_min":24.55,"hold_start":"2020-03-31","shares":3325700.0},{"companycode":2,"close_max":30.45,"close_min":25.9,"hold_start":"2019-12-18","shares":2221900.0},{"companycode":2,"close_max":30.45,"close_min":25.9,"hold_start":"2019-12-18","shares":2221900.0},{"companycode":2,"close_max":32.56,"close_min":24.55,"hold_start":"2020-03-31","shares":3041100.0},{"companycode":2,"close_max":32.56,"close_min":24.55,"hold_start":"2020-03-31","shares":328800.0}]'
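For reference, the posted sample can be loaded and run self-contained. The sketch below uses an abbreviated copy of the JSON above (four of the ten rows); note that the repeated-values bug was reported on pandas 1.0.5, so on recent pandas versions the printed output may already be correct:

```python
import io
import pandas as pd

# Abbreviated copy of the JSON sample above (one row per
# companycode/hold_start pair); the full ten-row string works the same way.
b = (
    '[{"companycode":1,"close_max":16.8893,"close_min":11.9444,'
    '"hold_start":"2020-03-31","shares":8668200.0},'
    '{"companycode":1,"close_max":16.9286,"close_min":14.5889,'
    '"hold_start":"2019-12-18","shares":5865600.0},'
    '{"companycode":2,"close_max":32.56,"close_min":24.55,'
    '"hold_start":"2020-03-31","shares":3325700.0},'
    '{"companycode":2,"close_max":30.45,"close_min":25.9,'
    '"hold_start":"2019-12-18","shares":2221900.0}]'
)
# StringIO wrapping avoids the deprecation of literal-JSON input on newer pandas.
a = pd.read_json(io.StringIO(b), orient="records")

def holding(x):
    x["holding"] = x["shares"].sum()         # mutates the group in place
    return x.sort_values(by=["hold_start"])  # then returns a re-ordered copy

out = a.groupby(by="companycode").apply(holding)
print(out[["close_max", "close_min", "holding"]])
```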


rhshadrach commented Jul 7, 2020

Thanks - that's helpful. Commenting out the call to sort_values produces the correct output, is that correct? If so, this is then a duplicate of #12653. I recommend you sort the values instead by:

b = a.sort_values(['companycode', 'hold_start'])

@rhshadrach rhshadrach added Groupby Apply Apply, Aggregate, Transform, Map labels Jul 7, 2020

leohazy commented Jul 8, 2020

Thanks - that's helpful. Commenting out the call to sort_values produces the correct output, is that correct? If so, this is then a duplicate of #12653. I recommend you sort the values instead by:

b = a.sort_values(['companycode', 'hold_start'])

Thank you! It seems to be caused by the apply function in groupby. So can I still set x["holding"] = x["shares"].sum()? In fact, I need to add some columns after the sort within the groupby. I'm also curious what happens with the sort in apply: why it works in some ways and not in others.
#24614 mentions "In the current implementation apply calls func twice on the first column/row to decide whether it can take a fast or slow code path. This can lead to unexpected behavior if func has side-effects...", but I can't find it in the docs (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html) now. In my case the unexpected values seem to be repeated from the first group, but it's strange that some columns are not affected (they stay the same rather than repeating).

@rhshadrach (Member)

You should view modifying a DataFrame/Series within a groupby.apply in the same spirit as modifying a data structure while iterating over it; e.g.:

values = [1, 2, 3, 4, 5]
result = 0
for k, value in enumerate(values):
    result += value
    del values[k]

It may run, but it often does not give the value you expect; the loop above yields 9 rather than the expected 15.

If you need to perform multiple operations, it is best to separate them out rather than try to do them all at once in a single groupby.apply.
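The separation suggested above can be sketched with the column names from this issue (the miniature DataFrame here is a hypothetical stand-in for b.csv): first add the per-group sum via `transform`, which needs no mutation inside `apply`, then sort the whole frame once by group key and date:

```python
import pandas as pd

# Hypothetical miniature of the b.csv columns used in this issue.
a = pd.DataFrame({
    "companycode": [1, 1, 2, 2],
    "hold_start": ["2020-03-31", "2019-12-18", "2020-03-31", "2019-12-18"],
    "shares": [8668200.0, 5865600.0, 3325700.0, 2221900.0],
})

# Step 1: add the per-company holding without a custom apply.
a["holding"] = a.groupby("companycode")["shares"].transform("sum")

# Step 2: parse dates, then sort once on the full frame.
a["hold_start"] = pd.to_datetime(a["hold_start"], format="%Y-%m-%d")
b = a.sort_values(["companycode", "hold_start"]).reset_index(drop=True)
print(b)
```

Because neither step mutates a group inside `apply`, the side-effect pitfall described above never arises.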

If there are certain operations you're looking to do and aren't sure how to do them, a great place to ask is Stack Overflow.

@rhshadrach rhshadrach added Duplicate Report Duplicate issue or pull request and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 8, 2020
@rhshadrach (Member)

Duplicate of #12653

@rhshadrach rhshadrach marked this as a duplicate of #12653 Jul 8, 2020