
BUG: sort_values in groupby makes some values lost #35137


Closed
leohazy opened this issue Jul 6, 2020 · 6 comments
Labels: Apply (Apply, Aggregate, Transform, Map), Bug, Duplicate Report (Duplicate issue or pull request), Groupby

Comments


leohazy commented Jul 6, 2020

  • [x] I have checked that this issue has not already been reported.

  • [x] I have confirmed this bug exists on the latest version of pandas.

  • [ ] (optional) I have confirmed this bug exists on the master branch of pandas.


Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

  • Source data: refer to b.csv.
import pandas as pd

a = pd.read_csv('https://raw.githubusercontent.com/leohazy/bug/master/b.csv')

def holding0(x):
    x["holding"] = x["shares"].sum()
    print(x[['close_max', 'close_min']])
    return x

def holding(x):
    x["holding"] = x["shares"].sum()
    x = x.sort_values(by=["hold_start"])
    print(x[['close_max', 'close_min']])
    return x

def holding1(x):
    x["holding"] = x["shares"].sum()
    x['hold_start'] = x['hold_start'].apply(lambda d: pd.to_datetime(d, format="%Y-%m-%d"))
    x = x.sort_values(by=["hold_start"])
    print(x[['close_max', 'close_min']])
    return x

def holding2(x):
    x["holding"] = int(x["shares"].sum())
    x = x.sort_values(by=["hold_start"])
    print(x[['close_max', 'close_min']])
    return x

def holding3(x):
    x["holding"] = float(x["shares"].sum())
    x = x.sort_values(by=["hold_start"])
    print(x[['close_max', 'close_min']])
    return x

def holding4(x):
    x["holding"] = x["shares"].sum()
    x.sort_values(by=["hold_start"], inplace=True)
    print(x[['close_max', 'close_min']])
    return x

def holding5(x):
    x["holding"] = x["shares"].sum()
    x = x.sort_values(by=["hold_start"]).copy()
    print(x[['close_max', 'close_min']])
    return x

b = a.groupby(by="companycode").apply(holding)

Problem description

  • After sorting within groupby, several columns contain only repeated values, e.g. ratio, shares, close_max, close_min (the columns after 'ratio').

  1. Apply with the holding(x) function; the 'close_max' and 'close_min' columns come back with repeated values:

result

close_max close_min

33 16.9286 14.5889
43 16.9286 14.5889
13 16.8893 11.9444
23 16.8893 11.9444
53 16.8893 11.9444
close_max close_min
36 16.9286 14.5889
46 16.9286 14.5889
16 16.8893 11.9444
26 16.8893 11.9444
56 16.8893 11.9444
close_max close_min
6 16.8893 11.9444
close_max close_min
4 16.8893 11.9444
close_max close_min
39 16.9286 14.5889
49 16.9286 14.5889
19 16.8893 11.9444
29 16.8893 11.9444
59 16.8893 11.9444
close_max close_min
38 16.9286 14.5889
48 16.9286 14.5889
17 16.8893 11.9444
27 16.8893 11.9444
58 16.8893 11.9444
close_max close_min
5 16.8893 11.9444
close_max close_min
34 16.9286 14.5889
44 16.9286 14.5889
14 16.8893 11.9444
24 16.8893 11.9444
54 16.8893 11.9444
close_max close_min
7 16.8893 11.9444
close_max close_min
32 16.9286 14.5889
42 16.9286 14.5889
12 16.8893 11.9444
22 16.8893 11.9444
52 16.8893 11.9444
close_max close_min
0 16.8893 11.9444
close_max close_min
30 16.9286 14.5889
40 16.9286 14.5889
11 16.8893 11.9444
21 16.8893 11.9444
50 16.8893 11.9444
close_max close_min
31 16.9286 14.5889
41 16.9286 14.5889
10 16.8893 11.9444
20 16.8893 11.9444
51 16.8893 11.9444
close_max close_min
35 16.9286 14.5889
45 16.9286 14.5889
15 16.8893 11.9444
25 16.8893 11.9444
55 16.8893 11.9444
close_max close_min
3 16.8893 11.9444
close_max close_min
8 16.8893 11.9444
close_max close_min
9 16.8893 11.9444
close_max close_min
1 16.8893 11.9444
close_max close_min
37 16.9286 14.5889
47 16.9286 14.5889
18 16.8893 11.9444
28 16.8893 11.9444
57 16.8893 11.9444
close_max close_min
2 16.8893 11.9444

  2. Change 'hold_start' into datetime (holding1); got the expected output:

result

close_max close_min
13 16.8893 11.9444
23 16.8893 11.9444
33 16.9286 14.5889
43 16.9286 14.5889
53 16.8893 11.9444
close_max close_min
16 32.56 24.55
26 32.56 24.55
36 30.45 25.90
46 30.45 25.90
56 32.56 24.55
close_max close_min
6 36.6211 24.5361
close_max close_min
4 76.3976 61.7438
close_max close_min
19 38.9398 28.2549
29 38.9398 28.2549
39 25.9433 16.4589
49 25.9433 16.4589
59 38.9398 28.2549
close_max close_min
17 22.4004 14.0188
27 22.4004 14.0188
38 14.8609 10.2045
48 14.8609 10.2045
58 22.4004 14.0188
close_max close_min
5 30.7038 23.6078
close_max close_min
14 120.50 80.3
24 120.50 80.3
34 93.12 70.0
44 93.12 70.0
54 120.50 80.3
close_max close_min
7 39.37 27.12
close_max close_min
12 77.2825 65.6768
22 77.2825 65.6768
32 79.5022 65.4191
42 79.5022 65.4191
52 77.2825 65.6768
close_max close_min
0 34.8514 25.0396
close_max close_min
11 135.105 108.1080
21 135.105 108.1080
30 113.939 98.7243
40 113.939 98.7243
50 135.105 108.1080
close_max close_min
10 1157.48 984.503
20 1157.48 984.503
31 1219.51 1105.090
41 1219.51 1105.090
51 1157.48 984.503
close_max close_min
15 58.1580 43.2304
25 58.1580 43.2304
35 47.4252 39.7685
45 47.4252 39.7685
55 58.1580 43.2304
close_max close_min
3 113.34 91.48
close_max close_min
8 56.0291 40.9621
close_max close_min
9 52.1465 42.1393
close_max close_min
1 31.8664 21.9908
close_max close_min
18 61.8531 48.5440
28 61.8531 48.5440
37 61.8132 47.8151
47 61.8132 47.8151
57 61.8531 48.5440
close_max close_min
2 66.2883 50.5529

  3. Change x["shares"].sum() into int (holding2); got the expected output.
  4. Change x["shares"].sum() into float (holding3); still got the erroneous output.
  5. Use inplace=True in the sort (holding4); got the expected output.
  6. Use .copy() after the sort (holding5); still got the erroneous output.
  7. Use parse_dates=['hold_start'] in read_csv; still got the erroneous output (in contrast to the holding1(x) case).

Expected Output

Correct values after sorting within the groupby apply.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.6.final.0
python-bits : 64
OS : Darwin
OS-release : 19.5.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8

pandas : 1.0.5
numpy : 1.18.4
pytz : 2020.1
dateutil : 2.8.1
pip : 20.1.1
setuptools : 46.4.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.5.1
html5lib : None
pymysql : None
psycopg2 : 2.8.5 (dt dec pq3 ext lo64)
jinja2 : 2.11.2
IPython : 7.14.0
pandas_datareader: 0.8.1
bs4 : None
bottleneck : 1.3.2
fastparquet : None
gcsfs : None
lxml.etree : 4.5.1
matplotlib : 3.2.1
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None

@leohazy leohazy added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 6, 2020
@rhshadrach (Member)

Thanks for reporting this. Could you please make your example self contained, i.e. not depend on an external file? Also, it would be very helpful to make the example as minimal as possible.


leohazy commented Jul 7, 2020

Thanks for reporting this. Could you please make your example self contained, i.e. not depend on an external file? Also, it would be very helpful to make the example as minimal as possible.

I sorted the values by the group key, selected two groups' data into JSON, then read the JSON back into pandas.

a = pd.read_csv('b.csv')
a = a.sort_values(by='companycode')
a = a[['companycode', 'close_max', 'close_min', 'hold_start', 'shares']]
a = a.iloc[:10, :]
b = a.to_json(orient='records')
a = pd.read_json(b, orient='records')

So you can run the holding code with the JSON object below:
b='[{"companycode":1,"close_max":16.8893,"close_min":11.9444,"hold_start":"2020-03-31","shares":8668200.0},{"companycode":1,"close_max":16.9286,"close_min":14.5889,"hold_start":"2019-12-18","shares":5865599.9999999991},{"companycode":1,"close_max":16.9286,"close_min":14.5889,"hold_start":"2019-12-18","shares":5865599.9999999991},{"companycode":1,"close_max":16.8893,"close_min":11.9444,"hold_start":"2020-03-31","shares":7962000.0},{"companycode":1,"close_max":16.8893,"close_min":11.9444,"hold_start":"2020-03-31","shares":862200.0},{"companycode":2,"close_max":32.56,"close_min":24.55,"hold_start":"2020-03-31","shares":3325700.0},{"companycode":2,"close_max":30.45,"close_min":25.9,"hold_start":"2019-12-18","shares":2221900.0},{"companycode":2,"close_max":30.45,"close_min":25.9,"hold_start":"2019-12-18","shares":2221900.0},{"companycode":2,"close_max":32.56,"close_min":24.55,"hold_start":"2020-03-31","shares":3041100.0},{"companycode":2,"close_max":32.56,"close_min":24.55,"hold_start":"2020-03-31","shares":328800.0}]'
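For reference, the posted sample can be loaded and run self-contained. The sketch below uses an abbreviated copy of the JSON above (four of the ten rows); note that the repeated-values bug was reported on pandas 1.0.5, so on recent pandas versions the printed output may already be correct:

```python
import io
import pandas as pd

# Abbreviated copy of the JSON sample above (one row per
# companycode/hold_start pair); the full ten-row string works the same way.
b = (
    '[{"companycode":1,"close_max":16.8893,"close_min":11.9444,'
    '"hold_start":"2020-03-31","shares":8668200.0},'
    '{"companycode":1,"close_max":16.9286,"close_min":14.5889,'
    '"hold_start":"2019-12-18","shares":5865600.0},'
    '{"companycode":2,"close_max":32.56,"close_min":24.55,'
    '"hold_start":"2020-03-31","shares":3325700.0},'
    '{"companycode":2,"close_max":30.45,"close_min":25.9,'
    '"hold_start":"2019-12-18","shares":2221900.0}]'
)
# StringIO wrapping avoids the deprecation of literal-JSON input on newer pandas.
a = pd.read_json(io.StringIO(b), orient="records")

def holding(x):
    x["holding"] = x["shares"].sum()         # mutates the group in place
    return x.sort_values(by=["hold_start"])  # then returns a re-ordered copy

out = a.groupby(by="companycode").apply(holding)
print(out[["close_max", "close_min", "holding"]])
```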


rhshadrach commented Jul 7, 2020

Thanks - that's helpful. Commenting out the call to sort_values produces the correct output, is that correct? If so, this is then a duplicate of #12653. I recommend you sort the values instead by:

b = a.sort_values(['companycode', 'hold_start'])

@rhshadrach rhshadrach added Groupby Apply Apply, Aggregate, Transform, Map labels Jul 7, 2020

leohazy commented Jul 8, 2020

Thanks - that's helpful. Commenting out the call to sort_values produces the correct output, is that correct? If so, this is then a duplicate of #12653. I recommend you sort the values instead by:

b = a.sort_values(['companycode', 'hold_start'])

Thank you! It seems to be caused by the apply function in groupby. So can I still set x["holding"] = x["shares"].sum()? In fact, I need to add some columns after the sort within the groupby. I'm also curious what happens with the sort in apply: why it works in some ways and not in others.
#24614 mentions "In the current implementation apply calls func twice on the first column/row to decide whether it can take a fast or slow code path. This can lead to unexpected behavior if func has side-effects...", but I can't find it in the docs (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html) now. In my case the unexpected values seem to be repeated from the first group, but it's strange that some columns are not affected (they stay the same rather than repeating).

@rhshadrach (Member)

You should view modifying a DataFrame/Series within a groupby.apply in the same spirit as modifying a data structure while iterating over it; e.g.:

values = [1, 2, 3, 4, 5]
result = 0
for k, value in enumerate(values):
    result += value
    del values[k]

It may run, but it often does not give the value you expect; the loop above yields 9 rather than the expected 15.

If you need to perform multiple operations, it is best to separate them out rather than try to do them all at once in a single groupby.apply.
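The separation suggested above can be sketched with the column names from this issue (the miniature DataFrame here is a hypothetical stand-in for b.csv): first add the per-group sum via `transform`, which needs no mutation inside `apply`, then sort the whole frame once by group key and date:

```python
import pandas as pd

# Hypothetical miniature of the b.csv columns used in this issue.
a = pd.DataFrame({
    "companycode": [1, 1, 2, 2],
    "hold_start": ["2020-03-31", "2019-12-18", "2020-03-31", "2019-12-18"],
    "shares": [8668200.0, 5865600.0, 3325700.0, 2221900.0],
})

# Step 1: add the per-company holding without a custom apply.
a["holding"] = a.groupby("companycode")["shares"].transform("sum")

# Step 2: parse dates, then sort once on the full frame.
a["hold_start"] = pd.to_datetime(a["hold_start"], format="%Y-%m-%d")
b = a.sort_values(["companycode", "hold_start"]).reset_index(drop=True)
print(b)
```

Because neither step mutates a group inside `apply`, the side-effect pitfall described above never arises.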

If there are certain operations you're looking to do and aren't sure how to do them, a great place to ask is Stack Overflow.

@rhshadrach rhshadrach added Duplicate Report Duplicate issue or pull request and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 8, 2020
@rhshadrach (Member)

Duplicate of #12653

@rhshadrach rhshadrach marked this as a duplicate of #12653 Jul 8, 2020