BUG: Using pd.to_datetime function in groupby.apply function causes key value error #44026

Closed
3 tasks done
apache-chnsys opened this issue Oct 14, 2021 · 10 comments
Labels
Apply Apply, Aggregate, Transform, Map Bug Groupby

Comments

@apache-chnsys

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the master branch of pandas.

Reproducible Example

import pandas as pd

def test_apply(data):
    print(data)
    data['date'] = pd.to_datetime(data['date'])

dt1 = pd.DataFrame({
                    'key': ['aa', 'bb', 'cc', 'dd'],
                    'date': ['2020-01-01', '2020-01-01', '2020-02-01', '2020-02-01'],
                    'qty': [1.0, 1.0, 3.0, 3.0]})
dt1.groupby("key").apply(lambda x: test_apply(x))

Issue Description

Inside test_apply, pd.to_datetime() is called on the object-typed 'date' column of the input argument data. When data (the x passed in by the lambda) is printed on each call, the expected result is a different key each time: "aa", "bb", "cc", "dd". The actual result is "aa", "aa", "aa", "aa":
key date qty
0 aa 2020-01-01 1.0
key date qty
1 aa 2020-01-01 1.0
key date qty
2 aa 2020-01-01 3.0
key date qty
3 aa 2020-01-01 3.0

Expected Behavior

key date qty
0 aa 2020-01-01 1.0
key date qty
1 bb 2020-01-01 1.0
key date qty
2 cc 2020-01-01 3.0
key date qty
3 dd 2020-01-01 3.0

Installed Versions

python : 3.6.5.final.0
pandas : 1.1.5
@apache-chnsys apache-chnsys added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Oct 14, 2021
@apache-chnsys apache-chnsys changed the title BUG: BUG: Using pd.to_datetime function in groupby.apply function causes key value error Oct 14, 2021
@manu-prakash-choudhary

I think you are hitting this because test_apply has no return value. apply() maps each group to whatever the function returns, and since you did not return anything, I believe that is why you get the same key each time. Also, in your expected behaviour the date is the same in every row, but that will not be the case with the dt1.apply() function.

import pandas as pd

def test_apply(data):
    print(data)
    data['date'] = pd.to_datetime(data['date'])
    return data

dt1 = pd.DataFrame({
                    'key': ['aa', 'bb', 'cc', 'dd'],
                    'date': ['2020-01-01', '2020-01-01', '2020-02-01', '2020-02-01'],
                    'qty': [1.0, 1.0, 3.0, 3.0]})
dt2 = dt1.groupby("key").apply(test_apply)
 

@apache-chnsys
Author

apache-chnsys commented Oct 14, 2021


If I simply return data, there is no problem, but if I return processed data, the problem persists. In addition, if I comment out the to_datetime line and return nothing, there is still no error.

import pandas as pd

def test_apply(data):
  print(data)
  data['date'] = pd.to_datetime(data['date'])
  dt3 = pd.concat([data])
  return dt3

dt1 = pd.DataFrame({'key': ['aa', 'bb', 'cc', 'dd'],
                   'date': ['2020-01-01', '2020-01-01', '2020-02-01', '2020-02-01'],
                   'qty': [1.0, 1.0, 3.0, 3.0]})
data = dt1.groupby("key").apply(lambda x: test_apply(x)).reset_index(drop=True)
print(data)
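
For comparison, here is a minimal sketch that sidesteps the question of mutation by converting the column before grouping. The name `converted`, the summary column names, and the choice of aggregations are illustrative assumptions, not part of the original report.

```python
import pandas as pd

dt1 = pd.DataFrame({
    "key": ["aa", "bb", "cc", "dd"],
    "date": ["2020-01-01", "2020-01-01", "2020-02-01", "2020-02-01"],
    "qty": [1.0, 1.0, 3.0, 3.0],
})

# Convert once for the whole frame; assign() returns a new DataFrame,
# so dt1 itself is not modified.
converted = dt1.assign(date=pd.to_datetime(dt1["date"]))

# Hypothetical per-group summary using named aggregation instead of apply.
summary = converted.groupby("key").agg(
    first_date=("date", "min"),
    total_qty=("qty", "sum"),
)
print(summary)
```

Because the conversion happens outside the grouped operation, no function passed to groupby needs to modify its input.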
 

@rhshadrach
Member

I get the expected output when running the code in the OP on master.

@rhshadrach rhshadrach added Apply Apply, Aggregate, Transform, Map Groupby labels Oct 16, 2021
@rhshadrach
Member

In any case, mutating the pandas object within apply is not supported:

https://pandas.pydata.org/docs/user_guide/gotchas.html#mutating-with-user-defined-function-udf-methods

However, this note in the groupby.apply docs is missing; this is issue #36602
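
A minimal sketch of a non-mutating variant of the function from the report, assuming the goal is simply to get a datetime column back from each group. The helper name `convert_dates` is hypothetical. The columns are selected explicitly because whether the grouping column is passed to the function can differ between pandas versions.

```python
import pandas as pd

def convert_dates(group):
    # Work on a copy so the object handed over by groupby is left untouched.
    out = group.copy()
    out["date"] = pd.to_datetime(out["date"])
    return out

dt1 = pd.DataFrame({
    "key": ["aa", "bb", "cc", "dd"],
    "date": ["2020-01-01", "2020-01-01", "2020-02-01", "2020-02-01"],
    "qty": [1.0, 1.0, 3.0, 3.0],
})

# group_keys=False keeps the original row index instead of prepending the key.
result = dt1.groupby("key", group_keys=False)[["date", "qty"]].apply(convert_dates)
print(result)
```

The copy costs some memory per group, but it keeps the function free of side effects on its input.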

@mroeschke mroeschke removed the Needs Triage Issue that has not been reviewed by a pandas team member label Oct 16, 2021
@CloseChoice
Member

@rhshadrach So is this just an issue that'll be fixed once the docs are updated?

@apache-chnsys
Author

apache-chnsys commented Oct 18, 2021

Well, I think users sometimes overlook this detail. Also, this bug only occurs when all groups have the same number of rows, which may make it difficult for users to track down.
@rhshadrach

def test_apply(data):
   data['date'] = pd.to_datetime(data['date'])
   data = pd.concat([data])
   return data

dt1 = pd.DataFrame({'key': ['aa', 'bb', 'bb', 'cc', 'dd'],
                   'date': ['2020-01-01', '2020-01-01', '2021-10-18', '2020-02-01', '2020-02-01'],
                   'qty': [1.0, 1.0, 2.0, 3.0, 3.0]})
data = dt1.groupby("key").apply(lambda x: test_apply(x)).reset_index(drop=True)
print(data)

result:

  key       date  qty
0  aa 2020-01-01  1.0
1  bb 2020-01-01  1.0
2  bb 2021-10-18  2.0
3  cc 2020-02-01  3.0
4  dd 2020-02-01  3.0

@rhshadrach
Member

@CloseChoice

@rhshadrach So is this just an issue that'll be fixed once the docs are updated?

I do not know, I cannot reproduce the result on master. Can others? This needs to be sorted out first before we can understand what needs to be done to close this issue.

@apache-chnsys

Well, I think users sometimes overlook this detail.

What does "this" refer to here?

Also, this bug only occurs when all groups have the same number of rows, which may make it difficult for users to track down.

I am not aware of any cases where this bug occurs on master; the OP reports this bug exists on master, can you confirm whether or not this is the case?

@apache-chnsys
Author

apache-chnsys commented Oct 20, 2021

@rhshadrach
I tested it again. The Python version where this problem occurred was 3.6; the bug above does not reproduce on 3.8.
However, if I print inside the applied function, the first group is printed twice, as follows:

def test_apply(data):
   print(data)
   data['date'] = pd.to_datetime(data['date'])
   a = pd.concat([data])
   return a

dt1 = pd.DataFrame({'key': ['aa', 'bb'],
                   'date': ['2020-01-01', '2020-01-01'],
                   'qty': [1.0, 1.0]})
data = dt1.groupby("key").apply(lambda x: test_apply(x)).reset_index(drop=True)

result:

  key        date  qty
0  aa  2020-01-01  1.0
  key        date  qty
0  aa  2020-01-01  1.0
  key        date  qty
1  bb  2020-01-01  1.0

@rhshadrach
Member

@apache-chnsys: This is part of the documented behavior. See the Notes section here:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.GroupBy.apply.html
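
If the goal is a side effect such as printing or logging each group, one option is to iterate over the groupby object rather than putting the side effect inside the function passed to apply. The sketch below assumes that approach; the list name `seen` is illustrative.

```python
import pandas as pd

dt1 = pd.DataFrame({
    "key": ["aa", "bb"],
    "date": ["2020-01-01", "2020-01-01"],
    "qty": [1.0, 1.0],
})

seen = []
for key, group in dt1.groupby("key"):
    # Iteration yields each (key, sub-frame) pair exactly once.
    seen.append(key)
    print(group)
```

Here each group is visited once, so the side effect does not depend on how many times apply chooses to call the function.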

@rhshadrach
Member

It seems everything is resolved here, and additional tests are not needed. If I've missed something, please let me know by replying here and this issue can be reopened.


5 participants