Skip to content

groupby().agg() calls the user function for an extra time with empty inputs in 1.0.0 #31760

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
icexelloss opened this issue Feb 6, 2020 · 10 comments · Fixed by #32121
Closed
Labels
Groupby Regression Functionality that used to work in a prior pandas version
Milestone

Comments

@icexelloss
Copy link
Contributor

icexelloss commented Feb 6, 2020

Code Sample, a copy-pastable example if possible

df = pd.DataFrame({"key": ["a", "b", "c", "c"], "value": [1, 2, 3, 4]})                                                                                                                                                                                                                                                                                                                                                                                                                        

def foo(x): 
    print(f"called with {x}") 
    return len(x) 

df.groupby("key")["value"].agg(foo)

Problem description

In 1.0.0, this code outputs:

called with Series([], Name: value, dtype: int64)
called with 0    1
Name: value, dtype: int64
called with 1    2
Name: value, dtype: int64
called with 2    3
3    4
Name: value, dtype: int64
Out[10]: 
key
a    1
b    1
c    2
Name: value, dtype: int64

In 0.25.3, the code outputs:

In [5]:  df.groupby("key")["value"].agg(foo)                                                                                                                                                                                                                                                                                                                                                                                                                                                           
called with 0    1
Name: value, dtype: int64
called with 1    2
Name: value, dtype: int64
called with 2    3
3    4
Name: value, dtype: int64
Out[5]: 
key
a    1
b    1
c    2
Name: value, dtype: int64

In 1.0.0, foo is called one more time with empty input, which can break user code

Note: We receive a lot of issues on our GitHub tracker, so it is very possible that your issue has been posted before. Please check first before submitting so that we do not have to handle and close duplicates!

Note: Many problems can be resolved by simply upgrading pandas to the latest version. Before submitting, please check if that solution works for you. If possible, you may want to check if master addresses this issue, but that is not necessary.

For documentation-related issues, you can check the latest versions of the docs on master here:

https://pandas-docs.github.io/pandas-docs-travis/

If the issue has not been resolved there, go ahead and file it in the issue tracker.

Expected Output

Output of pd.show_versions()

[paste the output of pd.show_versions() here below this line]

@icexelloss icexelloss changed the title groupby().agg() calls the user function for an exact time with empty inputs groupby().agg() calls the user function for an extra time with empty inputs Feb 6, 2020
@icexelloss icexelloss changed the title groupby().agg() calls the user function for an extra time with empty inputs groupby().agg() calls the user function for an extra time with empty inputs in 1.0.0 Feb 6, 2020
@mroeschke mroeschke assigned mroeschke and unassigned mroeschke Feb 7, 2020
@mroeschke
Copy link
Member

@jbrockmendel looks related to one of your changes 08087d6. Is this avoidable?

@jbrockmendel
Copy link
Member

looks related to one of your changes 08087d6. Is this

Not at the moment, xref #31759.

@icexelloss
Copy link
Contributor Author

icexelloss commented Feb 11, 2020

Hi thanks for the response! Do we think this is a bug and that we should restore the original behavior?

@jbrockmendel
Copy link
Member

Do we think this is a bug and that we should restore the original behavior?

Not really. This behavior has been in place for groupby.apply for a while.

@icexelloss
Copy link
Contributor Author

I am not it's true that the behavior has been in place for a while. As far as I can tell this is behavior introduced in 1.0.0, which has just been released.

Also this behavior doesn't totally make sense to me - why groupby.agg() needs to call the user function with empty input, it feels like a side effect of the implementation detail and not really an intentional behavior change to improve user ability.

I fully respect your decision but want to explore here whether this is the correct thing to do. We do find this new behavior breaking our existing code.

@jreback jreback added this to the 1.0.2 milestone Feb 12, 2020
@jreback jreback added the Regression Functionality that used to work in a prior pandas version label Feb 12, 2020
@jreback
Copy link
Contributor

jreback commented Feb 12, 2020

Do we think this is a bug and that we should restore the original behavior?

Not really. This behavior has been in place for groupby.apply for a while.

@jbrockmendel this appears to be a regression

@jbrockmendel
Copy link
Member

I am not it's true that the behavior has been in place for a while.

I was referring to previously-existing behavior in groupby.apply, not groupby.agg.

Also this behavior doesn't totally make sense to me - why groupby.agg() needs to call the user function with empty input, it feels like a side effect of the implementation detail and not really an intentional behavior change to improve user ability.

We have a fastpath that goes through cython and a slowpath that stays in python. The call on an empty input is part of an attempt to figure out if we can go through the fastpath. (FWIW, I'm increasingly of the opinion that the cython fastpath isnt worth the maintenance burden)

We do find this new behavior breaking our existing code.

As of the last time I worked on this code (circa October I think) this was chosen as the least-bad option available. If you'd like to make a PR with a better alternative, go for it.

@jreback
Copy link
Contributor

jreback commented Feb 19, 2020

@jbrockmendel something broke in 1.0.x for .agg
as
this is a regression

@WillAyd maybe have some insight as iirc some of this was touched

@jbrockmendel
Copy link
Member

I think #32083 would address this.

@jreback
Copy link
Contributor

jreback commented Feb 19, 2020

I think #32083 would address this.

hmm that’s pretty big to backport

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Groupby Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants