Skip to content

BUG: Fix groupby.apply #28662

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 22 commits into from
Jan 1, 2020
Merged

BUG: Fix groupby.apply #28662

merged 22 commits into from
Jan 1, 2020

Conversation

dsaxton
Copy link
Member

@dsaxton dsaxton commented Sep 27, 2019

Makes sure that the output of groupby.apply is built up by value instead of by reference in reduction.pyx to avoid the behavior from #28652.

@gfyoung gfyoung added the Indexing Related to indexing on series/frames, not to indexes themselves label Sep 28, 2019
@dsaxton
Copy link
Member Author

dsaxton commented Sep 28, 2019

Somewhat unrelated, but does isort skip over pyx files during checks? I've noticed it doesn't like the import order / formatting of reduction.pyx.

@alimcmaster1
Copy link
Member

Somewhat unrelated, but does isort skip over pyx files during checks? I've noticed it doesn't like the import order / formatting of reduction.pyx.

Correct the --recursive option only actually checks .py and .pyi files. As per isort code here.

Even though one can explicitly sort a given pyx file eg. running isort pandas/_libs/reduction.pyx works. Maybe @timothycrosley knows if this is expected behaviour?

cc. @jbrockmendel to comment on if its worth us bothering to isort .pyx files?

@timothycrosley
Copy link

@alimcmaster1, yes this is intended, for now. There has been an open ticket for some time to add full Cython awareness (cimports, cdefs, etc...) but for now it remains incomplete. Because of this, we currently intentionally don't automatically include pyx files. Once this ticket is resolved, isort will include pyx files by default.

Hope this is helpful!

~Timothy

@jbrockmendel
Copy link
Member

jbrockmendel commented Oct 1, 2019

to comment on if its worth us bothering to isort .pyx files?

I've tried to make these internally consistent, but isort wont work on these since they are not valid python.

update looks like I was wrong about isort not working on them. still for now "get it close enough manually" makes sense to me

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm @jreback

@jreback jreback added this to the 1.0 milestone Oct 22, 2019
@jreback
Copy link
Contributor

jreback commented Oct 22, 2019

lgtm. can you rebase and small comment. ping on green.

@jbrockmendel
Copy link
Member

Another approach we could take would be to update some combination of libreduction._check_result_array and libreduction._extract_result to check for Index objects and redirect us back to the pure-python implementation, which already does this corrrectly.

@alimcmaster1
Copy link
Member

@dsaxton - feels like this is almost there? Seems like the test failures mentioned above are now resolved.

@jreback
Copy link
Contributor

jreback commented Dec 27, 2019

can you merge master and will look again

@jreback jreback merged commit 7c9042a into pandas-dev:master Jan 1, 2020
@jreback
Copy link
Contributor

jreback commented Jan 1, 2020

thanks @dsaxton

not generally happy with the way we do reductions in cython, but a larger discussion / refactor effort.

@dsaxton dsaxton deleted the grp-apply branch January 1, 2020 18:52
hweecat pushed a commit to hweecat/pandas that referenced this pull request Jan 1, 2020
@jorisvandenbossche
Copy link
Member

@dsaxton this caused a performance regression for some of our benchmarks, eg https://pandas.pydata.org/speed/pandas/#groupby.Apply.time_copy_overhead_single_col?commits=6efc2379-b9de33e3. Now, it might be this is unavoidable since the extra copy is to avoid a bug, but can you take a look at that benchmark?

@dsaxton
Copy link
Member Author

dsaxton commented Jan 7, 2020

@dsaxton this caused a performance regression for some of our benchmarks, eg https://pandas.pydata.org/speed/pandas/#groupby.Apply.time_copy_overhead_single_col?commits=6efc2379-b9de33e3. Now, it might be this is unavoidable since the extra copy is to avoid a bug, but can you take a look at that benchmark?

@jorisvandenbossche Interesting, it looks like the copy was previously avoided because each group was getting copied by the function in the benchmark, so the identity check on the indexes wasn't triggering (if piece.index is chunk.index).

Not really sure what the best approach would be here; would you recommend adding an explicit check like piece.index is not chunk.index somewhere so that we know copying isn't necessary, or is that getting a bit too messy?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Groupby Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
Development

Successfully merging this pull request may close these issues.

groupby indexing is giving the wrong index
8 participants