Skip to content

Prevent Unlimited Agg Recursion with Duplicate Col Names #21066

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
May 17, 2018

Conversation

WillAyd
Copy link
Member

@WillAyd WillAyd commented May 15, 2018

The _gotitem implementation for DataFrame seems a little strange so there may be a more comprehensive approach, but this should prevent the issue for the time being

Didn't add whatsnew yet since none existed for 0.23.1. Happy to add if we are OK with this fix

@WillAyd WillAyd changed the title Frame recur bug Prevented Unlimited Agg Recursion with Duplicate Col Names May 15, 2018
@TomAugspurger
Copy link
Contributor

I have #21001 for the whatsnew. Will merge after tagging 0.23.0

@WillAyd WillAyd changed the title Prevented Unlimited Agg Recursion with Duplicate Col Names Prevent Unlimited Agg Recursion with Duplicate Col Names May 15, 2018
@codecov
Copy link

codecov bot commented May 15, 2018

Codecov Report

Merging #21066 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #21066      +/-   ##
==========================================
- Coverage   91.83%   91.83%   -0.01%     
==========================================
  Files         153      153              
  Lines       49495    49497       +2     
==========================================
  Hits        45454    45454              
- Misses       4041     4043       +2
Flag Coverage Δ
#multiple 90.22% <100%> (-0.01%) ⬇️
#single 41.88% <16.66%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/base.py 96.83% <100%> (ø) ⬆️
pandas/core/frame.py 97.23% <100%> (ø) ⬆️
pandas/util/testing.py 84.38% <0%> (-0.21%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 501f041...c0dc3f4. Read the comment docs.

@WillAyd WillAyd force-pushed the frame-recur-bug branch from ac4ddb5 to f0135e8 Compare May 15, 2018 23:12
@@ -5731,7 +5731,12 @@ def diff(self, periods=1, axis=0):
# ----------------------------------------------------------------------
# Function application

def _gotitem(self, key, ndim, subset=None):
def _gotitem(self,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize we don't have an overall strategy for annotations just yet but I had to think through this as I was debugging anyway, so figured I'd put here explicitly for when we turn this on

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok!

@@ -5746,9 +5751,11 @@ def _gotitem(self, key, ndim, subset=None):
"""
if subset is None:
subset = self
elif subset.ndim == 1:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this line actually hit in tests? return self._constructor here doesn't make sense

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this is actually hit by the test case and code that was in place. A Series is a valid value for the subset parameter so I’m forcing it to a DataFrame or else the subsequent slice would fail. All for a better way if you think there’s one

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what operation actually hits this code?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line (which was changed to prevent the unlimited calls):

https://github.com/WillAyd/pandas/blob/4a24f734047d387ce242c36dba16eb69388a3ca1/pandas/core/base.py#L595

This would pass a Series before (unless column names were duplicated), though it never raised an error because of the code in _got_item. It would accept the Series and even use it in a condition, but would always just return a subset of itself...

Definitely convoluted - I think it was inadvertent before the way _got_item was implemented by DataFrame

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no the issue is why you are wrapping it with self._constructor which is a DataFrame here, then you are selecting it out again, just do

return subset i think is enough?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm you're probably right with that - I suppose could just return immediately if ndim == 1. Will try locally and push if it works

@jreback jreback added Bug Indexing Related to indexing on series/frames, not to indexes themselves Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels May 16, 2018
@TomAugspurger
Copy link
Contributor

FYI, the whasnew is available now.

@TomAugspurger TomAugspurger added this to the 0.23.1 milestone May 16, 2018
@jreback
Copy link
Contributor

jreback commented May 17, 2018

lgtm. ping on green. note am happy to simplification / refactoring of code here, though there are a lot of usecases so this is tricky.

@WillAyd
Copy link
Member Author

WillAyd commented May 17, 2018

Failed on AppVeyor but does not appear to be related (looks like a UserWarning wasn't being thrown for an io test with the PythonParser?)

@TomAugspurger
Copy link
Contributor

Looks good. I'll keep an eye on the appveyor test in case it's relevant.

@TomAugspurger TomAugspurger merged commit d623ffd into pandas-dev:master May 17, 2018
@WillAyd WillAyd deleted the frame-recur-bug branch May 17, 2018 14:31
jorisvandenbossche pushed a commit to jorisvandenbossche/pandas that referenced this pull request Jun 8, 2018
jorisvandenbossche pushed a commit that referenced this pull request Jun 9, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Indexing Related to indexing on series/frames, not to indexes themselves Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

segfault in DataFrame.apply with duplicate column names and multiple aggfuncs
4 participants