
Prevent Unlimited Agg Recursion with Duplicate Col Names #21066


Merged

merged 10 commits on May 17, 2018
6 changes: 3 additions & 3 deletions pandas/core/base.py
@@ -590,9 +590,10 @@ def _aggregate_multiple_funcs(self, arg, _level, _axis):

         # multiples
         else:
-            for col in obj:
+            for index, col in enumerate(obj):
                 try:
-                    colg = self._gotitem(col, ndim=1, subset=obj[col])
+                    colg = self._gotitem(col, ndim=1,
+                                         subset=obj.iloc[:, index])
                     results.append(colg.aggregate(arg))
                     keys.append(col)
                 except (TypeError, DataError):
@@ -675,7 +676,6 @@ def _gotitem(self, key, ndim, subset=None):
         subset : object, default None
             subset to act on
         """
-
         # create a new object to prevent aliasing
         if subset is None:
             subset = self.obj
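For context on the obj.iloc[:, index] change above (an illustration, not part of the diff): with a duplicated column label, label-based selection no longer narrows to a single Series, which appears to be what kept sending the aggregation back through the multi-column path. A minimal sketch:

```python
import pandas as pd

# With a duplicated column label, obj[col] matches every column with that
# label and returns a 2-D object; positional selection always yields exactly
# one column as a Series.
obj = pd.DataFrame([[0, 1], [2, 3]], columns=['a', 'a'])

print(type(obj['a']))        # <class 'pandas.core.frame.DataFrame'>
print(type(obj.iloc[:, 0]))  # <class 'pandas.core.series.Series'>
```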
11 changes: 9 additions & 2 deletions pandas/core/frame.py
@@ -5731,7 +5731,12 @@ def diff(self, periods=1, axis=0):
     # ----------------------------------------------------------------------
     # Function application

-    def _gotitem(self, key, ndim, subset=None):
+    def _gotitem(self,
Member Author:

I realize we don't have an overall strategy for annotations just yet, but I had to think through this as I was debugging anyway, so I figured I'd put it here explicitly for when we turn this on.

Contributor:

ok!
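As an aside (not part of the PR): the annotations added just below use the PEP 484 comment form, presumably because pandas still supported Python 2 at the time. With inline annotation syntax the signature would read roughly like this sketch:

```python
from typing import List, Optional, Union

from pandas import DataFrame, Series


class _SignatureSketch:  # hypothetical host class, only to show the method signature
    def _gotitem(self,
                 key: Union[str, List[str]],
                 ndim: int,
                 subset: Optional[Union[Series, DataFrame]] = None,
                 ) -> Union[Series, DataFrame]:
        ...
```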

+                 key,  # type: Union[str, List[str]]
+                 ndim,  # type: int
+                 subset=None  # type: Union[Series, DataFrame, None]
+                 ):
+        # type: (...) -> Union[Series, DataFrame]
"""
sub-classes to define
return a sliced object
@@ -5746,9 +5751,11 @@ def _gotitem(self, key, ndim, subset=None):
"""
if subset is None:
subset = self
elif subset.ndim == 1:
Contributor:

is this line actually hit in tests? return self._constructor here doesn't make sense

Member Author:

Yes, this is actually hit by the test case and code that was in place. A Series is a valid value for the subset parameter, so I'm forcing it to a DataFrame or else the subsequent slice would fail. I'm all for a better way if you think there's one.

Contributor:

what operation actually hits this code?

Member Author:

This line (which was changed to prevent the unlimited calls):

https://github.com/WillAyd/pandas/blob/4a24f734047d387ce242c36dba16eb69388a3ca1/pandas/core/base.py#L595

This would pass a Series before (unless column names were duplicated), though it never raised an error because of the code in _gotitem. It would accept the Series and even use it in a condition, but would always just return a subset of itself...

Definitely convoluted - I think it was inadvertent before, given the way _gotitem was implemented by DataFrame.

Contributor:

No, the issue is why you are wrapping it with self._constructor, which is a DataFrame here, and then selecting it out again. Just doing

return subset

is enough, I think?

Member Author:

Hmm, you're probably right about that - I suppose I could just return immediately if ndim == 1. Will try locally and push if it works.

+            subset = self._constructor(subset)

         # TODO: _shallow_copy(subset)?
-        return self[key]
+        return subset[key]

     _agg_doc = dedent("""
     The aggregation operations are always performed over an axis, either the
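Putting the interleaved hunk above back together, the merged DataFrame._gotitem body reads roughly as follows (a condensed paraphrase, not the authoritative source; the review thread above floats returning the subset directly when ndim == 1 as a possible simplification):

```python
# Condensed paraphrase of DataFrame._gotitem as merged (a method of DataFrame;
# docstring and type comments omitted).
def _gotitem(self, key, ndim, subset=None):
    # sub-classes to define; return a sliced object
    if subset is None:
        subset = self
    elif subset.ndim == 1:
        # a Series is a valid subset here; wrap it in a DataFrame via
        # self._constructor so the label slice below works
        subset = self._constructor(subset)

    # TODO: _shallow_copy(subset)?
    return subset[key]
```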
8 changes: 8 additions & 0 deletions pandas/tests/frame/test_apply.py
@@ -554,6 +554,14 @@ def test_apply_non_numpy_dtype(self):
         result = df.apply(lambda x: x)
         assert_frame_equal(result, df)

+    def test_apply_dup_names_multi_agg(self):
+        # GH 21063
+        df = pd.DataFrame([[0, 1], [2, 3]], columns=['a', 'a'])
+        expected = pd.DataFrame([[0, 1]], columns=['a', 'a'], index=['min'])
+        result = df.agg(['min'])
+
+        tm.assert_frame_equal(result, expected)


 class TestInferOutputShape(object):
     # the user has supplied an opaque UDF where
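Outside the test suite, the behavior the new test locks in looks like this (a sketch assuming a pandas build with this patch applied; per the PR title, the same call previously recursed without terminating):

```python
import pandas as pd

df = pd.DataFrame([[0, 1], [2, 3]], columns=['a', 'a'])

# Each duplicate-named column is aggregated independently, by position.
print(df.agg(['min']))
#      a  a
# min  0  1
```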