-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: aggregations were getting overwritten if they had the same name #30858
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: aggregations were getting overwritten if they had the same name #30858
Conversation
nice, thanks for the PR @MarcoGorelli I think this PR deserves a new issue other than #30092 , so would suggest to |
65ae0c6
to
5e9fe4e
Compare
Hello @MarcoGorelli! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2020-07-14 18:13:33 UTC |
83398f8
to
f84483b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice job looks pretty good
42e9571
to
33c57a2
Compare
33c57a2
to
3e648e1
Compare
44e562c
to
521bc1d
Compare
Hi @WillAyd - sorry to chase you up, just wanted to ask if there's anything else that needs doing here or if it's alright (or indeed if it's the wrong fix altogether :) ) |
pandas/core/groupby/generic.py
Outdated
|
||
return DataFrame(results, columns=columns) | ||
return {key.label: value for key, value in results.items()} | ||
return DataFrame(self._wrap_aggregated_output(results), columns=columns) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the DataFrame constructor still required here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the test
pytest pandas/tests/groupby/aggregate/test_aggregate.py::test_aggregate_item_by_item
when we get here we have
(Pdb) results
{OutputKey(label='<lambda>', position=0): A
bar 3
foo 5
Name: B, dtype: int64}
(Pdb) self._wrap_aggregated_output(results)
A
bar 3
foo 5
Name: <lambda>, dtype: int64
(Pdb) type(self._wrap_aggregated_output(results))
<class 'pandas.core.series.Series'>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@WillAyd have updated with a call to .to_frame
(if necessary)
Before this is merged, I should add a test case using pd.NamedAgg, xref #34380 |
friendly ping :) |
pandas/core/groupby/generic.py
Outdated
|
||
if any(isinstance(x, DataFrame) for x in results.values()): | ||
# let higher level handle | ||
return results | ||
|
||
return self.obj._constructor_expanddim(results, columns=columns) | ||
if not results: | ||
return DataFrame() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm is this correct? do we have tests that hit this. I would think we would have somthing e.g. columns even if this is empty
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also why is this not just handled in wrap_aggregated_output?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's a test that hits it: pandas/tests/groupby/aggregate/test_aggregate.py::TestNamedAggregationSeries::test_no_args_raises
When trying to move this to wrap_aggregated_output
I ran into #34977, so I'll try to address that first
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this still is quite fishy . if you pass en empty result to self._wrap_aggregated_output
what do you get as output
? I really don't like special cases like this which inevitably hide errors and make groking code way more complex.
So prefer to have _wrap_aggregated_output handle this correctly. you may not even need L333, its possible to pass columns to _wrap_aggregated_output
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback if we pass {}
to _wrap_aggregated_output
we get a KeyError
.
Here's the traceback:
============================= test session starts ==============================
platform linux -- Python 3.8.3, pytest-5.4.3, py-1.8.2, pluggy-0.13.1
rootdir: /home/marco/pandas-dev, inifile: setup.cfg
plugins: xdist-1.32.0, cov-2.10.0, asyncio-0.12.0, hypothesis-5.16.1, forked-1.1.2
collected 1 item
pandas/tests/groupby/aggregate/test_aggregate.py F [100%]
=================================== FAILURES ===================================
________________ TestNamedAggregationSeries.test_no_args_raises ________________
self = <pandas.tests.groupby.aggregate.test_aggregate.TestNamedAggregationSeries object at 0x7f8835d975b0>
def test_no_args_raises(self):
gr = pd.Series([1, 2]).groupby([0, 1])
with pytest.raises(TypeError, match="Must provide"):
gr.agg()
# but we do allow this
> result = gr.agg([])
pandas/tests/groupby/aggregate/test_aggregate.py:555:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pandas/core/groupby/generic.py:247: in aggregate
ret = self._aggregate_multiple_funcs(func)
pandas/core/groupby/generic.py:328: in _aggregate_multiple_funcs
output = self._wrap_aggregated_output(results)
pandas/core/groupby/generic.py:387: in _wrap_aggregated_output
result = self._wrap_series_output(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <pandas.core.groupby.generic.SeriesGroupBy object at 0x7f8835d97ac0>
output = {}, index = Int64Index([0, 1], dtype='int64')
def _wrap_series_output(
self, output: Mapping[base.OutputKey, Union[Series, np.ndarray]], index: Index,
) -> Union[Series, DataFrame]:
"""
Wraps the output of a SeriesGroupBy operation into the expected result.
Parameters
----------
output : Mapping[base.OutputKey, Union[Series, np.ndarray]]
Data to wrap.
index : pd.Index
Index to apply to the output.
Returns
-------
Series or DataFrame
Notes
-----
In the vast majority of cases output and columns will only contain one
element. The exception is operations that expand dimensions, like ohlc.
"""
indexed_output = {key.position: val for key, val in output.items()}
columns = Index(key.label for key in output)
result: Union[Series, DataFrame]
if len(output) > 1:
result = self.obj._constructor_expanddim(indexed_output, index=index)
result.columns = columns
else:
result = self.obj._constructor(
> indexed_output[0], index=index, name=columns[0]
)
E KeyError: 0
pandas/core/groupby/generic.py:362: KeyError
-------------- generated xml file: /tmp/tmp-31663hvopHHCRFEu.xml ---------------
=========================== short test summary info ============================
FAILED pandas/tests/groupby/aggregate/test_aggregate.py::TestNamedAggregationSeries::test_no_args_raises
============================== 1 failed in 0.22s ===============================
The problem is this line which access [0]
on an empty object
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i would just fix this, need to check if len(indexed_output)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So,
elif len(indexed_output):
result = self.obj._constructor(
indexed_output[0], index=index, name=columns[0]
)
else:
result = self.obj._constructor()
?
I can do that, but then I'll still have to address #34977 when the output of _wrap_aggregated_output
is passed to self.obj._constructor_expanddim(results, columns=columns)
.
its possible to pass columns to _wrap_aggregated_output
Are you sure? It seems to only take on argument (other than self
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback would
elif len(indexed_output):
result = self.obj._constructor(
indexed_output[0], index=index, name=columns[0]
)
else:
return None
be an acceptable solution?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually,
elif not columns.empty:
result = self.obj._constructor(
indexed_output[0], index=index, name=columns[0]
)
else:
result = self.obj._constructor_expanddim()
works, because
pd.DataFrame(pd.DataFrame(), columns=[])
is allowed.
No need to modify the return types like this :)
pandas/core/groupby/generic.py
Outdated
|
||
if any(isinstance(x, DataFrame) for x in results.values()): | ||
# let higher level handle | ||
return results | ||
|
||
return self.obj._constructor_expanddim(results, columns=columns) | ||
if not results: | ||
return DataFrame() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this still is quite fishy . if you pass en empty result to self._wrap_aggregated_output
what do you get as output
? I really don't like special cases like this which inevitably hide errors and make groking code way more complex.
So prefer to have _wrap_aggregated_output handle this correctly. you may not even need L333, its possible to pass columns to _wrap_aggregated_output
this needs to go after #34998 which substantially refactors things |
OK, thanks for letting me know, will try rebasing locally to get this ready |
@MarcoGorelli hmm we are going to be refactoring #34998 / delaying it, so let's see if we can get this one in. pls rebase and see if you can address above comments. |
@jreback have merged master Regarding above comments, seems the only outstanding issue is handling when we pass an empty result to |
@TomAugspurger if you'd have a quick look. |
@@ -1101,6 +1101,7 @@ Reshaping | |||
- Bug in :func:`crosstab` when inputs are two Series and have tuple names, the output will keep dummy MultiIndex as columns. (:issue:`18321`) | |||
- :meth:`DataFrame.pivot` can now take lists for ``index`` and ``columns`` arguments (:issue:`21425`) | |||
- Bug in :func:`concat` where the resulting indices are not copied when ``copy=True`` (:issue:`29879`) | |||
- Bug in :meth:`SeriesGroupBy.aggregate` was resulting in aggregations being overwritten when they shared the same name (:issue:`30880`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI: the link to this method won't render, since SeriesGroupBy
isn't in the pands namespace.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry about that - will make sure the build the whatsnew file in the future to check
Thanks @MarcoGorelli! |
This one took me a while (I opened it in January!) so thanks for bearing with me while I worked on it! |
…andas-dev#30858) * 🐛 aggregations were getting overwritten if they had the same name
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
xref #30092