Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
ENH: Support nested renaming / selection #26399
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Support nested renaming / selection #26399
Changes from 23 commits
aa43cf6
8bd8e31
10c8f40
2e52653
06a86ec
9e636c1
14f66e6
2c3d11a
cdf9373
2c544f0
c0cd575
386cca1
2f6e1dc
6d8a18a
6c1f567
bcc63f5
769a909
1da90d4
a028f48
0ddd51f
42e69a1
769d7d3
1cee0e2
6369eb1
02d7169
eb9ba8f
7df14d7
cf8db51
9501e82
d65afe4
25dca1a
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would not show the example in a mixed form (as this is something we really don't want to recommend I think?). I would maybe just show it twice, eg first with tuples and then with comment
# using more explicit syntax
with namedtuple. Then it also shows that both result in the same.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess these are technically identifiers instead of keywords
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unrelated to this PR, but there is actually a note a few lines below about "the ordering of the output columns is non-deterministic" that can be removed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be a TypeVar instead of a Union
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to show the namedtuple version here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should encourage the namedtuple version in the docs. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Speaking as a user, no I don't think that's the best.
The old way was delightfully concise (dict of dicts) but not very explicity, the new way with keywords is slightly less concise (repetition of the selection column) but not by much and more consistent so it's pretty OK.
But the named tuple version is very verbose and - - unless my code was designed for other people and need to be self documenting - - in the usual case of using pandas in Notebooks for my own work, once the user understands the concept, using the namedtuple version is just making the code less readable due to the lower ratio of user content (column names and aggfunc names) / boilerplate content (pd.NamedAgg, column=, aggfunc=).
Moreover, letting the simple keyword-tuple version having the most emphasis is more likely to help intermediate and advanced python user think that this system of aggregation can be easily made more dynamic inside code through usage of the **{} construct, where the dict to unpack is constructed by the code, and only is a dict of tuples.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TomAugspurger wrote:
I am not saying not to include namedtuple in the documentation.
The question was whether it should be encouraged or simply mentioned. I think mention is enough.
The non-namedtuple version is pretty straightforward to understand once you see one example: for example with
.agg(average_weight=('height', np.mean))
it is clear by reading this that themean
of theheight
column would go into the columnaverage_weight
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestions:
The keywords are the column names
-->The keywords are the output column names
and, optionally, to mirror the main documentation which is clearer IMHO:
the values should be 2-tuples where the first element is the column selection and the second element is the aggfunc
-->
the values are 2-tuples whose first element is the column to select and the second element is the aggregation function to apply to that column
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a doc-string here