DOC: improve groupby reference docs #6944

Closed
5 of 9 tasks
jorisvandenbossche opened this issue Apr 23, 2014 · 6 comments
Labels
Docs good first issue Groupby Master Tracker High level tracker for similar issues

Comments

@jorisvandenbossche
Member

jorisvandenbossche commented Apr 23, 2014

An overview of the reference doc on groupby is given here: http://pandas.pydata.org/pandas-docs/dev/api.html#groupby (apart from the extensive user guide: http://pandas.pydata.org/pandas-docs/dev/groupby.html)

There are some things that could use some improvement:

  • add some missing functions to the overview in api.rst (Groupbydocs #8231)
    • GroupBy.filter
    • first/last/nth
    • count, cumcount, ..
    • name: not sure what the purpose of this is
  • add the GroupBy object itself to the api docs (and so automatically all its methods) (DOC: SeriesGroupby/DataFrameGroupBy is missing class documentaion from doc index #19302)
  • put all relevant docstrings in the GroupBy class, and not only in the subclasses DataFrameGroupBy, SeriesGroupBy (eg now the aggregate and transform docstrings of GroupBy are empty, but are more elaborate in the subclasses) (Groupbydocs #8231)
  • general clean-up of all the docstrings
    • especially the apply docstring is not very clear to me
  • expand DataFrame/Series.groupby() docstring:
    • clearly list all possibilities for the by arg (and provide some short examples in the 'Examples' section)
  • document the whitelisted methods: (Groupbydocs #8231)
    • this could eg be done by injecting it in the docstring automatically based on _apply_whitelist
    • or alternatively by ensuring they appear in the methods list of the GroupBy class (which is not the case at the moment, only in instantiated objects) (see also discussion in No API reference for DataFrameGroupBy and "combining" step #2644)
  • More clearly document the DataFrameGroupBy and SeriesGroupBy classes (Groupbydocs #8231)
    • at least mention them in the docs
    • one idea is to have a DataFrameGroupBy api pages that just redirects to the general GroupBy page
  • Add docstrings to the wrapped whitelisted functions. Eg at present g = df.groupby(...); g.count? is returning <no docstring> (see Docs for lurking groupby methods #4500 (comment) for explanation how)
  • Make a clear distinction about what to expect for the return values of a grouped-apply: head/tail/nth are basically filter-type functions, fillna/shift are transformers, and almost everything else is a reducer (e.g. sum/mean/describe), while apply/agg can be any of the above. Maybe this needs a separate section. (And of course as_index complicates all of this.)
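
The filter/transformer/reducer distinction in the last item can be illustrated with a minimal sketch. The frame and column names below are made up for illustration; the point is only how the index of each result relates to the original rows.

```python
import pandas as pd

# Toy data, invented for this illustration.
df = pd.DataFrame({"key": ["a", "a", "b", "b"], "val": [1, 2, 3, 4]})
g = df.groupby("key")["val"]

reduced = g.sum()        # reducer: one value per group, indexed by group key
transformed = g.shift()  # transformer: same index as the original rows
filtered = g.head(1)     # filter-type: a subset of the original rows

print(reduced.index.tolist())      # ['a', 'b']
print(transformed.index.tolist())  # [0, 1, 2, 3]
print(filtered.index.tolist())     # [0, 2]
```

The reducer result is keyed by group, while the transformer and filter-type results keep the original row labels.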

If someone wants to tackle this (or parts of this), go ahead!
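
As a starting point for the "by" item in the list above, here is a rough sketch of some forms the argument accepts. The data is invented, and this is not an exhaustive list of the possibilities.

```python
import pandas as pd

# Toy data, invented for this illustration.
df = pd.DataFrame(
    {"city": ["x", "y", "x", "y"], "year": [1, 1, 2, 2], "n": [10, 20, 30, 40]},
    index=["r0", "r1", "r2", "r3"],
)

# A single column label.
by_label = df.groupby("city")["n"].sum()

# A list of column labels (the result has a MultiIndex).
by_labels = df.groupby(["city", "year"])["n"].sum()

# A function, called on each index label.
by_func = df.groupby(lambda label: label in ("r0", "r1"))["n"].sum()

# A dict mapping index labels to group names.
by_mapping = df.groupby({"r0": "g1", "r1": "g1", "r2": "g2", "r3": "g2"})["n"].sum()

# A Series aligned with the frame's index.
by_series = df.groupby(df["n"] > 15)["n"].sum()

print(by_label.to_dict())  # {'x': 40, 'y': 60}
```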

@cpcloud
Member

cpcloud commented Apr 23, 2014

👍 nice list

@mcjcode
Contributor

mcjcode commented Aug 29, 2014

Question: Are all of the items on the SeriesGroupBy and DataFrameGroupBy whitelists meant to pass through to the underlying Series and DataFrame members? From what I can tell there are many items on the respective whitelists that are handled before they can ever be so dispatched. Here is a list:

first, last, min, max, sum - all defined as members in the GroupBy class definition (using _groupby_function). (GroupBy.prod is also defined in this way, but is not whitelisted.) And boxplot is also a defined member of DataFrameGroupBy (but not SeriesGroupBy).

count, cumcount, head, tail, mean, median - all defined, with def keyword, in the GroupBy class definition.

I ask because I'm proposing to define the whitelisted methods at class definition time, instead of relying on GroupBy.__getattr__ to create them dynamically when they are invoked. Sphinx would then generate documentation for all passed-through methods, just as it does for the ones that have first-class definitions in the class.

But doing so often overrides the explicit method definitions listed above, and sometimes the two don't even do the same thing. This makes me suspect that these dozen or so whitelisted methods should no longer be handled this way (i.e. they already have explicit class definitions that take priority).

So, can anyone think of a reason why we shouldn't just remove these names from the {Series,DataFrame}GroupBy whitelists? If not, I'll make a pull request addressing #6944 that uses this as part of the solution (with an appropriately updated test_groupby:test_groupby_whitelist.)
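
The proposal above can be sketched in plain Python. Everything here (Values, Grouped, WHITELIST, _make_wrapper) is a hypothetical stand-in and not the pandas implementation; it only shows wrappers being attached at class definition time, with docstrings copied over and explicitly defined methods left alone.

```python
import functools


class Values:
    """Hypothetical stand-in for the wrapped Series/DataFrame."""

    def __init__(self, data):
        self.data = list(data)

    def total(self):
        """Return the sum of the values."""
        return sum(self.data)

    def largest(self):
        """Return the largest value."""
        return max(self.data)


WHITELIST = ("total", "largest")  # hypothetical whitelist


def _make_wrapper(name):
    target = getattr(Values, name)

    @functools.wraps(target)  # copies __name__ and __doc__ from the target
    def wrapper(self, *args, **kwargs):
        # Apply the wrapped method to each group separately.
        return {
            key: target(Values(vals), *args, **kwargs)
            for key, vals in self.groups.items()
        }

    return wrapper


class Grouped:
    """Hypothetical stand-in for a GroupBy class."""

    def __init__(self, groups):
        self.groups = groups

    def largest(self):
        """Explicit definition that must not be overridden."""
        return "explicit"


# Attach wrappers once, skipping names that already have explicit definitions.
for _name in WHITELIST:
    if _name not in vars(Grouped):
        setattr(Grouped, _name, _make_wrapper(_name))

print(Grouped.total.__doc__)  # Return the sum of the values.
```

Because the wrappers live on the class itself, introspection tools can see them and their docstrings without an instance being created.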

@jreback
Contributor

jreback commented Aug 29, 2014

By definition, the cythonized functions (and a few others which can sometimes be cythonized) are defined inline on the groupby objects (e.g. min, mean, sum, etc.), so these are not passed through.

You can define the whitelisted methods in the classes (and just have __getattr__ raise an error, I suppose), which would also define the docstrings. If a method exists in the class, it is called before __getattr__ is ever called, so of course these override the whitelist.

prod is probably an error, in that it is not defined.
Some methods only work on frames and not on series, so that is done on purpose (boxplot).
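
The lookup order described above is standard Python behaviour and can be checked with a small sketch. The Demo class is hypothetical and only mimics the shape of the situation.

```python
class Demo:
    """Hypothetical class with one explicit method and a dynamic fallback."""

    def mean(self):
        return "explicit mean"

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails.
        if name in {"mean", "cumsum"}:
            return lambda: f"dispatched {name}"
        raise AttributeError(name)


d = Demo()
print(d.mean())               # explicit mean
print(d.cumsum())             # dispatched cumsum
print("cumsum" in dir(Demo))  # False
```

The last line also shows why dynamically dispatched names are hard to document: they are not attributes of the class, so introspection does not find them.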

@mcjcode mcjcode mentioned this issue Sep 10, 2014
@jreback jreback modified the milestones: 0.15.0, 0.15.1 Sep 11, 2014
@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@TomAugspurger
Contributor

I think this will be easier with Python 3, since we can rewrite function signatures without much hassle (something like def rewrite_axis_style_signature(name, extra_params)). Then we can get proper signatures instead of **kwargs.
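
A minimal sketch of the idea, assuming the standard-library inspect module is used for the rewrite. The with_signature decorator and the Grouped class are hypothetical; this is not the rewrite_axis_style_signature helper mentioned above.

```python
import inspect


def with_signature(params):
    """Hypothetical decorator: advertise explicit keyword-only parameters
    on a method that actually accepts **kwargs."""

    def decorate(func):
        sig_params = [
            inspect.Parameter("self", inspect.Parameter.POSITIONAL_OR_KEYWORD)
        ]
        sig_params += [
            inspect.Parameter(name, inspect.Parameter.KEYWORD_ONLY, default=default)
            for name, default in params
        ]
        # inspect.signature() honours an explicit __signature__ attribute.
        func.__signature__ = inspect.Signature(sig_params)
        return func

    return decorate


class Grouped:
    @with_signature([("axis", 0), ("skipna", True)])
    def sum(self, **kwargs):
        return kwargs


print(inspect.signature(Grouped.sum))  # (self, *, axis=0, skipna=True)
```

Only the advertised signature changes here; the method still receives whatever keyword arguments the caller passes.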

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@lmorato

lmorato commented Feb 21, 2023

CAN I HELP?
(as former professor, good at scrutinizing long and "boring" documents.)
Hi, everybody,

my name is Lucas. I have a PhD in development economics and am a "greener" Python enthusiast. Unfortunately, I don't have the time right now to look at @jorisvandenbossche's comments with the care they deserve, but I did want to make my own remark on something I noticed that probably needs a small adjustment (I'm also sorry for not having had time to read the other posts, which I intend to do later).

On the pandas website page "pandas.core.groupby.GroupBy.agg", which seems to me to be of great importance to people starting to wrap their minds around Python's way of referencing and slicing different ranges, there is no mention of how the code works. Glancing over the source code for the function, I noticed that 'agg' is short for 'aggregate'. But since people on the web seem to be using the short version of the method, wouldn't it be nice to have a redirect link for new students? Moreover, I also noticed that the following section of the website, titled "pandas.core.groupby.SeriesGroupBy.aggregate", contains info that seems relevant to the previous section.

This seems to have some connection with the third issue mentioned in the original post, which is: "put all relevant docstrings (...) eg now the aggregate (...) docstrings of GroupBy are empty, but are more elaborate in the subclasses)". Now, I am still not very good at coding, but, as a former college professor in Brazil, I can say with relative confidence that I am good at reading what most people consider "boring" stuff, and at finding small things others don't pay attention to. That being said, I would like to know how I can help with this issue. Thank you. Have a nice day!
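
For readers wondering about the relationship between the two names, a quick check with made-up data shows that the short and long spellings give the same result:

```python
import pandas as pd

# Toy data, invented for this illustration.
df = pd.DataFrame({"key": ["a", "a", "b"], "val": [1, 2, 3]})
g = df.groupby("key")["val"]

short = g.agg("sum")        # short spelling
long = g.aggregate("sum")   # long spelling

print(short.equals(long))  # True
```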

@mroeschke
Member

It appears most of the issues in the original post have been addressed, so closing. If there are specific groupby doc issues, it would be better to open specific issues.

7 participants