-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
DISC: Validating internal docstring presence, completeness and style #32773
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
To start the discussion, I propose that we adopt google style guide for internal docstrings/code comments. http://google.github.io/styleguide/pyguide.html#38-comments-and-docstrings We could also considering using pydocstyle http://www.pydocstyle.org/en/latest/ to validate docstrings against this style. (although may involve technically difficulties having two standards) |
What aspect of the google style guide do you propose to adopt? Or the full section about docstring comments? If that is the case, I don't really see why we would start using a different format to document parameters for internal functions vs public functions. |
The driver for adopting a different format, would be primarily to make it easier to have code checks for docstring coverage/quality for all of the codebase not just the public api. Is it feasible to use the current validation on all docstrings? The google style also includes styles for module docstrings and pydocstyle allows these to be validated. With the introduction of type annotations, having the types in the docstring is redundant (for internal functions) and duplication can allow inconsistencies if there is no checks to keep them in-sync. In the google style for Args, "List each parameter by name. A description should follow the name, and be separated by a colon and a space." so no need to repeat types. To enforce a style, we need validation. Feasibility on using our current validation across the codebase could either be discussed here or raise a new issue. There is #15580 but "this issue focused on public functions" #15580 (comment) |
I would like the outcome of this discussion to be either the current "pandas docstring guide" renamed "pandas api docstring guide" with a second, much more lightweight guide.. "pandas internal docstring guide" or rewrite the current guide to acknowledge the distinction. #32033 (comment) |
adopting "section 3.8 Comments and Docstrings" in it's entirety means that "pandas internal docstring guide" is effectively already written with pydocstyle providing the validation. |
Would it be viable to maintain a single standard and gradually expand it to more files/functions? |
I opened this issue primarily in order to improve coverage, so my motivation is to achieve greater breadth rather than depth. I therefore feel that having a more lightweight standard is more likely to achieve that goal. I do see issues with having a different standard and therefore opened the discussion with the google suggestion for feedback. I see some middle ground with the single standard approach, by just adopting a subset of the current standard for the internal docstrings. Potentially, we could then modify and use the current validation on the subset. We could take the subset approach a bit further changing say the requirement for types where type annotations exist. However, too may exceptions and we are in danger of creating "yet another" standard. In this were to be the case, adopting an exisiting standard, say the google standard may be more preferable. |
I'm not sure I understand the distinction. Can you give an example?
This relates to a topic I've been thinking about: should we treat annotations as "this is what you should pass" or "you can safely assume that this is what is passed"? e.g. in core.indexing we do the latter |
vs
I think that time/effort creating/reviewing docstings for functions that are less complex/more obvious would be better spent on more complex/less obvious functions. A more lightweight standard would hopefully reduce the time/effort creating/reviewing docstrings. |
There's also another.. "this is what you could pass". When I'm adding annotations, i go though the function and the inputs of the functions called to try and establish what types are acceptable. I think there maybe some merit in being more restrictive on the type accepted, i.e. should pass. This is not the place for that discussion though. If there is not already an issue for this, maybe we should have one. |
that is effectively what MonkeyType does. https://pypi.org/project/MonkeyType/ |
First, I am -1 on using a different (conflicting) style guide for internal docstrings than for external ones. I personally don't see the advantage of that.
That's a valid concern, but how does that relate to the styling of the docstrings? Because you have the feeling that with numpydoc style too much time goes to formatting details? And with google style this would be less? Note that with numpydoc standard, we can also have a subset of the rules to check for internal docstrings. We don't need to check every detail of it if we don't want to. And insn't that basically what your PR #32033 did? It checked some general formatting issues (using pydocstyle) that didn't conflict with numpydoc. |
mainly just against the formatting of args requiring types and conforming to the numpydoc type notation which is a different standard to the python typing notation used for type annotations.
That's basically the aim of this issue and what i'd like to get consensus agreement on. A more lightweight standard. i.e. not to need to meet all the requirements of numpydoc for an internal docstring. (to apply when reviewing/creating)
basically, yes. using a subset of error codes in the first instance that are compatible with numpydoc. If you recall there was a review request to update the docs 😉 Hopefully this discussion can provide the content, see #32773 (comment) However, #32033 didn't cover the type annotation/docstring redundancy issue. |
Numpydoc doesn't require types (see https://numpydoc.readthedocs.io/en/latest/format.html#sections, it has an example of a parameter without type), it's only our internal validation that requires this. We could perfectly well decide to relax this.
I think what would be interesting for the discussion then is to do a concrete proposal of which rules you want to relax or which rules ot keep. |
I share this same concern. There's already a rather large cognitive load to contributions in terms of guidelines and checks that is a lot to remember as a maintainer, and probably off-putting to new contributors. So OK with validating internal docstrings, but I think should play by the same rules as all other docstrings in the code base |
Agreed. This issue is intended to address that. whether a solution is to have a more lightweight standard or use the existing standard for internal docstrings, adding automated validation of internal docstrings is key to this discussion and the checks need to be part of the CI so that maintainers don't need to check conformance to docstring style. I should have made that more clear and included this in the OP as part of the sustainability argument. |
If we keep the current standard/validation for docstrings, I wonder whether it is feasible to run validate_docstrings on any methods/functions touched in a PR on CI. I suspect this could be quite disruptive. |
In the first instance, I think i'll run validate_docstrings on internal docstrings on a the odd PR from time to time that changes an internal docstring and see what we feel is inappropriate and maybe we can discuss which checks could be relaxed using specific cases, rather than select a subset from the outset. see #32769 |
Did we ever fix our user facing docstrings to adhere to the docstring standards? I know @datapythonista used to run metrics on that; not sure if there is a current list FWIW I think worth finishing that off before diving into standards for the internal functions, as there is probably lessons to be learned |
A lot of progress was made, but we're far from done with the public API. @martinagvilas and @galuhsahid have a dashboard on the current state. Can you share the link please? |
I think there are plenty of on-line resources and experience shared on how to use pandas. On the other hand, my concern is from a project sustainability perspective.
for me, getting the internal functions documented is just as important as the external api. |
But what is important for those internal functions is that they have a proper, understandable explanation of what they are doing / what they are used for. This is something that needs human review. On the other hand, most of the things that a style checker does (is there a space after the colon, did it use proper punctuation and capitalization, whitespace, ...), I care about much less for internal docstrings (they also don't need to render properly online). So in that light, I would say: yes, for fully conforming to the docstring validation script, let's first focus on the public docstrings. And at the same time, we reviewers need to pay attention that also internal docstrings have a meaningful explanation. |
Here is the link to the dashboard inspecting docstrings errors in the public API that @datapythonista mentioned. The current errors are updated every hour. |
Agreed, a lightweight set of checks could therefore be that a docstring exists and that if a parameters section is included that all parameters are included and that each have a description.
The majority view seems to be that we should have the same standard for internal/external docstrings. #32773 (comment), #32773 (comment) and #32773 (comment). I guess that we can keep the same standard for internal docstrings even if we don't validate against the complete standard on CI for internal docstrings.
makes sense. we could keep the validation minimal, say for the presence of a docstring and completeness of the parameters section. Conformance to the complete standard is at the discretion of the reviewer. |
Thanks for posting @martinagvilas |
Thanks @simonjayhawkins for mentioning this on the call today. I'd like to make the following proposal.
|
+1 on @rhshadrach proposal |
We currently have guidelines/checks for docstrings https://pandas.pydata.org/docs/development/contributing_docstring.html for the api facing methods/functions following the numpydoc docstring guide https://numpydoc.readthedocs.io/en/latest/format.html. since we also use sphinx for building the on-line documentation.
Many of our internal functions don't have docstrings and this could be a barrier to new contributions. For the more complex/undocumented internals, I am concerned that this may also affect the sustainability of the project if only a handful of contributors can grok the code.
The pandas docstring guidelines could be considered too heavyweight for internal functions and therefore may be factor in the incomplete coverage of docstrings for internal functions.
The text was updated successfully, but these errors were encountered: