Skip to content

DOC: factor out multiple instances of dtype_backend parameter descriptions into _shared_docs #53881

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 9 commits into from

Conversation

tpaxman
Copy link
Contributor

@tpaxman tpaxman commented Jun 27, 2023

@tpaxman tpaxman requested a review from rhshadrach as a code owner June 27, 2023 13:31
@mroeschke
Copy link
Member

Looks like code checks are failing (also ensure you have the pre-commit checks configured)

@mroeschke mroeschke added the Docs label Jun 27, 2023
@tpaxman
Copy link
Contributor Author

tpaxman commented Jun 27, 2023

Thanks @mroeschke I need to have a look at it later

@tpaxman tpaxman marked this pull request as draft June 28, 2023 05:30
@tpaxman tpaxman marked this pull request as ready for review July 2, 2023 01:51
Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for your PR

has this been discussed anywhere? to be honest I'm not really keen on adding such complexity to docstrings.
I'm all for making reusable functions in source code, but in docs and tests, repetition isn't too bad

@tpaxman
Copy link
Contributor Author

tpaxman commented Jul 6, 2023

@MarcoGorelli, thanks for the response! I don't think it was discussed specifically anywhere (at least that I know of). Mainly, I was just noticing that some parameter descriptions had been defined in pandas/core/shared_docs.py (e.g., the description for the compression parameter used in a few of the IO functions is defined there in _shared_docs["decompression_options"]).

That got me thinking that maybe the dtype_backend description would be a good candidate for the same implementation, i.e. defined in shared_docs and referenced by format strings, formatted by the @doc decorator , in each of the 16 functions where it is used.

Actually the original reason I ended up going down this path was that I meant to suggest some small edits to the wording of the dtype_backend parameter description in read_csv, such as formatting and function cross-referencing, and then I noticed that the parameter is used in many places always with the same description and didn't want to make them inconsistent with one another.

All that being said, I am totally fine to not go this direction with the docstring and I definitely agree that too much string manipulation and preprocessing to create each docstring might just obfuscate the code's inherent documentation and I think over-coupling may definitely be a concern here, especially in cases where functions evolve in their own way and maybe parameters, whose implementation details were initially the same across each function they are associated with, will eventually come to mean slightly different things. In those cases, the shared docstring would have to be unshared again so that the docstring description is specific to one particular function and not the set of all functions it appears in.

The counterpoint to that might be that because these are all IO functions, maybe the parameter is expected to produce identical, or at least analogous, behaviour across the whole suite of IO functions, and would lend itself well to being defined in a consistent way in each function, such that "sharing" that piece of the docstring makes sense. But again like you say I'm not sure if it's worth the trouble or the complexity of defining it in that way.

The middle ground might be having an expectation that any PR-with a docstring edit could require that the change be made such that all similar parameters in related functions get the same treatment, i.e. in this case I'd just submit a PR where any formatting changes or edits to the dtype_backend parameter would just be replicated across the code base rather than factoring the paragraph out, which of course leaves more room for error given the duplication but does leave the docstring framework generally more flexible.

The bottom line is I'm happy to help with whatever your group would prefer! Just glad to contribute little bits wherever it helps.

@MarcoGorelli
Copy link
Member

Thanks for explaining

OK yeah, if it's repeated this number of times, it's probably fine to share the docstrings. I get a bit annoyed when I see complex decorators which save something from being written twice, but if it's 16 times, then you're right, that's a different story 😄

will take a closer look later, but should be good

@@ -352,13 +354,13 @@ def _parse_tfoot_tr(self, table):
"""
raise AbstractMethodError(self)

def _parse_tables(self, doc, match, attrs):
def _parse_tables(self, dom_doc, match, attrs):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems irrelevant (and I don't know if it affects anyone who might be subclassing) - can we revert this please?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion. I made this change because I thought there was some conflict between the doc parameter in the _parse_tables function and the @doc decorator being applied, but now that you've suggested to just go in and change each instance anyway, this change will certainly be reverted along with the others so no worries there.

@MarcoGorelli
Copy link
Member

I meant to suggest some small edits to the wording of the dtype_backend parameter description in read_csv, such as formatting and function cross-referencing, and then I noticed that the parameter is used in many places always with the same description and didn't want to make them inconsistent with one another.

All that being said, I am totally fine to not go this direction

Might be simpler to just do this to be honest, docstring substitutions are a common source of confusion (not just for new contributors, but for anyone)

It's also harder to review

Could you just make the changes to each place the description appears please?

@rhshadrach
Copy link
Member

Keeping the documentation consistent is certainly an issue. It doesn't seem to me Python has sufficient tooling (including third party libraries, but I've only searched a little) to keep documentation DRY. Another way one can enforce consistency is via tests / linters. I've been wondering how difficult it would be to parse the docstrings into sections and detect when an argument is reused with different descriptions. Would need to have mechanisms to ignore cases and probably be opt-in rather than opt-out.

@tpaxman
Copy link
Contributor Author

tpaxman commented Jul 13, 2023

Thank you, @MarcoGorelli and @rhshadrach for your feedback on this. It definitely seems like a tricky aspect of code maintenance, especially when many functions share parameters and have similar functionality.

Based on my understanding of @MarcoGorelli's suggestion, I have abandoned the approach used in this pull request and have submitted a fresh PR where I have edited each instance of the dtype_backend parameter description such that they all match (see PR #54104). I was also wondering whether choosing to go this direction here means that it would be desirable to 'revert' cases where we are using a string defined in shared_docs as a format string input across multiple function docstrings (e.g., _shared_docs["decompression_options"] for the compression parameter) such that the docstrings each just have an identical repeat of the approved description.

This relates to the suggestion from @rhshadrach about maintaining consistency for parameters across multiple functions, which I was thinking about some more: the idea of handling it via linters and checks definitely feels like the right way to go if the desire is to have the docstrings written directly within each function in a readable way (i.e., without much or any format string substitution to obfuscate meaning). I would be interested in maybe tackling the issue of parsing and splitting up any Sphinx docstring via some side-project tool maybe that could be used for developing a linter but wanted to see if there are already similar tools available.

An alternative I thought about would be taking this in the opposite direction, where instead of having plain-text docstrings in each function and using linters and checks to enforce consistency for shared/duplicated descriptions across functions whenever changes are made, the docstrings for public functions could instead be extracted out and maintained within a set of files dedicated to composing each docstring from a set of descriptions and snippets, many of which would be shared. The docstrings would then be attached to their respective functions via some decorator like @doc or @Appender. This approach would have the advantage of reducing duplication in written text and would provide a means to see the relationships between all the docstrings and would make it simpler for someone to make a parameter description change in a single place to see how it will propagate out to all associated docstrings, rather than having to know that docstring consistency will be maintained at the pre-commit stage which is a bit more of a black box maybe.

Both have their pros and cons and I am not sure which would be ideal. Interested to hear more thoughts and I also want to do a bit more research on potential solutions to this. It is interesting as @rhshadrach mentioned that Python does not seem to have great tooling for this problem that has to be addressed by any code base. Curious to hear thoughts on this.

@MarcoGorelli
Copy link
Member

superseded by #54104, so closing - thanks again!

@rhshadrach
Copy link
Member

the docstrings for public functions could instead be extracted out and maintained within a set of files dedicated to composing each docstring from a set of descriptions and snippets, many of which would be shared. The docstrings would then be attached to their respective functions via some decorator like @doc or @Appender.

I've wondered if this is a good path too, but haven't pursued it enough to know the answer. My experience with pandas docstrings makes me think it is quite hard to have a mix of hard coded vs composed parts, but perhaps with an advanced enough solution it's possible to make this ergonomic.

Another (quite minor) consideration is the impact of users with IDEs. For example, here is the docstring of DataFrameGroupBy.sum as reported by PyCharm.

image

For pandas, the bar is quite high to be added as a hard dependency. I would imagine such a mechanism would need to be vendored by pandas rather than installed as a third party library. Still, perhaps other packages could benefit from such a third party library.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DOC: The dtype_backend parameter description is repeated manually in 16 functions
4 participants