Make validate_docstrings.py ready for the CI #23481


Closed
datapythonista opened this issue Nov 4, 2018 · 5 comments · Fixed by #23514
Assignees: datapythonista
Labels: CI (Continuous Integration), Code Style (code style, linting, code_checks), Docs

Comments

@datapythonista
Member

Since #22408 was merged, validate_docstrings.py is able to validate all docstrings when called without parameters, returning the output as JSON.

The next step is to make the script ready for use in the CI, so we can fail the build when there are validation errors in the changes to a docstring.

The main changes required are:

  • Make the script return an exit status of 0 when no error has been encountered, and a different number (possibly the number of found errors) when something is wrong.
  • Generate the output on stderr in a format appropriate for the CI (currently it is JSON on stdout). As the code checks will soon be performed in Azure (BLD: Use Azure pipelines for linting #22844), the output format will ideally be customizable with a --format parameter, so we can call the script with the azure format (equivalent to flake8's --format parameter).
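A minimal sketch of the two changes above: exit status equal to the error count, and a --format switch for CI-friendly output. The function names, the tuple shape, and the azure output line are my assumptions for illustration, not the script's actual API (the ##vso[task.logissue] prefix is the logging command Azure Pipelines recognizes):

```python
import argparse
import sys

def render(errors, fmt):
    # `errors` is a hypothetical list of (obj_name, code, message) tuples
    lines = []
    for obj_name, code, message in errors:
        if fmt == "azure":
            # Azure Pipelines surfaces this logging command as a build error
            lines.append(f"##vso[task.logissue type=error]{obj_name}: {code} {message}")
        else:
            lines.append(f"{obj_name}: {code} {message}")
    return "\n".join(lines)

def main(errors, fmt="default"):
    if errors:
        print(render(errors, fmt), file=sys.stderr)
    # Exit status 0 when clean, otherwise the number of errors found
    return len(errors)

parser = argparse.ArgumentParser()
parser.add_argument("--format", default="default", choices=["default", "azure"])
args = parser.parse_args(["--format", "azure"])  # e.g. what the CI would pass
```

The CI would then call `sys.exit(main(errors, args.format))`, so any nonzero error count fails the build.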

As the number of errors is currently huge (8,404 errors in total), we'll have to add the validations to the CI in different stages. To allow that, the script will also have to support:

  • A --prefix parameter, accepting a string of the format pandas.Series.str, meaning that only the functions/methods... starting with that string will be validated
  • A --errors parameter, accepting an error code, or a comma-separated list of error codes, to be validated. This will also require codifying all the errors that we currently validate
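The two filters above could be sketched as a single selection step; this helper and its data shape are hypothetical, illustrating the proposed --prefix / --errors semantics rather than the script's internals:

```python
def select_docstrings(all_objects, prefix=None, errors=None):
    """Filter (obj_name, error_code) pairs by name prefix and error codes.

    ``errors`` mirrors the proposed comma-separated CLI value, e.g. "PR01,PR02".
    """
    wanted = set(errors.split(",")) if errors else None
    result = []
    for obj_name, code in all_objects:
        if prefix and not obj_name.startswith(prefix):
            continue  # outside the requested namespace
        if wanted is not None and code not in wanted:
            continue  # error code not selected for validation
        result.append((obj_name, code))
    return result
```

With --prefix="pandas.Series.str" only `pandas.Series.str.*` objects survive the first check, and --errors="PR01,PR02" keeps only those error codes.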

Maybe it'd also be useful to have something like a --exclude parameter, but we'll leave that for later, once it's needed for a specific case.

@jorisvandenbossche @WillAyd @jbrockmendel any feedback on this before I start working on it?

@datapythonista datapythonista added the Docs, CI, and Code Style labels Nov 4, 2018
@datapythonista datapythonista self-assigned this Nov 4, 2018
@WillAyd
Member

WillAyd commented Nov 4, 2018

Makes sense to me. As far as the prefix goes, do you see a logical way to target particular modules / package structures? Anything lower than that would probably be too detailed (even module-level might be), so if we identify a package to start with, we could open it as a particular issue for the community to solve as a precursor.

@datapythonista
Member Author

My first thought was to use files, as we do for the doctests. Then I realized that we don't get the docstrings from the files by introspection; instead, we get the list of public objects from the docs as strings. So, my idea is to simply have an if obj_name.startswith(prefix):.

But I think the best we can do is start with something simple, and then see in practice whether we need anything else. Maybe by adding a - before the prefix we could exclude those patterns. But I'll see when I have it implemented and try the first real cases.
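The startswith idea, plus the speculative `-`-for-exclusion extension mentioned above, could look roughly like this (a sketch of the discussion, not an implemented feature):

```python
def matches(obj_name, prefix):
    """Prefix match, where a leading '-' inverts the match to an exclusion.

    Hypothetical helper: '-pandas.Series' would mean "everything except
    objects under pandas.Series".
    """
    if prefix.startswith("-"):
        return not obj_name.startswith(prefix[1:])
    return obj_name.startswith(prefix)
```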

@jorisvandenbossche
Member

So, my idea is to simply have an if obj_name.startswith(prefix):.

That seems fine to me. Or, more generally, it could be a pattern. Not sure such a more advanced selector will be needed, though (maybe a use case would be to select/deselect a single method name across Series/DataFrame(/Index); this might even be useful for interactive usage when working on a docstring).
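The "same method across Series/DataFrame" use case maps naturally onto shell-style glob patterns; a sketch using the standard library's fnmatch (just to illustrate the idea, not a proposal for the script's API):

```python
from fnmatch import fnmatch

# e.g. select the same method name across Series, DataFrame and Index
pattern = "pandas.*.head"
names = ["pandas.Series.head", "pandas.DataFrame.head", "pandas.Series.tail"]
selected = [name for name in names if fnmatch(name, pattern)]
```

Note that fnmatch's `*` also matches dots, so `pandas.*.head` would match arbitrarily deep names; a stricter selector would need a regex.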

@jorisvandenbossche
Member

One general concern: how do you envision using it on the CI? How do you think we will specify which functions to run and which errors to check?

Because with the outline above (and the PR), you can specify a subset of functions and/or a set of error codes to check.

But I can imagine that at some point we could need higher granularity (like: for this function ignore this error code, for that function ignore that error code, ...). In a similar way that you can skip a specific doctest with a flag, or have special comments to overrule flake8 in specific cases.

Or is the goal that every docstring will pass the validation completely? (I think we currently have some cases where we are fine with a certain error, because of limitations in the script, or because sometimes making an exception makes something clearer / more readable, etc.)

@datapythonista
Member Author

That's an interesting discussion. I don't know right now, I think we'll have to experiment a bit once this is ready.

The first thing I'd like is to validate specific errors. For example, parameter mismatches and PEP8 issues in the examples. I need to see how many of them we have, but if there are few for Series, I'll create issues for them, and once fixed, add them to the CI:

$ ./scripts/validate_docstrings.py --prefix="pandas.Series." --errors="PR01,PR02,PR03,EX03"

Excluding docstrings from the validation is likely to be useful. And I'd also like to be able to say --errors=EX to validate all the example errors. But I prefer to take care of those in the PR where the script starts to be used in the CI (this one is already complex enough, and adding it to the CI is when I'll really know what else I need).
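The --errors=EX idea (a category selector rather than a full code) could be handled by treating each selector as either an exact code or a prefix; a hypothetical sketch of that matching rule:

```python
def error_selected(code, selectors):
    """True if an error code matches any selector.

    A selector may be a full code ("EX03") or a category prefix ("EX"),
    mirroring the --errors=EX idea from the discussion. Illustrative only.
    """
    return any(code == s or code.startswith(s) for s in selectors)
```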

Does this make sense?
