
DOC: Validate consistency of title capitalization #26941


Closed
datapythonista opened this issue Jun 19, 2019 · 12 comments · Fixed by #31114
Labels
CI Continuous Integration Docs good first issue

Comments

@datapythonista
Member

In #26933, we're making the capitalization of section titles consistent. We used to have many titles capitalized as This is the Section Title, and we changed all of them (a few were probably missed) to This is the section title.

To keep this consistency, we should validate in the CI that the capitalization is correct. This can be done by extracting all the titles and making sure that the only uppercase letters are the first letter of the title and those in words defined in a short list, like Series, DataFrame,...

I think this can be done in two ways:

  • As a sphinx extension that validates the titles as they are processed, and generates warnings if they are not (this will automatically fail the CI).
  • As an independent script

The first option should be simpler if this can be implemented as a sphinx extension, but I'm not sure if that's the case.

@jreback
Contributor

jreback commented Jun 21, 2019

The main things to note here are proper names, e.g. PyTables, Python and IPython (probably some others).

@martinagvilas
Contributor

I would like to give this a try, if that's ok.

@tonywu1999
Contributor

Is this still an open issue?

@datapythonista
Member Author

Yes, still pending, and would be great to get this fixed. Thanks!

@tonywu1999
Contributor

tonywu1999 commented Jan 14, 2020

When you mention validating as a sphinx extension, do you mean creating a custom extension (like that done in the file 'doc/sphinxext/contributors.py')?

I'm also a little confused on the type of extension to create. I've read about Sphinx roles, Directives, and Builders, but I'm not sure if there's any specific one I should choose for this situation.

@datapythonista
Member Author

I don't know much about sphinx extensions, and I find sphinx itself very confusing. But yes, I assumed that since sphinx is already parsing all the files, it could be possible in a custom extension like the contributors.py one to validate the titles.

But an independent script that parses all the files, extracts all the titles, and reports any with an unwanted capitalization is also an option.

@tonywu1999
Contributor

take

@tonywu1999
Contributor

At this point, I have a Python script that, given a .rst file, parses it, identifies titles from the resulting doctree, and determines which titles do not follow the capitalization convention mentioned above.
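As a rough illustration of the title-extraction step, here is a minimal sketch. It does not use the doctree described above; it swaps in a plain-text heuristic (a non-empty line followed by an adornment line of one repeated punctuation character), and the names `ADORNMENT` and `find_titles` are hypothetical.

```python
import re

# Assumption: only a handful of common adornment characters are checked.
ADORNMENT = re.compile(r"^([=\-~^#*+`])\1{2,}\s*$")


def find_titles(rst_text):
    """Yield (line_number, title) for each line that is underlined.

    This is a heuristic, not a real reStructuredText parser.
    """
    lines = rst_text.splitlines()
    for i in range(1, len(lines)):
        title = lines[i - 1].strip()
        underline = lines[i].rstrip()
        if (
            title
            and ADORNMENT.match(underline)
            and not ADORNMENT.match(title)  # skip the overline itself
            and len(underline) >= len(title)
        ):
            # lines[i - 1] is the title, so its 1-based line number is i
            yield i, title
```

A parser-based approach, like the doctree one described in the comment, would handle edge cases (literal blocks, directives) that this heuristic can misread.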

I've been using the doc/source/development/contributing.rst file as a test file to see if my code is working fine. When testing, I noticed my code labeled these titles as not following the capitalization convention:

Code Base:
Pre-Commit
Type Hints
Style Guidelines
Pandas-specific Types
Validating Type Hints

Before moving on, I was wondering if these titles are special in any way (i.e. proper names, etc.) or if they simply do not follow the capitalization convention.

Also, is there an easy place to find proper names? I thought of looking through the pandas API reference (https://pandas.pydata.org/pandas-docs/stable/reference/index.html), but I wasn't sure how to approach finding proper names in that document.

Thanks!

@datapythonista
Member Author

Thanks @tonywu1999, that sounds great. Those titles don't have anything special, and should be changed.

We don't have a list of proper names we want capitalized; we'll have to build that list dynamically as we validate titles. Jeff mentioned a few as examples: PyTables, Python and IPython. But I'm not even sure those appear in titles.

I think the way to move forward is to open a PR with your script and use it to validate a couple of files from ci/code_checks.sh. You'll have to fix the titles in those files for the CI to pass, so we can merge the PR.

Once your PR is merged, we can open issues to fix and validate the rest of the files in the docs. Other people can help with this; there is a significant number of titles to change.

I'd have your script accept a file, a list of files, or a directory to search recursively. So, these cases would all be valid:

./scripts/validate_rst_title_capitalization.py doc/source/index.rst
./scripts/validate_rst_title_capitalization.py doc/source/index.rst doc/source/ecosystem.rst
./scripts/validate_rst_title_capitalization.py doc/source/

Initially, we'll validate just a subset of files, and when we're done we'll just call the last command.
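One way to support all three invocations is to expand every argument into a list of .rst files before validating. This is only a sketch; the helper name `find_rst_files` is hypothetical and not taken from the thread.

```python
from pathlib import Path


def find_rst_files(paths):
    """Expand files and directories into a sorted list of .rst paths.

    Directories are searched recursively; other paths are kept only if
    they end in .rst (this sketch does not check that they exist).
    """
    found = set()
    for raw in paths:
        path = Path(raw)
        if path.is_dir():
            found.update(path.rglob("*.rst"))
        elif path.suffix == ".rst":
            found.add(path)
    return sorted(found)
```

Collecting into a set means a file passed both explicitly and through its directory is validated only once.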

For the exceptions, I think the easiest is that your script has something like:

CAPITALIZATION_EXCEPTIONS = [
    "pandas",
    "NumFOCUS",
    "Python",
    ...
]

The words on the list will have to be in the exact capitalization as defined, no matter if they are the first word of the title, or a following one. The rest of the words should have the correct capitalization Xxxxx xxxxx xxxxx.
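The rule above could be sketched roughly as follows. The exception set contains only the three names shown in the example list, and `expected_title` and `is_correct` are hypothetical helper names. Words are split on whitespace, so an exception with attached punctuation (for example "Python,") would not be recognized by this sketch.

```python
# Assumption: only the names given as examples in the comment above.
CAPITALIZATION_EXCEPTIONS = {"pandas", "NumFOCUS", "Python"}


def expected_title(title, exceptions=CAPITALIZATION_EXCEPTIONS):
    """Return the title as the convention would write it."""
    by_lower = {word.lower(): word for word in exceptions}
    words = []
    for position, word in enumerate(title.split()):
        if word.lower() in by_lower:
            # Exceptions keep their exact spelling in any position.
            words.append(by_lower[word.lower()])
        elif position == 0:
            words.append(word[:1].upper() + word[1:].lower())
        else:
            words.append(word.lower())
    return " ".join(words)


def is_correct(title, exceptions=CAPITALIZATION_EXCEPTIONS):
    """True when the title already follows the convention."""
    return title.split() == expected_title(title, exceptions).split()
```

For instance, `expected_title("Type Hints")` gives `"Type hints"`, while `"pandas ecosystem"` is left unchanged because pandas is on the list.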

Does all this make sense?

@tonywu1999
Contributor

I have a couple questions regarding your comment:

  1. Would the python script always be executing from the root/base of the repository? And is the script supposed to run without putting the keyword "python" in front of the command?
    (Ex: python ./scripts/validate_rst_title_capitalization.py doc/source/index.rst )
  2. What should I do when my script catches a title that is improperly formatted? Should I print the title? Output a warning/error message?
  3. What do you mean by validating a couple titles from ci/code_checks.sh? I looked inside that file and I'm not exactly sure what's going on in that file.

@datapythonista
Member Author

For (1), I'd use the scripts in scripts/validate_*.py as a reference. I would prefer not to assume the script is executed from any particular directory (it's easy to avoid that assumption). And I'd make the script executable; that's also easy.

For (2), you should output to the terminal (CI logs) a message as descriptive as possible, so that whoever finds it in the CI can easily understand and fix the problem. You also need to make the script return a nonzero exit code, so the process fails and the CI fails. Again, you can use the mentioned scripts for reference.
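The reporting behavior described in (2) might look something like this sketch. The message format and the function name `report` are assumptions, not the format used by the existing validation scripts.

```python
def report(errors):
    """Print one descriptive line per bad title and return an exit code.

    `errors` is a list of (path, line_number, title, suggestion) tuples.
    The return value is meant to be passed to sys.exit(), so any error
    makes the process, and therefore the CI step, fail.
    """
    for path, line_number, title, suggestion in errors:
        print(f'{path}:{line_number}: title "{title}" should be "{suggestion}"')
    return 1 if errors else 0


# Hypothetical usage at the end of a script (after `import sys`):
#     sys.exit(report(errors))
```

Including the file, the line, and the suggested spelling in each message lets someone reading the CI log fix the title without rerunning the script.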

In (3) I meant that once you've got the script and you open the PR, you can add a couple of calls to your script in ci/code_checks.sh, to start validating the first files. Like calling scripts/validate_capitalization.py doc/source/getting_started/install.rst, and maybe another file. This way we can see the script in action in the CI, together with the code in the PR. I'll also ask that initially you validate a file with errors (titles with wrong capitalization), so we can see in the CI that the script works as expected, and what the error messages look like with real examples. After we see that, you'll have to fix the errors in the file, or remove the file from the validation, so the CI is green and we can merge.

Thanks!

tonywu1999 pushed a commit to tonywu1999/pandas that referenced this issue Jan 17, 2020
tonywu1999 pushed a commit to tonywu1999/pandas that referenced this issue Jan 17, 2020
tonywu1999 pushed a commit to tonywu1999/pandas that referenced this issue Jan 18, 2020
tonywu1999 pushed a commit to tonywu1999/pandas that referenced this issue Jan 18, 2020
@tonywu1999
Contributor

Hi, I recently committed and made a pull request with the new script ( #31114 ), but I encountered multiple issues.

One big issue I'm having is suppressing the output of helper functions that I imported. In my script, I had created a context manager to suppress output, which worked when I ran the script on my local machine, but did not work on GitHub when code_checks.sh ran. Is there any way I can suppress the output of other helper functions?
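For reference, a context manager of the kind described could be sketched as below. This is not the code from the pull request, and `suppress_output` is a hypothetical name. It only redirects writes that go through Python's `sys.stdout` and `sys.stderr` at call time; output written directly to the underlying file descriptors, or through a stream reference saved before the block was entered, is not captured.

```python
import contextlib
import io


@contextlib.contextmanager
def suppress_output():
    """Silence Python-level writes to sys.stdout and sys.stderr.

    The captured text is yielded so it can still be inspected.
    """
    sink = io.StringIO()
    with contextlib.redirect_stdout(sink), contextlib.redirect_stderr(sink):
        yield sink
```

Wrapping a noisy helper call in `with suppress_output():` keeps its messages out of the terminal, subject to the limits noted above.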
