-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: Use flake8 to check for PEP8 violations in doctests #23399
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Use flake8 to check for PEP8 violations in doctests #23399
Conversation
Signed-off-by: Fabian Haase <[email protected]>
Hello @FHaase! Thanks for submitting the PR.
|
@FHaase : Thanks for the PR! Given your first comment in the description, I would recommend opening an issue first. That's generally what we would recommend instead of jumping right to the PR. |
Thanks @FHaase, this looks much better. Can you mention the issue you close, and the other PR related to this in the description please. I just saw couple of things. Doesn't feel like we should be using a module named legacy. I guess you checked it, but doesn't flake8 have a newer module if they call that one legacy? Also, feels like when we validate a docstring, we validate all the pep8 problems in the file, right? We should find a way to validate a single docstring. When this is done, we should also implement tests for this, as the rest of validations. In any case, good work, this will be very helpful once is done. |
Yes I mentioned it. The only way of fixing this might be to create a temp-file just with the docstring in it and run flake8 on that? |
Signed-off-by: Fabian Haase <[email protected]>
setup.cfg
Outdated
./pandas/computation | ||
./pandas/core | ||
./pandas/errors | ||
./pandas/io |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a number of these have been removed recently, can you edit
* updated folders excluded from flake8 --doctests * fixed failing doctests Signed-off-by: Fabian Haase <[email protected]>
* Examples in `test_validate_docstrings.py` lacked imports * Removed signature from function in generated file as not really needed at this point. * Used `clean_doc` instead of `raw_doc` to get correct indentation. * Added missing try: finally: block in `_file_representation()` Signed-off-by: Fabian Haase <[email protected]>
9d7b896
to
762de5d
Compare
Codecov Report
@@ Coverage Diff @@
## master #23399 +/- ##
=======================================
Coverage 92.23% 92.23%
=======================================
Files 161 161
Lines 51197 51197
=======================================
Hits 47220 47220
Misses 3977 3977
Continue to review full report at Codecov.
|
46cd0a4
to
844065c
Compare
Dug down into the code and basically recreated the legacy api. |
8b5c95d
to
9c09bf1
Compare
* added test * simplified temporary file usage * included flake8 by calling application directly * removed doctests from setup.cfg Signed-off-by: Fabian Haase <[email protected]>
9c09bf1
to
9bc7f65
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some comments, mainly about code clarity and simplicity. But looks good, I think we're almost there.
@@ -584,6 +587,29 @@ def prefix_pandas(self): | |||
pass | |||
|
|||
|
|||
class BadExamples(object): | |||
|
|||
def doctest(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think doctest
is not a very descriptive name if this test is to check whether pep8 should fail because of unknown np
. The rest of the example is very verbose to just check that. I'd use the description to explain what is this test about. And in the examples just with >>> np.nan
could be enough.
Then, couple more tests could be added, to test the typical pep8 issues we have, like >>> foo = 1+2
(missing spaces)...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason to use flake8 to check for pep8 issues was to not have to write all these tests. flake8 should be tested enough to work properly.
This test basically only verifies that flake8 is actually testing the docstrings.
scripts/validate_docstrings.py
Outdated
import tempfile | ||
from contextlib import contextmanager | ||
|
||
from flake8.main.application import Application as Flake8 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it complicates things unnecessarily importing objects other than modules. import contextlib
and import flake8.main.application
seems like the simplest and best way. And then use accordinly @contextlib.contextmanager
and app = flake8.main.application.Application
. This way, just by reading those lines it's obvious what is the module and what is being used, and there is not need to go to the imports and start decoding aliases.
scripts/validate_docstrings.py
Outdated
] | ||
|
||
@contextmanager | ||
def _file_representation(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This name is not very descriptive. I think this is the examples as temporary file, so something like _examples_as_temp_file
seems much clearer.
Not sure if it's worth having two separate methods, I'd generate the file in the method that validates pep8. But if we have two methods, I'd have this first, as this is being used by the other (pep8_violations
).
scripts/validate_docstrings.py
Outdated
Temporarily creates file with current function inside. | ||
The signature and body are **not** included. | ||
|
||
:returns file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we use numpydoc, not this syntax
scripts/validate_docstrings.py
Outdated
lines = self.clean_doc.split("\n") | ||
indented_lines = [(' ' * 4) + line if line else '' | ||
for line in lines[1:]] | ||
doc = '\n'.join([lines[0], *indented_lines]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit confused with this. I guess I'm missing something. If in our docstring we have something like:
Some desc.
Examples
--------
>>> val = 1+2
>>> print(val)
I'd expect to have in the temporary file to validate:
val = 1+2
print(val)
As we won't validate the description, or anything else. I think numpydoc returns the lines with code, so writing to the file should be a single line of code. Is there any advantage generating a function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well flake8 supports doctests by adding --doctests
as an argument. And the Examples follow the doctest standard.
Examples
An optional section for examples, using the doctest format.
Having to carve out the examples is not trivial and would require writing multiple tests. So creating a file with a dummy-function is a simple hack, getting ugly cause it has to restore the indentation of the docstring. One test of the BadExamples
class ensures it is working.
scripts/validate_docstrings.py
Outdated
@@ -446,7 +486,7 @@ def validate_one(func_name): | |||
if doc.summary != doc.summary.lstrip(): | |||
errs.append('Summary contains heading whitespaces.') | |||
elif (doc.is_function_or_method | |||
and doc.summary.split(' ')[0][-1] == 's'): | |||
and doc.summary.split(' ')[0][-1] == 's'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it'd be better to not do unrelated changes
scripts/validate_docstrings.py
Outdated
return [ | ||
"{} {} {}".format(s.count, s.error_code, s.message) | ||
for s in stats.statistics_for('') | ||
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd personally prefer to yield from application.guide.stats.statistics_for('')
instead of returning the string. In case the information is used in more than one place, having the components can be useful.
Also, it's an opinion, but I wouldn't use a property here, as this feels more like an action than an attribute. A method validate_pep8
seems easier to understand.
Signed-off-by: Fabian Haase <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking great. Still few comments. And we'll probably have to wait to #23161 to reuse the code extraction from the examples, but it should be merged very soon.
|
||
Examples | ||
-------- | ||
>>> import pandas as pd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't realize before, but there is something tricky here. The convention is to assume numpy and pandas are always imported before the examples. I personally would be explicit, but that's how all the examples are made.
This means two things:
- pep8 will fail because of this, unless we add the imports when validating
- we should use different examples for the tests (may be not imorting
math
?)
And as I said in the previous review, I'd like to have more examples of pep8 failures, like missing spaces around an operator. Also, these examples are too long for what is being tested.
scripts/validate_docstrings.py
Outdated
@@ -24,6 +24,10 @@ | |||
import inspect | |||
import importlib | |||
import doctest | |||
import tempfile | |||
|
|||
from flake8.main import application as flake8 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please, just import flake8.main.application
, and then use app = flake8.main.application.Application()
. It makes things confusing to use aliases, and even more if the name of the alias is the name of the root module of what is being imported.
scripts/validate_docstrings.py
Outdated
def validate_pep8(self): | ||
create_function = 'def {name}():\n' \ | ||
' """\n{examples}\n """\n' \ | ||
' pass\n' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As said, I have a preference to have only the examples code in the temporary file. Extracting the code from the examples is trivial, and once #23161 is merged we'll have a property in this class that returns the lines of code.
What you will have to do is before writing the code in the examples, write import numpy as np
and same for pandas, so flake8 does not complain because of that.
scripts/validate_docstrings.py
Outdated
|
||
application = flake8.Application() | ||
application.initialize(["--doctests", "--quiet"]) | ||
application.run_checks([file.name]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you just leave the writing to the file and this line inside the context manager, and creating the app object, initializing... have it outside?
scripts/validate_docstrings.py
Outdated
@@ -490,6 +514,13 @@ def validate_one(func_name): | |||
for param_err in param_errs: | |||
errs.append('\t{}'.format(param_err)) | |||
|
|||
pep8_errs = [error for error in doc.validate_pep8()] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is easier to read:
pep8_errs = [error for error in doc.validate_pep8()] | |
pep8_errs = list(doc.validate_pep8()) |
Co-Authored-By: FHaase <[email protected]>
Signed-off-by: Fabian Haase <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks quite good, added a couple of comments about what to expect from #23161, and some style preferences, but we're almost there.
I'm really looking forward to have this merged. After that I'll add this validation to the CI, so we don't need to keep reviewing the pep8 of examples manually, which is a significant amount of work.
Thanks for all the work on this.
""" | ||
pass | ||
|
||
def pandas_imported(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the misunderstanding. numpy_imported
and pandas_imported
and the corresponding tests are not needed, as that is already tested in #23161.
As mentioned in the previous review, forget about these two imports in the examples in this PR. You are repeating what's implemented in #23161 which will be merged earlier.
' are imported as pd or np',)), | ||
('BadExamples', 'pandas_imported', | ||
('F811 It\'s assumed that pandas and numpy' | ||
' are imported as pd or np',)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these two examples need to be removed
scripts/validate_docstrings.py
Outdated
""" | ||
content = ''.join(('import numpy as np; ' | ||
'import pandas as pd # noqa: F401,E702\n', | ||
*self.examples_source_code)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wouldn't it be simpler to join by \n
instead, and remove the semicolon and the noqa?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- regarding E702: with one line the great thing was to achieve intercepting of the error message.
- regarding F401: All examples not using both: np and pd would fail because of an unused import.
The implementation of self.examples_source_code
of PR #23161 returns an array with \n
at the end, so joining with \n
isn't possible.
scripts/validate_docstrings.py
Outdated
if statistic.message.endswith('from line 1'): | ||
statistic.message = "It's assumed that pandas and numpy" \ | ||
" are imported as pd or np" | ||
yield statistic |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess with the original yield from
we may get a pep8 error of duplicate import, or something similar. I'm fine with that. Besides the pep8 errors, we'll have the error of the other section implemented in #23161, which should make it very easy for users to identify the problem.
So, I'd simply leave the yield from ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found F811 redefinition of unused 'np' from line 1
could be confusing when read first.
scripts/validate_docstrings.py
Outdated
application = flake8.main.application.Application() | ||
application.initialize(["--quiet"]) | ||
application.run_checks([file.name]) | ||
application.report() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I wasn't clear in my previous review about this. This is what I'd do:
application = flake8.main.application.Application()
application.initialize(["--quiet"])
with tempfile.NamedTemporaryFile(mode='w') as f:
f.write(content)
f.flush()
application.run_checks([f.name])
application.report()
I think it's a good practice to have inside the context manager, the code that requires to be there.
Also, as this file is going quite significantly, I think it's better to keep all this functionality in a single function, so each validation is somehow isolated. I think in another context it makes more sense what you did of splitting, but I don't think we'll reuse the temporary file creation, and I think the improvement in simplicity makes it worth.
scripts/validate_docstrings.py
Outdated
yield file | ||
|
||
def validate_pep8(self): | ||
if self.examples: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
small thing, but my preference is having:
if not self.examples:
return
this avoids having the rest of the function indented, and in my opinion increases readability.
Signed-off-by: Fabian Haase <[email protected]>
…docstrings Signed-off-by: Fabian Haase <[email protected]>
Signed-off-by: Fabian Haase <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if not self.examples: | ||
return | ||
|
||
content = ''.join(('import numpy as np # noqa: F401\n', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
may be we can add a comment before this line explaining why the F401
is required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well the content basically never shows as it's only existing within tmp files. And if someone thinks it's unnecessary the tests would fail.
So I don't think commenting is needed there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment was not for the temp file, but our code, and I think it's useful to know what the F401
is about when reading this part of code (without having to remove it and see that fails). Anyway, this is merged now. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
merge when satisfied
Thanks @FHaase, extremely useful PR |
git diff upstream/master -u -- "*.py" | flake8 --diff
Improvement to PR #23381 regarding issue #23154
Enables
--doctests
flag, resulting in doctests being checked by flake8Uses flake8.api.legacy to invoke the tests.
Downsides are:
flake8.get_style_guide
don't have an effect. Thussetup.cfg
had to be enhanced.python ./scripts/validate_docstrings.py pandas.read_excel
Help in the form of
Use "flake8 pandas/util/_decorators.py" for further information.
could be added to the output: