DOC: scripts/validate_docstrings.py should clean up files created from docstrings #24209

mroeschke · 2018-12-10T19:50:05Z

Discovered in #24194

Running

$ python scripts/validate_docstrings.py pandas.read_excel

Produces a file tmp.xlsx in the pandas directory from a docstring in read_excel. We don't want to be accidentally committing these files; therefore, validate_docstrings.py should be removing any temporary artifacts after runs.

cc @datapythonista

The text was updated successfully, but these errors were encountered:

datapythonista · 2018-12-10T20:25:36Z

I think the same happen when running doctests. In theory we added # doctest: +SKIP to all the code saving files in #23201.

That's probably not the best solution, as the comment is visible, and in some cases we open the files after generating them, and we need to skip larger parts of code.

Not quite sure what would be the best approach in the long term.

515hikaru · 2018-12-12T15:33:04Z

If you only want to prevent accidentally committing, how about adding tmp.xlsx in .gitignore?

I think it is dangerous to forcibly delete file.

FHaase · 2018-12-12T16:44:37Z

Maybe we could create them in a tmp directory like /tmp if they don't have to be in the pandas folders.

Geekly · 2018-12-13T17:17:59Z

It seems like them in a particular folder that's included in .gitignore is both flexible and backwards compatible. There are plenty of generated files that are already managed that way. What are the downsides?

datapythonista · 2018-12-13T17:39:34Z

Saving to /tmp/data.csv doesn't work in Windows, and would be nice to keep it multiplatform, so users can copy-paste examples and they run.

Using the module tempfile makes the code more complex and distracts the attention from the pandas functionality.

May be the best solution is to change the current directory in sphinx, so we can use something like df.to_csv('data/my_dataset.csv'), and we can control where this is saved (so it can be cleaned, or ignored in .gitignore). Not sure if that's easy to implement.

WillAyd · 2018-12-13T18:00:40Z

What if we don't use files and opt for StringIO / BytesIO objects instead? With Python2 getting dropped soon we should just be able to make calls to io.StringIO and io.BytesIO

.gitignore is not a great solution as all that does is prevent files from getting committed, but doesn't prevent us from creating files that we don't need to

Geekly · 2018-12-13T19:04:59Z

So the issue, although related to committing inappropriate files, isn't that explicitly. It sounds like you either want to control which files are created, or which files are automatically deleted before committing even becomes an issue.

datapythonista · 2018-12-13T23:16:05Z

In my opinion, in an example wants to show to a beginner user how to save a csv file, showing df.to_csv('my_file.csv') is very clear, but a tempfile, a StringIO... is not. They won't be able to copy-paste, not understand that just providing a string is enough, and they'll get distracted trying to understand tempfile and StringIO instead to .to_csv().

From our side of keeping things clean in the pandas repo directories, I agree tempfile or StringIO are probably the best options, but I think we need to make sure that the documentation shows what we want to show, and be as clear as possible.

mroeschke · 2018-12-14T00:05:24Z

I think that's a fair point @datapythonista.

The primary concern is avoiding committing/shipping unnecessary files (i.e. .gitignore). The ideal ask is removal of these files so we don't have essentially a "clutter" folder. So I'd be okay isolating all generated example files to a dedicated folder for now unless there's a straightforward way to remove these files once our documentation validation/testing scripts run.

mroeschke added Docs good first issue labels Dec 10, 2018

alanderex mentioned this issue Feb 2, 2019

Fix quotes position in pandas.core, typos and misspelled parameters. #25093

Merged

4 tasks

mroeschke removed the good first issue label Nov 14, 2021

mroeschke mentioned this issue Dec 1, 2021

DOC: validate_docstrings cleans up leftover files; doctest +SKIP file examples #44711

Merged

3 tasks

jreback added this to the 1.4 milestone Dec 2, 2021

jreback closed this as completed in #44711 Dec 2, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DOC: scripts/validate_docstrings.py should clean up files created from docstrings #24209

DOC: scripts/validate_docstrings.py should clean up files created from docstrings #24209

mroeschke commented Dec 10, 2018

datapythonista commented Dec 10, 2018

515hikaru commented Dec 12, 2018

FHaase commented Dec 12, 2018

Geekly commented Dec 13, 2018

datapythonista commented Dec 13, 2018

WillAyd commented Dec 13, 2018

Geekly commented Dec 13, 2018

datapythonista commented Dec 13, 2018

mroeschke commented Dec 14, 2018

DOC: scripts/validate_docstrings.py should clean up files created from docstrings #24209

DOC: scripts/validate_docstrings.py should clean up files created from docstrings #24209

Comments

mroeschke commented Dec 10, 2018

datapythonista commented Dec 10, 2018

515hikaru commented Dec 12, 2018

FHaase commented Dec 12, 2018

Geekly commented Dec 13, 2018

datapythonista commented Dec 13, 2018

WillAyd commented Dec 13, 2018

Geekly commented Dec 13, 2018

datapythonista commented Dec 13, 2018

mroeschke commented Dec 14, 2018