Skip to content

DOC: scripts/validate_docstrings.py should clean up files created from docstrings #24209

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
mroeschke opened this issue Dec 10, 2018 · 9 comments · Fixed by #44711
Closed

DOC: scripts/validate_docstrings.py should clean up files created from docstrings #24209

mroeschke opened this issue Dec 10, 2018 · 9 comments · Fixed by #44711
Labels
Milestone

Comments

@mroeschke
Copy link
Member

Discovered in #24194

Running

$ python scripts/validate_docstrings.py pandas.read_excel

Produces a file tmp.xlsx in the pandas directory from a docstring in read_excel. We don't want to be accidentally committing these files; therefore, validate_docstrings.py should be removing any temporary artifacts after runs.

cc @datapythonista

@datapythonista
Copy link
Member

I think the same happen when running doctests. In theory we added # doctest: +SKIP to all the code saving files in #23201.

That's probably not the best solution, as the comment is visible, and in some cases we open the files after generating them, and we need to skip larger parts of code.

Not quite sure what would be the best approach in the long term.

@515hikaru
Copy link

If you only want to prevent accidentally committing, how about adding tmp.xlsx in .gitignore?

I think it is dangerous to forcibly delete file.

@FHaase
Copy link
Contributor

FHaase commented Dec 12, 2018

Maybe we could create them in a tmp directory like /tmp if they don't have to be in the pandas folders.

@Geekly
Copy link

Geekly commented Dec 13, 2018

It seems like them in a particular folder that's included in .gitignore is both flexible and backwards compatible. There are plenty of generated files that are already managed that way. What are the downsides?

@datapythonista
Copy link
Member

Saving to /tmp/data.csv doesn't work in Windows, and would be nice to keep it multiplatform, so users can copy-paste examples and they run.

Using the module tempfile makes the code more complex and distracts the attention from the pandas functionality.

May be the best solution is to change the current directory in sphinx, so we can use something like df.to_csv('data/my_dataset.csv'), and we can control where this is saved (so it can be cleaned, or ignored in .gitignore). Not sure if that's easy to implement.

@WillAyd
Copy link
Member

WillAyd commented Dec 13, 2018

What if we don't use files and opt for StringIO / BytesIO objects instead? With Python2 getting dropped soon we should just be able to make calls to io.StringIO and io.BytesIO

.gitignore is not a great solution as all that does is prevent files from getting committed, but doesn't prevent us from creating files that we don't need to

@Geekly
Copy link

Geekly commented Dec 13, 2018

So the issue, although related to committing inappropriate files, isn't that explicitly. It sounds like you either want to control which files are created, or which files are automatically deleted before committing even becomes an issue.

@datapythonista
Copy link
Member

In my opinion, in an example wants to show to a beginner user how to save a csv file, showing df.to_csv('my_file.csv') is very clear, but a tempfile, a StringIO... is not. They won't be able to copy-paste, not understand that just providing a string is enough, and they'll get distracted trying to understand tempfile and StringIO instead to .to_csv().

From our side of keeping things clean in the pandas repo directories, I agree tempfile or StringIO are probably the best options, but I think we need to make sure that the documentation shows what we want to show, and be as clear as possible.

@mroeschke
Copy link
Member Author

I think that's a fair point @datapythonista.

The primary concern is avoiding committing/shipping unnecessary files (i.e. .gitignore). The ideal ask is removal of these files so we don't have essentially a "clutter" folder. So I'd be okay isolating all generated example files to a dedicated folder for now unless there's a straightforward way to remove these files once our documentation validation/testing scripts run.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants