Custom robots.txt support? #3161

Closed
agjohnson opened this issue Oct 12, 2017 · 13 comments

Labels
Accepted (Accepted issue on our roadmap) · Feature (New feature) · Needed: design decision (A core team decision is required)

Comments

@agjohnson
Contributor

We've talked about blowing away the protected designation, so I'm not sure it makes sense to special-case the Protected privacy level, but maybe we want a separate option for docs that shouldn't be crawled?

agjohnson added the Needed: design decision label Oct 12, 2017
@dend

dend commented Mar 29, 2018

@agjohnson any momentum on this particular item? What is the current recommendation to NOINDEX/NOFOLLOW a site?

@agjohnson
Contributor Author

At the very least, we could kill our global robots.txt redirect in nginx and allow projects to contribute their own robots.txt via a static page in Sphinx.
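For context, a minimal sketch of that "static page in Sphinx" route, assuming a hand-written robots.txt saved next to conf.py (Sphinx copies html_extra_path entries verbatim into the HTML output root):

```python
# conf.py — minimal sketch: ship a project-supplied robots.txt with the HTML build.
# Assumes a robots.txt file sits next to conf.py; Sphinx copies html_extra_path
# entries as-is into the output root (e.g. _build/html/robots.txt).
html_extra_path = ["robots.txt"]
```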

agjohnson added this to the Admin UX milestone Sep 19, 2018
agjohnson added the Accepted label Sep 19, 2018
@humitos
Member

humitos commented Oct 11, 2018

@agjohnson what's the status of this issue?

I'm not sure I clearly understand what action is needed here.

  1. If it's about the Protected privacy level, I think we can close this as won't fix, since we are removing privacy levels from the Community site.
  2. If it's about giving our users a way to upload a robots.txt themselves, I think the solution I proposed at Avoid having old versions of the docs indexed by search engines #2430 (comment) should work (there is also an example repository in that conversation), and we can close this issue.

If neither of those is what you have in mind, please elaborate a little more on what you are considering here.

@dasdachs

dasdachs commented Oct 11, 2018

@humitos the solution provided in #2430 (comment) is not optimal:

  1. Your site can have only one robots.txt file.
  2. The robots.txt file must be located at the root of the website host that it applies to. For instance, to control crawling on all URLs below http://www.example.com/, the robots.txt file must be located at http://www.example.com/robots.txt. It cannot be placed in a subdirectory (for example, at http://example.com/pages/robots.txt). If you're unsure about how to access your website root, or need permissions to do so, contact your web hosting service provider. If you can't access your website root, use an alternative blocking method such as meta tags.

Google support

I think the only viable option is using the "meta tags" method [1][2]. I am working on a workaround for Astropy's docs (refer to issue #7794 and pull request #7874).

I'll be done by the end of the day and will let you know. If it's a good workaround, I'd be happy to document the process.

@humitos
Member

humitos commented Oct 11, 2018

@dasdachs I see. You are right.

I'll be done by the end of the day and will let you know. If it's a good workaround, I'd be happy to document the process.

If the workaround using meta tags is a good one, maybe it would be a good solution to implement as a Sphinx extension. It's still a hack, but at least "an automatic one" 😬
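A rough, illustrative sketch of what such an extension hook could look like (the html-page-context event is real Sphinx API; everything else here is an assumption, not an official solution):

```python
# conf.py — illustrative sketch of the "Sphinx extension" idea above:
# append a robots meta tag to every generated HTML page.
def _add_robots_meta(app, pagename, templatename, context, doctree):
    # "metatags" is the string the built-in HTML templates render inside <head>
    context["metatags"] = context.get("metatags", "") + (
        '<meta name="robots" content="noindex, nofollow">\n'
    )

def setup(app):
    app.connect("html-page-context", _add_robots_meta)
```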

After reading the docs you linked, I don't see a solution coming from Sphinx or without a hack, so I think we should implement this from Read the Docs itself by adding a robotstxt_file: option in our YAML (or similar) and copying it to the root of the subdomain. I'm not sure that's possible, though.

@humitos
Member

humitos commented Oct 11, 2018

I think we should implement this from Read the Docs itself by adding a robotstxt_file: option in our YAML

This is not trivial.

With that file, we would need to:

  1. append our own set of rules to the custom robots.txt
  2. sync the result to all our web servers
    • since this file will live outside the Sphinx output, we need to adapt that code
  3. modify the nginx rule to try serving the custom robots.txt from the project/version first, and fall back to serving ours

This raises another problem: we have one subdomain with multiple versions but only one root location to serve the robots.txt file from. Which one should we serve?

Since this is a "global setting", I wonder whether it wouldn't be better to add a text box in the admin where the user can paste the contents of that file, or something simpler along those lines.

@stsewd
Member

stsewd commented Oct 11, 2018

I think we should implement this from Read the Docs itself by adding a robotstxt_file: option in our YAML

I doubt this will end up in the YAML, as this is a per-project configuration rather than a per-version one.

@dasdachs

The hack I found could be quite simple (this): add meta tags to the files you don't want indexed.
But because of the global robots.txt, it would have no effect (referring to this answer from Google). A solution using the YAML or a text box seems like the way to go.

@astrofrog

Unfortunately, the idea of adding meta tags isn't really an ideal solution, because we can't add them to all the old versions we host. In the case of astropy, for example, we host a lot of old versions based on GitHub tags, e.g.:

http://docs.astropy.org/en/v1.0/

We can't change all the tags in our GitHub repo for all the old versions, so any solution that involves changes to the repository is a no-go. The only real solution would be the ability to customize robots.txt from the RTD settings interface.

@humitos
Member

humitos commented Jan 16, 2019

@dasdachs @astrofrog we just merged a PR that allows using a custom robots.txt. It will be deployed soon. Here are the docs: https://docs.readthedocs.io/en/latest/faq.html#how-can-i-avoid-search-results-having-a-deprecated-version-of-my-docs

After the deploy, please follow the docs and let us know if it works as you expected.
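For anyone following along, a small example of what such a project-supplied robots.txt could contain to keep an old version (like the /en/v1.0/ docs mentioned above) out of search indexes; the paths are illustrative and depend on your own versions:

```
# illustrative robots.txt — adjust the version paths to your own project
User-agent: *
Disallow: /en/v1.0/
```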

@dasdachs

@humitos This is amazing. Thanks for the great work!

@AmmaraAnis

What is the best way to add a custom robots.txt file and sitemap.xml file to a readthedocs.com external domain?

@humitos
Member

humitos commented Apr 22, 2020

@AmmaraAnis Hi! For robots.txt, you can read this FAQ: https://docs.readthedocs.io/en/latest/faq.html#how-can-i-avoid-search-results-having-a-deprecated-version-of-my-docs

Regarding sitemap.xml, there is no way to modify the default one served at the root yet (see #6938), although you can change the Sitemap: entry in your robots.txt to point to a custom one, and that may work.
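As a sketch of that last suggestion (the sitemap URL is purely illustrative, not a real Read the Docs path):

```
User-agent: *
Allow: /

Sitemap: https://docs.example.com/my-custom-sitemap.xml
```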
