sitemap.xml contains too many versions #11584
Comments
It's hard to make a decision that works for everybody here, since projects use versions in pretty different, and sometimes unexpected, ways. However...
I think a good compromise could be to not expose hidden versions in `sitemap.xml`. We can filter out these hidden versions in `readthedocs/proxito/views/serve.py` (lines 859 to 862 at fa36fd0).
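For illustration, a minimal sketch of that kind of filter, using made-up `Version` objects rather than Read the Docs' actual models (the `active`/`hidden` field names are assumptions based on the discussion, not the real queryset in the linked code):

```python
from dataclasses import dataclass


@dataclass
class Version:
    # Made-up stand-in for RTD's Version model; field names are assumptions.
    slug: str
    active: bool
    hidden: bool


def sitemap_versions(versions):
    """Keep only versions that should be exposed in sitemap.xml."""
    return [v for v in versions if v.active and not v.hidden]


versions = [
    Version("latest", active=True, hidden=False),
    Version("stable", active=True, hidden=False),
    Version("0.17.0", active=True, hidden=True),  # hidden in the flyout
]
print([v.slug for v in sitemap_versions(versions)])  # -> ['latest', 'stable']
```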
Isn't this possible with something like https://github.com/jdillard/sphinx-sitemap?
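If that route were taken, a minimal `conf.py` sketch along the lines of the sphinx-sitemap README might look like this (the base URL is a placeholder, and the settings shown are the ones that extension documents):

```python
# conf.py -- minimal sphinx-sitemap setup (sketch based on the extension's docs).
extensions = ["sphinx_sitemap"]

# sphinx-sitemap needs the canonical base URL of the published docs.
html_baseurl = "https://docs.example.com/"

# Placeholders documented by sphinx-sitemap for language/version-aware URLs.
sitemap_url_scheme = "{lang}{version}{link}"
```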
Should we remove this chunk of code from our `robots.txt`, since that's the recommended way? (see `readthedocs/templates/robots.txt`, lines 2 to 6 at fa36fd0)
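Related to the `robots.txt` discussion: a URL blocked via `robots.txt` is never fetched, so crawlers never see its `noindex` tag (this is the point the footnotes below make). A small demonstration with Python's standard-library parser; the `Disallow` rule is hypothetical, not RTD's actual template:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt similar in spirit to what gets discussed here.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /en/0.17.0/",  # a hidden version
])

# The crawler may not fetch the blocked page, so any noindex tag
# inside that page remains invisible to it.
print(rp.can_fetch("*", "https://docs.example.com/en/0.17.0/index.html"))  # False
print(rp.can_fetch("*", "https://docs.example.com/en/stable/index.html"))  # True
```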
I had a user comment on hidden versions as well. I think removing hidden versions would be a helpful change here.
Details
Expected Result
RTD should include only a handful of versions (which ones? maybe just `/stable` and `/latest`?), so that users don't arrive at old ones from search engines.

See for example what https://docs.rs/ does (from https://docs.rs/-/sitemap/c/sitemap.xml and https://docs.rs/sitemap.xml).
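As a concrete illustration of the expected output, a sketch that emits a sitemap restricted to those two versions (the URLs and `lastmod` dates are placeholders, not anything RTD produces):

```python
from xml.etree.ElementTree import Element, SubElement, tostring

# Build a minimal sitemap.xml exposing only /stable and /latest.
urlset = Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
for slug, lastmod in [("stable", "2024-09-01"), ("latest", "2024-09-15")]:
    url = SubElement(urlset, "url")
    SubElement(url, "loc").text = f"https://docs.example.com/en/{slug}/"
    SubElement(url, "lastmod").text = lastmod  # placeholder date

print(tostring(urlset, encoding="unicode"))
```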
Actual Result
RTD generates a `sitemap.xml` with every version ever released, even versions that are hidden in the flyout.

See for example: http://web.archive.org/web/20230524052604/https://docs.kedro.org/sitemap.xml (most recent version in the Web Archive), or the current one at https://docs.kedro.org/sitemap.xml.
This issue has a high impact on SEO, given that:

- `robots.txt` is not the way to hide pages from search results, according to Google [1] [2] [3].
- It is hard to generate a `sitemap.xml` that has proper `<lastmod>` dates for all past versions, and very difficult to do so for `/latest`.
- Adding `noindex` tags for all documentation versions, as suggested in #10648 ("Add meta tags 'noindex, nofollow' for hidden version"), is very hard if old versions of the project weren't ready for it.
- RTD generates a `robots.txt` that blocks hidden versions from bots, which therefore cannot see the `noindex` tag (see the sketch after this list), hence forcing projects to generate their own manually crafted `robots.txt`.
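One alternative consistent with Google's guidance cited below is to leave hidden versions crawlable but serve them with a `noindex` HTTP header, which doesn't require rebuilding old versions. A self-contained sketch; the paths and server are illustrative, not RTD's actual implementation:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

HIDDEN_PREFIXES = ("/en/0.17.0/",)  # hypothetical hidden versions


class DocsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        if self.path.startswith(HIDDEN_PREFIXES):
            # Header equivalent of <meta name="robots" content="noindex">;
            # crawlers can still fetch the page, so they do see the rule.
            self.send_header("X-Robots-Tag", "noindex")
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(b"<html><body>docs page</body></html>")


if __name__ == "__main__":
    HTTPServer(("localhost", 8000), DocsHandler).serve_forever()
```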
Footnotes

1. "If you do want to block this page from Google Search, robots.txt is not the correct mechanism to avoid being indexed. To avoid being indexed, remove the robots.txt block and use 'noindex'." (source)
2. "Warning: Don't use a robots.txt file as a means to hide your web pages (including PDFs and other text-based formats supported by Google) from Google search results." (source)
3. "If the page is blocked by a robots.txt file or the crawler can't access the page, the crawler will never see the noindex rule, and the page can still appear in search results, for example if other pages link to it." (source)