Feature Suggestion: Include All Pages in the XML Sitemaps #6903
Hi @eliasdabbas! I'm not too familiar with how including every single page in the sitemap.xml works. Do you have some documentation you can point me to so I can read about this? Our current sitemap uses
Hi @humitos Sure. The RTD sitemap contains only two pages; it does not include links to Configuration or Custom Domains, for example. Ideally, I think the sitemap should include all those sub-pages (if this is what you meant by how it works?).

Now imagine writing a procedure to crawl the pages of readthedocs.io, where Google has around 900k pages indexed. How would you know when to re-crawl pages? All of them, every day? Weekly? Only some of them, and which? If all 900k URLs were included in the sitemap(s), with the `lastmod` tag on each, search engines would know exactly which pages changed and when.

From the Google sitemap documentation:
An article from Search Engine Journal has some additional info:

This tweet from a leading Google Webmaster Trends Analyst says:
Crawling works approximately like this (search engines also discover pages through links, of course):
Hope this makes sense. If you need any further clarifications or details, please let me know.
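To make the proposal concrete, here is a hypothetical sketch (using only Python's standard library; the URLs and dates are made up, and this is not RTD code) of a sitemap where every page, not just the version root, gets its own entry with a `lastmod` tag:

```python
# Hypothetical sketch: build a <urlset> where every page gets its own
# <url> entry with a <lastmod> tag, so crawlers can skip unchanged pages.
# The page list below is invented for illustration.
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Assumed input: (URL, last-modification datetime) pairs for every page.
pages = [
    ("https://docs.readthedocs.io/en/stable/",
     datetime(2020, 4, 15, tzinfo=timezone.utc)),
    ("https://docs.readthedocs.io/en/stable/config-file/v2.html",
     datetime(2020, 4, 10, tzinfo=timezone.utc)),
]

urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
for loc, lastmod in pages:
    url = ET.SubElement(urlset, "url")
    ET.SubElement(url, "loc").text = loc
    ET.SubElement(url, "lastmod").text = lastmod.isoformat()

print(ET.tostring(urlset, encoding="unicode"))
```

With an entry like this for each page, a crawler can compare `lastmod` against the time of its previous visit and skip pages that have not changed.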
Taking the example of RTD's sitemap.xml:

```xml
<url>
  <loc>https://docs.readthedocs.io/en/stable/</loc>
  <lastmod>2020-04-15T16:12:15.193726+00:00</lastmod>
  <changefreq>weekly</changefreq>
  <priority>1</priority>
</url>
```

Aren't we communicating there that all URLs under `https://docs.readthedocs.io/en/stable/` were last modified at that time?
We don't track the modification time of each file in the documentation. In RTD's example, the `lastmod` is the date of the version's last build.
In the example, we are explicitly saying only that this particular URL has changed at this particular time. Implicitly, Google will visit this page, follow the links it finds, and probably follow them all. But it doesn't necessarily do so, because it also needs to optimize its cost.
I think we can't do that, or at least not with that granularity on each file's modification time. All the files will have the same modification time, since we start each build from scratch and re-upload all the files to storage even if they didn't change.
I think it's better to implement #6938 to give users the ability to build the sitemap.xml as they want instead.
I think that's a very good idea. This way people can get the default implementation, or customize it if they want.
@humitos
Your suggestion to give access to

I checked my project's files, and the good news is that the files' modification times seem to be preserved. I'm aware that there are other files that affect the docs as well.
```python
# current code ############################################
versions = []
for version, priority, changefreq in zip(
    sorted_versions,
    priorities_generator(),
    changefreqs_generator(),
):
    element = {
        'loc': version.get_subdomain_url(),
        'priority': priority,
        'changefreq': changefreq,
        'languages': [],
    }

    # Version can be enabled, but not ``built`` yet. We want to show the
    # link without a ``lastmod`` attribute
    last_build = version.builds.order_by('-date').first()
    if last_build:
        element['lastmod'] = last_build.date.isoformat()
############################################################
```
```python
# Suggested changes:
import os

# ``path.to.docs`` is a placeholder for the project's docs directory
from path.to.docs.conf import source_suffix

# ``source_suffix`` may be a single string or a list of suffixes
if isinstance(source_suffix, str):
    source_suffix = [source_suffix]

project_inner_pages = [
    file for file in os.listdir(path.to.docs)
    # compare the full suffix (e.g. '.rst'), not just the last dot-part
    if os.path.splitext(file)[1] in source_suffix
]
project_inner_pages = [os.path.abspath(file) for file in project_inner_pages]

versions = []
for version, priority, changefreq in zip(
    # loop over `sorted_versions` and `project_inner_pages` together
    sorted_versions + project_inner_pages,
    priorities_generator(),
    changefreqs_generator(),
):
    element = {
        # construct the URL by appending the file (this will need some modification)
        'loc': version.get_subdomain_url() + (version if version in project_inner_pages else ''),
        'priority': priority,
        'changefreq': changefreq,
        'languages': [],
        # use the file's mtime if it's one of the ``inner_pages``, else set it later
        'lastmod': os.path.getmtime(version) if version in project_inner_pages else None,
    }

    # Version can be enabled, but not ``built`` yet. We want to show the
    # link without a ``lastmod`` attribute
    last_build = version.builds.order_by('-date').first()
    if last_build and version in sorted_versions:
        element['lastmod'] = last_build.date.isoformat()
```
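As a standalone illustration of the file-discovery step sketched above, here is a hedged, self-contained version (the directory layout, helper name, and suffix list are assumptions for illustration, not RTD settings):

```python
# Standalone sketch of the file-discovery step: find documentation source
# files by suffix and record each file's modification time for <lastmod>.
# ``collect_pages`` and its defaults are hypothetical, not RTD code.
import os
import tempfile
from datetime import datetime, timezone

def collect_pages(docs_dir, source_suffix=(".rst", ".md")):
    """Return (relative path, lastmod ISO timestamp) for each source file."""
    pages = []
    for root, _dirs, files in os.walk(docs_dir):
        for name in files:
            if os.path.splitext(name)[1] in source_suffix:
                full = os.path.join(root, name)
                mtime = os.path.getmtime(full)
                lastmod = datetime.fromtimestamp(mtime, tz=timezone.utc).isoformat()
                pages.append((os.path.relpath(full, docs_dir), lastmod))
    return sorted(pages)

# Quick demonstration against a throwaway docs tree.
with tempfile.TemporaryDirectory() as docs_dir:
    for name in ("index.rst", "config.rst", "notes.txt"):
        with open(os.path.join(docs_dir, name), "w") as fh:
            fh.write("stub")
    pages = collect_pages(docs_dir)
    print(pages)  # notes.txt is skipped: its suffix is not in source_suffix
```

As noted in the comment above, in RTD's build model all uploaded files would share one mtime, which is the main obstacle to this approach.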
This is probably the main reason why this doesn't exist yet. Sitemaps were introduced in the last year or so, I believe, and we don't have much knowledge about what the impact was, if any.

I'm not convinced that it's good to implement and maintain this inside RTD's code. RTD already provides a way to support custom sitemaps. The dynamic part needed here is the knowledge of which versions are active at the moment the request is made. In that case, I'd put more effort into RTD generating a sitemap index (see #5391 and https://www.sitemaps.org/protocol.html#index) that lists the per-version sitemap.xml files generated by the sphinx-sitemap extension on each version.

Summarizing, IMHO, I'd say that it's a good feature, but it's a better fit for the sphinx-sitemap extension than for Read the Docs itself.
Thanks @humitos! I didn't realize the sitemaps were a new feature. Thanks again.
Many times we update the documentation without making a new release (typos, better examples, etc.). Search engines don't get notified about these changes, because sitemaps currently only contain the main pages of projects.
Including all the URLs of a project would have two benefits:
1. Better indexing for projects: search engines learn about changes immediately, leading to better and more timely indexing of the updated pages, with more accurate information.
2. Optimize bandwidth for the RTD website: when every page is included in the sitemap and contains the `lastmod` tag, search engines won't unnecessarily re-crawl pages that haven't changed since they were last crawled.

I'm not very familiar with Django, but I'm more than happy to help in any way I can if you think this would be a good thing to have.
Thanks!