
Feature Suggestion: Include All Pages in the XML Sitemaps #6903


Closed
eliasdabbas opened this issue Apr 14, 2020 · 10 comments
Labels
Improvement Minor improvement to code

Comments

@eliasdabbas

Many times we update the documentation without making a new release (typos, better examples, etc.). Search engines don't get notified about these changes, because sitemaps currently only contain the main pages of projects.

Including all the URLs of a project would have two benefits:

1. Better indexing for projects: Search engines learn about those changes immediately, leading to better and more timely indexing of the updated pages with more accurate information.

2. Optimized bandwidth for the RTD website: When every page is included in the sitemap and contains the lastmod tag, search engines won't unnecessarily re-crawl pages that haven't changed since their last crawl.

I'm not very familiar with Django, but I'm more than happy to help in any way I can if you think this would be a good thing to have.

Thanks!

@stsewd stsewd added Improvement Minor improvement to code Needed: design decision A core team decision is required labels Apr 14, 2020
@humitos
Member

humitos commented Apr 21, 2020

Hi @eliasdabbas! I'm not too familiar with how including every single page in the sitemap.xml works. Do you have some documentation you can point me to so I can read about this?

Our current sitemap uses daily for the latest version and weekly for stable versions in the changefreq attribute, which seems reasonable, I'd say. However, I'm interested in knowing more about the lastmod on each URL and how it would affect search engines compared with our current solution.

@humitos humitos added Needed: more information A reply from issue author is required and removed Needed: design decision A core team decision is required labels Apr 21, 2020
@eliasdabbas
Author

Hi @humitos

Sure. The RTD sitemap contains only two pages; it does not include links to Configuration or Custom Domains, for example. Ideally, I think the sitemap should include all those sub-pages (if that's what you meant by how it works).

Now imagine writing a procedure to crawl the pages of readthedocs.io, where Google has around 900k pages indexed. How would you know when to re-crawl pages? All of them, every day, weekly, or only some of them, and which?

If all 900k URLs were included in the sitemap(s), with the lastmod tag, then Google can compare the current lastmod with the one it has saved from the previous crawl, and crawl accordingly.

From the Google sitemap documentation:

Google reads the <lastmod> value, but if you misrepresent this value, we will stop reading it.

An article from Search Engine Journal has some additional info:
https://www.searchenginejournal.com/technical-seo/xml-sitemaps/#close

This tweet from a leading Google Webmaster Trends Analyst says:

The URL + last modification date is what we care about for websearch.

Crawling works approximately like this (search engines also discover pages through links, of course); a minimal sketch of the re-crawl decision follows the list:

  1. Visit /robots.txt
  2. Get the crawling rules, and links to sitemaps
  3. Follow all links in the sitemaps
  4. Save copy of sitemaps
  5. Repeat (crawling only newly-created or updated URLs)
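
As a purely illustrative sketch of step 5 (not any search engine's actual implementation), assuming the sitemap has already been parsed into (url, lastmod) pairs; the URLs and timestamps below are made up:

# Illustrative only: re-crawl a URL only when it is new or its <lastmod> changed
previously_seen = {}  # url -> lastmod saved from the previous crawl

def pages_to_recrawl(entries):
    for url, lastmod in entries:
        if previously_seen.get(url) != lastmod:
            yield url
            previously_seen[url] = lastmod

entries = [
    ('https://docs.readthedocs.io/en/stable/', '2020-04-15T16:12:15+00:00'),
    ('https://docs.readthedocs.io/en/stable/config-file/v2.html', '2020-04-10T09:00:00+00:00'),
]
print(list(pages_to_recrawl(entries)))  # everything on the first run; only changed URLs afterwards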

Hope this makes sense. If you need any further clarifications or details, please let me know.
Thanks!

@no-response no-response bot removed the Needed: more information A reply from issue author is required label Apr 21, 2020
@humitos
Member

humitos commented Apr 21, 2020

Taking the example of RTD's sitemap.xml,

  <url>
    <loc>https://docs.readthedocs.io/en/stable/</loc>
    <lastmod>2020-04-15T16:12:15.193726+00:00</lastmod>
    <changefreq>weekly</changefreq>
    <priority>1</priority>
  </url>

Aren't we communicating there that all URLs under /en/stable/ have to be re-crawled on a weekly basis?

If all 900k URLs were included in the sitemap(s), with the lastmod tag, then Google can compare the current lastmod with the one it has saved from the previous crawl, and crawl accordingly.

We don't track the modification time of each file in the documentation. In RTD's example, the lastmod field is the date of the last time the stable version was built.

@eliasdabbas
Author

In the example, we are explicitly saying only that this particular URL has changed at this particular time. Implicitly, Google will visit this page, follow the links it finds, and probably crawl them all. But it doesn't necessarily do so, because it also needs to optimize its crawling cost.
This actually happened to me: I updated some pages, but the changes weren't reflected in Google's index.

Yes, lastmod is based on when the version was built, so in my suggestion we would
go through the rst and/or md files and register their modified times. I believe this would be done in ServeSitemapXMLBase.
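
A rough sketch of what I mean by registering the modification times, assuming access to a local checkout of the docs sources (the docs/ path and the suffixes are illustrative):

import os
from datetime import datetime, timezone

def source_lastmods(docs_dir='docs', suffixes=('.rst', '.md')):
    # Map each source file to its mtime as an ISO-8601 timestamp (UTC)
    lastmods = {}
    for root, _dirs, files in os.walk(docs_dir):
        for name in files:
            if os.path.splitext(name)[1] in suffixes:
                path = os.path.join(root, name)
                lastmods[path] = datetime.fromtimestamp(
                    os.path.getmtime(path), tz=timezone.utc
                ).isoformat()
    return lastmods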

@humitos
Member

humitos commented Apr 21, 2020

Yes, lastmod is based on when the version was built, so in my suggestion we would go through the rst and/or md files and register their modified times.

I think we can't do that, or at least not with that granularity of per-file modification times. All the files will have the same modification time, since we start each build from scratch and re-upload all the files to storage even if they didn't change.

I believe this would be done in ServeSitemapXMLBase.

I think it's better to implement #6938 to give the users the ability to build the sitemap.xml as they want instead.

@eliasdabbas
Author

I think it's better to implement #6938 to give the users the ability to build the sitemap.xml as they want instead.

I think that's a very good idea. This way people can get the default implementation, or customize one if they want.

@eliasdabbas
Author

@humitos
I had a few thoughts about this that I'd like to share:

  • Custom sitemaps: The sitemaps spec is very limited, containing basically a list of URLs with optional lastmod, changefreq, and priority tags. Furthermore, I'm not sure what customization people would want to implement other than making sure their newly created (or updated) pages get indexed as soon as possible on Google (which would be resolved if we add all URLs with their lastmods).
    I quickly checked the number of issues containing "sitemap", and it turns out there were 13 out of more than 3.5k in the project's lifetime. So, I don't think it is a high priority for most users.

Your suggestion to give access to robots.txt, as you mention in #6938, is actually a much better approach.

I think we can't do that, or at least not with that granularity of per-file modification times. All the files will have the same modification time, since we start each build from scratch and re-upload all the files to storage even if they didn't change.

I checked my project's files, and the good news is that the rst files have different modification times even though the documentation was built a few days ago. So, I took a quick shot at an approach for implementing this in the ServeSitemapXMLBase class:

I'm aware that there are other files that affect the docs, e.g. the .py modules, which we could also incorporate, but this is a simple initial suggestion.

  1. Get all files in the docs/ directory ending with any of the supported source_suffix extensions from conf.py.
  2. Iterate through them and add their modification times to the sitemap.
# current code ############################################
versions = []
for version, priority, changefreq in zip(
        sorted_versions,
        priorities_generator(),
        changefreqs_generator(),
):
    element = {
        'loc': version.get_subdomain_url(),
        'priority': priority,
        'changefreq': changefreq,
        'languages': [],
    }

    # Version can be enabled, but not ``built`` yet. We want to show the
    # link without a ``lastmod`` attribute
    last_build = version.builds.order_by('-date').first()
    if last_build:
        element['lastmod'] = last_build.date.isoformat()

############################################################


# Suggested changes:

import os
from datetime import datetime, timezone

from path.to.docs.conf import source_suffix

# ``source_suffix`` can be a single string or a list in conf.py
if isinstance(source_suffix, str):
    source_suffix = [source_suffix]

docs_dir = 'path/to/docs'  # placeholder for the project's docs directory
project_inner_pages = [
    os.path.join(docs_dir, file)
    for file in os.listdir(docs_dir)
    if os.path.splitext(file)[1] in source_suffix
]

versions = []
for version, priority, changefreq in zip(
        # loop over `sorted_versions` and `project_inner_pages`
        sorted_versions + project_inner_pages,
        priorities_generator(),
        changefreqs_generator(),
):
    is_inner_page = version in project_inner_pages
    element = {
        # for inner pages this is the source file path; it still needs to be
        # turned into a full URL under the version's subdomain
        'loc': version if is_inner_page else version.get_subdomain_url(),
        'priority': priority,
        'changefreq': changefreq,
        'languages': [],
        # use the source file's modification time for inner pages;
        # otherwise fill it in from the last build below
        'lastmod': datetime.fromtimestamp(
            os.path.getmtime(version), tz=timezone.utc
        ).isoformat() if is_inner_page else None,
    }

    # Version can be enabled, but not ``built`` yet. We want to show the
    # link without a ``lastmod`` attribute
    if not is_inner_page:
        last_build = version.builds.order_by('-date').first()
        if last_build:
            element['lastmod'] = last_build.date.isoformat()

@humitos
Member

humitos commented Apr 23, 2020

I quickly checked the number of issues containing "sitemap", and it turns out there were 13 out of more than 3.5k in the project's lifetime. So, I don't think it is a high priority for most users.

This is probably the main reason why this does not exist yet. Sitemaps were introduced in the last year or so, I believe. We don't have much knowledge about what the impact has been, if any.

I checked my project's files, and the good news is that the rst files have different modification times even though the documentation was built a few days ago. So, I took a quick shot at an approach for implementing this in the ServeSitemapXMLBase class:

I'm not convinced that it's good to implement and maintain this inside RTD's code. RTD already provides a way to support custom sitemaps (defining one in the robots.txt, as mentioned in #6938), so the code you are suggesting is probably a better fit for the sphinx-sitemap extension than for Read the Docs' source code.

The view ServeSitemapXMLBase is executed dynamically on each request (if not cached), but the modification times and everything else will be exactly the same unless there was a new build. In that case, instead of generating the response dynamically, it's better to generate it just once at build time and then serve the static sitemap.xml produced by that extension.
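
For reference, enabling sphinx-sitemap is roughly a conf.py change like the following sketch (the base URL is illustrative; check the extension's documentation for the exact options):

# conf.py -- rough sketch; see the sphinx-sitemap docs for details
extensions = [
    # ... the project's existing extensions ...
    'sphinx_sitemap',
]

# Base URL of the hosted docs (illustrative value)
html_baseurl = 'https://my-project.readthedocs.io/en/stable/'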

The dynamic part needed here is knowing which versions are active at the moment the request is made. In that case, I'd put more effort into RTD generating a sitemap index (see #5391 and https://www.sitemaps.org/protocol.html#index) that lists all the per-version sitemap.xml files generated by the sphinx-sitemap extension on each version.
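
For reference, a sitemap index following the protocol above is just an XML list of sitemap locations, roughly like this (the URLs are illustrative):

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://docs.readthedocs.io/en/stable/sitemap.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://docs.readthedocs.io/en/latest/sitemap.xml</loc>
  </sitemap>
</sitemapindex>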

Summarizing, IMHO, I'd say that it's a good feature but it's a better fit for the sphinx-sitemap extension than Read the Docs itself.

@eliasdabbas
Author

Thanks @humitos !

I didn't realize the sitemaps were a new feature.
I see your point regarding the implementation, and that it fits better in a separate project. Just wanted to check if there was a simple solution for it.

Thanks again.
