
Feature Suggestion: Include All Pages in the XML Sitemaps #6903


Closed
eliasdabbas opened this issue Apr 14, 2020 · 10 comments
Labels
Improvement Minor improvement to code

Comments

@eliasdabbas

Many times we update the documentation without making a new release (typos, better examples, etc.). Search engines don't get notified about these changes, because sitemaps currently only contain the main pages of projects.

Including all the URLs of a project would have two benefits:

1. Better indexing for projects: Search engines learn about those changes immediately, leading to better and more timely indexing of the updated pages with more accurate information.

2. Optimized bandwidth for the RTD website: When every page is included in the sitemap and contains the lastmod tag, search engines won't unnecessarily re-crawl pages that haven't changed since their last crawl.

I'm not very familiar with Django, but I'm more than happy to help in any way I can if you think this would be a good thing to have.

Thanks!

@stsewd stsewd added Improvement Minor improvement to code Needed: design decision A core team decision is required labels Apr 14, 2020
@humitos
Member

humitos commented Apr 21, 2020

Hi @eliasdabbas! I'm not too familiar with how including every single page in the sitemap.xml works. Do you have some documentation you can point me to so I can read about this?

Our current sitemap uses daily for the latest version and weekly for stable versions in the changefreq attribute, which seems reasonable, I'd say. However, I'm interested in knowing more about the lastmod on each URL and how it would affect search engines compared with our current solution.

@humitos humitos added Needed: more information A reply from issue author is required and removed Needed: design decision A core team decision is required labels Apr 21, 2020
@eliasdabbas
Author

Hi @humitos

Sure. The RTD sitemap contains only two pages; it does not include links to Configuration or Custom Domains, for example. Ideally, I think the sitemap should include all those sub-pages (if that's what you meant by how it works).

Now imagine writing a procedure to crawl the pages of readthedocs.io, where Google has around 900k pages indexed. How would you know when to re-crawl pages? All of them, every day, weekly, or only some of them, and which?

If all 900k URLs were included in the sitemap(s), with the lastmod tag, then Google can compare the current lastmod with the one it has saved from the previous crawl, and crawl accordingly.

From the Google sitemap documentation:

Google reads the <lastmod> value, but if you misrepresent this value, we will stop reading it.

An article from Search Engine Journal has some additional info:
https://www.searchenginejournal.com/technical-seo/xml-sitemaps/#close

This tweet from a leading Google Webmaster Trends Analyst says:

The URL + last modification date is what we care about for websearch.

Crawling works approximately like this (search engines also discover pages through links, of course); a minimal sketch of the re-crawl decision follows the list:

  1. Visit /robots.txt
  2. Get the crawling rules, and links to sitemaps
  3. Follow all links in the sitemaps
  4. Save copy of sitemaps
  5. Repeat (crawling only newly-created or updated URLs)
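
As a purely illustrative sketch of step 5 (not any search engine's actual implementation), assuming the sitemap has already been parsed into (url, lastmod) pairs; the URLs and timestamps below are made up:

# Illustrative only: re-crawl a URL only when it is new or its <lastmod> changed
previously_seen = {}  # url -> lastmod saved from the previous crawl

def pages_to_recrawl(entries):
    for url, lastmod in entries:
        if previously_seen.get(url) != lastmod:
            yield url
            previously_seen[url] = lastmod

entries = [
    ('https://docs.readthedocs.io/en/stable/', '2020-04-15T16:12:15+00:00'),
    ('https://docs.readthedocs.io/en/stable/config-file/v2.html', '2020-04-10T09:00:00+00:00'),
]
print(list(pages_to_recrawl(entries)))  # everything on the first run; only changed URLs afterwards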

Hope this makes sense. If you need any further clarifications or details, please let me know.
Thanks!

@no-response no-response bot removed the Needed: more information A reply from issue author is required label Apr 21, 2020
@humitos
Member

humitos commented Apr 21, 2020

Taking the example of RTD's sitemap.xml,

  <url>
    <loc>https://docs.readthedocs.io/en/stable/</loc>
    <lastmod>2020-04-15T16:12:15.193726+00:00</lastmod>
    <changefreq>weekly</changefreq>
    <priority>1</priority>
  </url>

Aren't we communicating there that all URLs under /en/stable/ have to be re-crawled on a weekly basis?

If all 900k URLs were included in the sitemap(s), with the lastmod tag, then Google can compare the current lastmod with the one it has saved from the previous crawl, and crawl accordingly.

We don't track the modification time of each file in the documentation. In RTD's example, the lastmod field is the date of the last time the stable version was built.

@eliasdabbas
Author

In the example, we are explicitly saying only that this particular URL has changed at this particular time. Implicitly, Google will visit this page, follow the links it finds, and probably crawl them all. But it doesn't necessarily do so, because it also needs to optimize its crawling cost.
This actually happened to me: I updated some pages, but the changes weren't reflected in Google's index.

Yes, lastmod is based on when the version was built, so in my suggestion we would
go through the rst and/or md files and register their modified times. I believe this would be done in ServeSitemapXMLBase.
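
A rough sketch of what I mean by registering the modification times, assuming access to a local checkout of the docs sources (the docs/ path and the suffixes are illustrative):

import os
from datetime import datetime, timezone

def source_lastmods(docs_dir='docs', suffixes=('.rst', '.md')):
    # Map each source file to its mtime as an ISO-8601 timestamp (UTC)
    lastmods = {}
    for root, _dirs, files in os.walk(docs_dir):
        for name in files:
            if os.path.splitext(name)[1] in suffixes:
                path = os.path.join(root, name)
                lastmods[path] = datetime.fromtimestamp(
                    os.path.getmtime(path), tz=timezone.utc
                ).isoformat()
    return lastmods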

@humitos
Member

humitos commented Apr 21, 2020

Yes, lastmod is based on when the version was built, so in my suggestion we would go through the rst and/or md files and register their modified times.

I think we can't do that, or at least not with that granularity of per-file modification times. All the files will have the same modification time, since we start each build from scratch and re-upload all the files to storage even if they didn't change.

I believe this would be done in ServeSitemapXMLBase.

I think it's better to implement #6938 to give the users the ability to build the sitemap.xml as they want instead.

@eliasdabbas
Author

I think it's better to implement #6938 to give the users the ability to build the sitemap.xml as they want instead.

I think that's a very good idea. This way people can get the default implementation, or customize one if they want.

@eliasdabbas
Author

@humitos
I had a few thoughts about this that I'd like to share:

  • Custom sitemaps: The sitemaps spec is very limited, containing basically a list of URLs with optional lastmod, changefreq, and priority tags. Furthermore, I'm not sure what customization people would want to implement other than making sure their newly created (or updated) pages get indexed as soon as possible on Google (which would be resolved if we add all URLs with their lastmods).
    I quickly checked the number of issues containing "sitemap", and it turns out there were 13 out of more than 3.5k in the project's lifetime. So, I don't think it is a high priority for most users.

Your suggestion to give access to robots.txt, as you mention in #6938, is actually a much better approach.

I think we can't do that, or at least not with that granularity of per-file modification times. All the files will have the same modification time, since we start each build from scratch and re-upload all the files to storage even if they didn't change.

I checked my project's files, and the good news is that the rst files have different modification times even though the documentation was built a few days ago. So, I took a quick shot at an approach for implementing this in the ServeSitemapXMLBase class:

I'm aware that there are other files that affect the docs, e.g. the .py modules, which we could also incorporate, but this is a simple initial suggestion.

  1. Get all files in the docs/ directory ending with any of the supported source_suffix extensions from conf.py.
  2. Iterate through them and add their modification times to the sitemap.
# current code ############################################
versions = []
for version, priority, changefreq in zip(
        sorted_versions,
        priorities_generator(),
        changefreqs_generator(),
):
    element = {
        'loc': version.get_subdomain_url(),
        'priority': priority,
        'changefreq': changefreq,
        'languages': [],
    }

    # Version can be enabled, but not ``built`` yet. We want to show the
    # link without a ``lastmod`` attribute
    last_build = version.builds.order_by('-date').first()
    if last_build:
        element['lastmod'] = last_build.date.isoformat()

############################################################


# Suggested changes:

import os
from datetime import datetime, timezone

from path.to.docs.conf import source_suffix

# ``source_suffix`` can be a single string or a list in conf.py
if isinstance(source_suffix, str):
    source_suffix = [source_suffix]

docs_dir = 'path/to/docs'  # placeholder for the project's docs directory
project_inner_pages = [
    os.path.join(docs_dir, file)
    for file in os.listdir(docs_dir)
    if os.path.splitext(file)[1] in source_suffix
]

versions = []
for version, priority, changefreq in zip(
        # loop over `sorted_versions` and `project_inner_pages`
        sorted_versions + project_inner_pages,
        priorities_generator(),
        changefreqs_generator(),
):
    is_inner_page = version in project_inner_pages
    element = {
        # for inner pages this is the source file path; it still needs to be
        # turned into a full URL under the version's subdomain
        'loc': version if is_inner_page else version.get_subdomain_url(),
        'priority': priority,
        'changefreq': changefreq,
        'languages': [],
        # use the source file's modification time for inner pages;
        # otherwise fill it in from the last build below
        'lastmod': datetime.fromtimestamp(
            os.path.getmtime(version), tz=timezone.utc
        ).isoformat() if is_inner_page else None,
    }

    # Version can be enabled, but not ``built`` yet. We want to show the
    # link without a ``lastmod`` attribute
    if not is_inner_page:
        last_build = version.builds.order_by('-date').first()
        if last_build:
            element['lastmod'] = last_build.date.isoformat()

@humitos
Member

humitos commented Apr 23, 2020

I quickly checked the number of issues containing "sitemap", and it turns out there were 13 out of more than 3.5k in the project's lifetime. So, I don't think it is a high priority for most users.

This is probably the main reason why this does not exist yet. Sitemaps were introduced in the last year or so, I believe. We don't have much knowledge about what the impact has been, if any.

I checked my project's files, and the good news is that the rst files have different modification times even though the documentation was built a few days ago. So, I took a quick shot at an approach for implementing this in the ServeSitemapXMLBase class:

I'm not convinced that it's good to implement and maintain this inside RTD's code. RTD already provides a way to support custom sitemaps (defining one in the robots.txt, as mentioned in #6938), so the code you are suggesting is probably a better fit for the sphinx-sitemap extension than for Read the Docs' source code.

The view ServeSitemapXMLBase is executed dynamically on each request (if not cached), but the modification times and everything else will be exactly the same unless there was a new build. In that case, instead of generating the response dynamically, it's better to generate it just once at build time and then serve the static sitemap.xml produced by that extension.
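
For reference, enabling sphinx-sitemap is roughly a conf.py change like the following sketch (the base URL is illustrative; check the extension's documentation for the exact options):

# conf.py -- rough sketch; see the sphinx-sitemap docs for details
extensions = [
    # ... the project's existing extensions ...
    'sphinx_sitemap',
]

# Base URL of the hosted docs (illustrative value)
html_baseurl = 'https://my-project.readthedocs.io/en/stable/'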

The dynamic part needed here is knowing which versions are active at the moment the request is made. In that case, I'd put more effort into RTD generating a sitemap index (see #5391 and https://www.sitemaps.org/protocol.html#index) that lists all the per-version sitemap.xml files generated by the sphinx-sitemap extension on each version.
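
For reference, a sitemap index following the protocol above is just an XML list of sitemap locations, roughly like this (the URLs are illustrative):

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://docs.readthedocs.io/en/stable/sitemap.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://docs.readthedocs.io/en/latest/sitemap.xml</loc>
  </sitemap>
</sitemapindex>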

Summarizing, IMHO, I'd say that it's a good feature but it's a better fit for the sphinx-sitemap extension than Read the Docs itself.

@eliasdabbas
Author

Thanks @humitos !

I didn't realize the sitemaps were a new feature.
I see your point regarding the implementation, and that it fits better in a separate project. Just wanted to check if there was a simple solution for it.

Thanks again.
