sitemap.xml contains too many versions #11584

Closed
astrojuanlu opened this issue Sep 4, 2024 · 2 comments · Fixed by #11675

Comments

@astrojuanlu
Contributor

astrojuanlu commented Sep 4, 2024

Details

Expected Result

RTD should include only a handful of versions in sitemap.xml (which ones? maybe just /stable and /latest?), so that users don't arrive at old ones from search engines.

See for example what https://docs.rs/ does:

<url>
    <loc>https://docs.rs/clap/latest/clap/</loc>
    <lastmod>2024-08-10T00:24:50.344647+00:00</lastmod>
    <priority>1.0</priority>
</url>
<url>
    <loc>https://docs.rs/clap/latest/clap/all.html</loc>
    <lastmod>2024-08-10T00:24:50.344647+00:00</lastmod>
    <priority>0.8</priority>
</url>

(from https://docs.rs/-/sitemap/c/sitemap.xml and https://docs.rs/sitemap.xml)

Actual Result

RTD generates a sitemap.xml with every version ever released, even versions that are hidden in the flyout.

See for example: http://web.archive.org/web/20230524052604/https://docs.kedro.org/sitemap.xml (most recent version in Web Archive)

Or the current one:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  
  <url>
    <loc>https://docs.kedro.org/en/stable/</loc>
    
    
    <lastmod>2024-08-22T13:46:14.186643+00:00</lastmod>
    
    <changefreq>weekly</changefreq>
    <priority>1</priority>
  </url>
  
  <url>
    <loc>https://docs.kedro.org/en/latest/</loc>
    
    
    <lastmod>2024-09-03T08:58:39.696624+00:00</lastmod>
    
    <changefreq>daily</changefreq>
    <priority>0.9</priority>
  </url>
  
  <url>
    <loc>https://docs.kedro.org/en/0.19.8/</loc>
    
    
    <lastmod>2024-08-22T13:46:14.235672+00:00</lastmod>
    
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
  
  <url>
    <loc>https://docs.kedro.org/en/0.19.7/</loc>
    
    
    <lastmod>2024-08-01T18:53:11.647322+00:00</lastmod>
    
    <changefreq>monthly</changefreq>
    <priority>0.7</priority>
  </url>
  
  <url>
    <loc>https://docs.kedro.org/en/0.19.6/</loc>
    
    
    <lastmod>2024-05-27T16:32:42.584307+00:00</lastmod>
    
    <changefreq>monthly</changefreq>
    <priority>0.6</priority>
  </url>
  
  <url>
    <loc>https://docs.kedro.org/en/0.19.5/</loc>
    
    
    <lastmod>2024-04-22T11:56:55.928132+00:00</lastmod>
    
    <changefreq>monthly</changefreq>
    <priority>0.5</priority>
  </url>
  
  <url>
    <loc>https://docs.kedro.org/en/0.19.4.post1/</loc>
    
    
    <lastmod>2024-05-17T12:25:27.050615+00:00</lastmod>
    
    <changefreq>monthly</changefreq>
    <priority>0.4</priority>
  </url>
  
  <url>
    <loc>https://docs.kedro.org/en/0.19.4/</loc>
    
    
    <lastmod>2024-04-17T17:31:48.999754+00:00</lastmod>
    
    <changefreq>monthly</changefreq>
    <priority>0.3</priority>
  </url>
  
  <url>
    <loc>https://docs.kedro.org/en/0.19.3.post1/</loc>
    
    
    <lastmod>2024-05-17T12:41:00.022532+00:00</lastmod>
    
    <changefreq>monthly</changefreq>
    <priority>0.2</priority>
  </url>
  
  <url>
    <loc>https://docs.kedro.org/en/0.19.3/</loc>
    
    
    <lastmod>2024-02-27T12:44:56.427636+00:00</lastmod>
    
    <changefreq>monthly</changefreq>
    <priority>0.1</priority>
  </url>
  
  <url>
    <loc>https://docs.kedro.org/en/0.19.2.post1/</loc>
    
    
    <lastmod>2024-05-17T13:00:31.858692+00:00</lastmod>
    
    <changefreq>monthly</changefreq>
    <priority>0.1</priority>
  </url>
  
  <url>
    <loc>https://docs.kedro.org/en/0.19.2/</loc>
    
    
    <lastmod>2024-01-22T11:10:15.437956+00:00</lastmod>
    
    <changefreq>monthly</changefreq>
    <priority>0.1</priority>
  </url>
  
  <url>
    <loc>https://docs.kedro.org/en/0.19.1.post1/</loc>
    
    
    <lastmod>2024-05-17T13:17:50.964057+00:00</lastmod>
    
    <changefreq>monthly</changefreq>
    <priority>0.1</priority>
  </url>
  
  <url>
    <loc>https://docs.kedro.org/en/0.19.1/</loc>
    
    
    <lastmod>2024-05-17T10:16:12.387349+00:00</lastmod>
    
    <changefreq>monthly</changefreq>
    <priority>0.1</priority>
  </url>
  
  <url>
    <loc>https://docs.kedro.org/en/0.19.0.post1/</loc>
    
    
    <lastmod>2024-05-17T13:44:43.293697+00:00</lastmod>
    
    <changefreq>monthly</changefreq>
    <priority>0.1</priority>
  </url>
  
  <url>
    <loc>https://docs.kedro.org/en/0.19.0/</loc>
    
    
    <lastmod>2023-12-12T15:24:56.914274+00:00</lastmod>
    
    <changefreq>monthly</changefreq>
    <priority>0.1</priority>
  </url>
  
  <url>
    <loc>https://docs.kedro.org/en/0.18.14/</loc>
    
    
    <lastmod>2023-10-18T15:00:32.073920+00:00</lastmod>
    
    <changefreq>monthly</changefreq>
    <priority>0.1</priority>
  </url>
  
  <url>
    <loc>https://docs.kedro.org/en/0.18.13/</loc>
    
    
    <lastmod>2023-08-31T10:38:01.895157+00:00</lastmod>
    
    <changefreq>monthly</changefreq>
    <priority>0.1</priority>
  </url>
  
  <url>
    <loc>https://docs.kedro.org/en/0.18.12/</loc>
    
    
    <lastmod>2023-08-01T14:35:47.173291+00:00</lastmod>
    
    <changefreq>monthly</changefreq>
    <priority>0.1</priority>
  </url>
  
  <url>
    <loc>https://docs.kedro.org/en/0.18.11/</loc>
    
    
    <lastmod>2023-07-03T12:56:40.838516+00:00</lastmod>
    
    <changefreq>monthly</changefreq>
    <priority>0.1</priority>
  </url>
  
  <url>
    <loc>https://docs.kedro.org/en/0.18.10/</loc>
    
    
    <lastmod>2023-06-08T17:54:10.395047+00:00</lastmod>
    
    <changefreq>monthly</changefreq>
    <priority>0.1</priority>
  </url>
  
  <url>
    <loc>https://docs.kedro.org/en/0.18.9/</loc>
    
    
    <lastmod>2023-05-31T17:01:28.654716+00:00</lastmod>
    
    <changefreq>monthly</changefreq>
    <priority>0.1</priority>
  </url>
  
  <url>
    <loc>https://docs.kedro.org/en/0.18.8/</loc>
    
    
    <lastmod>2023-05-02T12:19:20.672884+00:00</lastmod>
    
    <changefreq>monthly</changefreq>
    <priority>0.1</priority>
  </url>
  
  <url>
    <loc>https://docs.kedro.org/en/0.18.7/</loc>
    
    
    <lastmod>2023-03-22T16:13:32.506938+00:00</lastmod>
    
    <changefreq>monthly</changefreq>
    <priority>0.1</priority>
  </url>
  
  <url>
    <loc>https://docs.kedro.org/en/0.18.6/</loc>
    
    
    <lastmod>2023-03-06T11:57:28.224705+00:00</lastmod>
    
    <changefreq>monthly</changefreq>
    <priority>0.1</priority>
  </url>
  
  <url>
    <loc>https://docs.kedro.org/en/0.18.5/</loc>
    
    
    <lastmod>2023-02-20T17:49:38.767300+00:00</lastmod>
    
    <changefreq>monthly</changefreq>
    <priority>0.1</priority>
  </url>
  
  <url>
    <loc>https://docs.kedro.org/en/0.18.4/</loc>
    
    
    <lastmod>2022-12-05T16:38:35.599612+00:00</lastmod>
    
    <changefreq>monthly</changefreq>
    <priority>0.1</priority>
  </url>
  
  <url>
    <loc>https://docs.kedro.org/en/0.18.3/</loc>
    
    
    <lastmod>2022-09-20T14:36:18.552880+00:00</lastmod>
    
    <changefreq>monthly</changefreq>
    <priority>0.1</priority>
  </url>
  
  <url>
    <loc>https://docs.kedro.org/en/0.18.2/</loc>
    
    
    <lastmod>2022-07-08T15:57:53.845627+00:00</lastmod>
    
    <changefreq>monthly</changefreq>
    <priority>0.1</priority>
  </url>
  
  <url>
    <loc>https://docs.kedro.org/en/0.18.1/</loc>
    
    
    <lastmod>2022-05-09T21:15:51.249996+00:00</lastmod>
    
    <changefreq>monthly</changefreq>
    <priority>0.1</priority>
  </url>
  
  <url>
    <loc>https://docs.kedro.org/en/0.18.0/</loc>
    
    
    <lastmod>2022-03-31T16:09:41.762560+00:00</lastmod>
    
    <changefreq>monthly</changefreq>
    <priority>0.1</priority>
  </url>
  
  <url>
    <loc>https://docs.kedro.org/en/0.17.7/</loc>
    
    
    <lastmod>2022-02-22T16:03:18.877195+00:00</lastmod>
    
    <changefreq>monthly</changefreq>
    <priority>0.1</priority>
  </url>
  
  <url>
    <loc>https://docs.kedro.org/en/0.17.6/</loc>
    
    
    <lastmod>2021-12-09T16:03:21.660457+00:00</lastmod>
    
    <changefreq>monthly</changefreq>
    <priority>0.1</priority>
  </url>
  
  <url>
    <loc>https://docs.kedro.org/en/0.17.5/</loc>
    
    
    <lastmod>2021-09-14T15:14:58.410872+00:00</lastmod>
    
    <changefreq>monthly</changefreq>
    <priority>0.1</priority>
  </url>
  
  <url>
    <loc>https://docs.kedro.org/en/0.17.4/</loc>
    
    
    <lastmod>2021-06-16T09:13:15.174304+00:00</lastmod>
    
    <changefreq>monthly</changefreq>
    <priority>0.1</priority>
  </url>
  
  <url>
    <loc>https://docs.kedro.org/en/0.17.3/</loc>
    
    
    <lastmod>2021-04-21T15:14:55.966484+00:00</lastmod>
    
    <changefreq>monthly</changefreq>
    <priority>0.1</priority>
  </url>
  
  <url>
    <loc>https://docs.kedro.org/en/0.17.2/</loc>
    
    
    <lastmod>2021-03-15T18:13:56.793653+00:00</lastmod>
    
    <changefreq>monthly</changefreq>
    <priority>0.1</priority>
  </url>
  
  <url>
    <loc>https://docs.kedro.org/en/0.17.1/</loc>
    
    
    <lastmod>2023-12-14T13:32:41.781973+00:00</lastmod>
    
    <changefreq>monthly</changefreq>
    <priority>0.1</priority>
  </url>
  
  <url>
    <loc>https://docs.kedro.org/en/0.17.0/</loc>
    
    
    <lastmod>2020-12-17T13:31:06.941428+00:00</lastmod>
    
    <changefreq>monthly</changefreq>
    <priority>0.1</priority>
  </url>
  
  <url>
    <loc>https://docs.kedro.org/en/0.16.6/</loc>
    
    
    <lastmod>2020-10-23T10:47:07.341079+00:00</lastmod>
    
    <changefreq>monthly</changefreq>
    <priority>0.1</priority>
  </url>
  
  <url>
    <loc>https://docs.kedro.org/en/0.16.5/</loc>
    
    
    <lastmod>2020-09-09T12:30:31.680650+00:00</lastmod>
    
    <changefreq>monthly</changefreq>
    <priority>0.1</priority>
  </url>
  
  <url>
    <loc>https://docs.kedro.org/en/0.16.4/</loc>
    
    
    <lastmod>2020-08-28T13:42:10.459247+00:00</lastmod>
    
    <changefreq>monthly</changefreq>
    <priority>0.1</priority>
  </url>
  
  <url>
    <loc>https://docs.kedro.org/en/0.16.3/</loc>
    
    
    <lastmod>2020-07-14T08:55:22.655604+00:00</lastmod>
    
    <changefreq>monthly</changefreq>
    <priority>0.1</priority>
  </url>
  
  <url>
    <loc>https://docs.kedro.org/en/0.16.2/</loc>
    
    
    <lastmod>2020-06-15T15:04:21.101081+00:00</lastmod>
    
    <changefreq>monthly</changefreq>
    <priority>0.1</priority>
  </url>
  
  <url>
    <loc>https://docs.kedro.org/en/0.16.1/</loc>
    
    
    <lastmod>2020-05-21T13:12:40.307395+00:00</lastmod>
    
    <changefreq>monthly</changefreq>
    <priority>0.1</priority>
  </url>
  
  <url>
    <loc>https://docs.kedro.org/en/0.16.0/</loc>
    
    
    <lastmod>2020-05-20T11:14:31.848879+00:00</lastmod>
    
    <changefreq>monthly</changefreq>
    <priority>0.1</priority>
  </url>
  
  <url>
    <loc>https://docs.kedro.org/en/0.15.9/</loc>
    
    
    <lastmod>2020-04-06T15:02:31.669224+00:00</lastmod>
    
    <changefreq>monthly</changefreq>
    <priority>0.1</priority>
  </url>
  
  <url>
    <loc>https://docs.kedro.org/en/0.15.8/</loc>
    
    
    <lastmod>2020-03-05T12:01:28.311014+00:00</lastmod>
    
    <changefreq>monthly</changefreq>
    <priority>0.1</priority>
  </url>
  
  <url>
    <loc>https://docs.kedro.org/en/0.15.7/</loc>
    
    
    <lastmod>2020-02-26T17:14:15.472335+00:00</lastmod>
    
    <changefreq>monthly</changefreq>
    <priority>0.1</priority>
  </url>
  
  <url>
    <loc>https://docs.kedro.org/en/0.15.6/</loc>
    
    
    <lastmod>2020-02-26T13:33:33.871955+00:00</lastmod>
    
    <changefreq>monthly</changefreq>
    <priority>0.1</priority>
  </url>
  
  <url>
    <loc>https://docs.kedro.org/en/0.15.5/</loc>
    
    
    <lastmod>2019-12-12T13:47:06.257893+00:00</lastmod>
    
    <changefreq>monthly</changefreq>
    <priority>0.1</priority>
  </url>
  
  <url>
    <loc>https://docs.kedro.org/en/0.15.4/</loc>
    
    
    <lastmod>2019-10-30T17:44:01.098967+00:00</lastmod>
    
    <changefreq>monthly</changefreq>
    <priority>0.1</priority>
  </url>
  
  <url>
    <loc>https://docs.kedro.org/en/0.15.3/</loc>
    
    
    <lastmod>2019-10-17T15:14:03.667063+00:00</lastmod>
    
    <changefreq>monthly</changefreq>
    <priority>0.1</priority>
  </url>
  
  <url>
    <loc>https://docs.kedro.org/en/0.15.2/</loc>
    
    
    <lastmod>2019-10-08T16:34:50.860364+00:00</lastmod>
    
    <changefreq>monthly</changefreq>
    <priority>0.1</priority>
  </url>
  
  <url>
    <loc>https://docs.kedro.org/en/0.15.0/</loc>
    
    
    <lastmod>2019-09-10T08:57:52.820505+00:00</lastmod>
    
    <changefreq>monthly</changefreq>
    <priority>0.1</priority>
  </url>
  
  <url>
    <loc>https://docs.kedro.org/en/0.14.3/</loc>
    
    
    <lastmod>2019-09-10T08:57:58.726528+00:00</lastmod>
    
    <changefreq>monthly</changefreq>
    <priority>0.1</priority>
  </url>
  
  <url>
    <loc>https://docs.kedro.org/en/develop/</loc>
    
    
    <lastmod>2024-09-03T09:02:55.449397+00:00</lastmod>
    
    <changefreq>monthly</changefreq>
    <priority>0.1</priority>
  </url>
  
</urlset>
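A quick way to see which versions a sitemap like this advertises is to parse it and pull the version slug out of each <loc>. The following is only a sketch: `SAMPLE` is an abbreviated copy of the sitemap above, and `listed_versions` is a hypothetical helper, not something RTD ships.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemap protocol; needed to find <url>/<loc> elements.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Abbreviated sample in the same shape as the sitemap quoted above.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://docs.kedro.org/en/stable/</loc><priority>1</priority></url>
  <url><loc>https://docs.kedro.org/en/latest/</loc><priority>0.9</priority></url>
  <url><loc>https://docs.kedro.org/en/0.19.8/</loc><priority>0.8</priority></url>
  <url><loc>https://docs.kedro.org/en/0.14.3/</loc><priority>0.1</priority></url>
</urlset>
"""


def listed_versions(sitemap_xml: str) -> list[str]:
    """Return the version slug (last path segment) of every <loc> entry."""
    root = ET.fromstring(sitemap_xml.encode("utf-8"))
    slugs = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        # "https://docs.kedro.org/en/0.19.8/" -> "0.19.8"
        slugs.append(loc.text.strip().rstrip("/").rsplit("/", 1)[-1])
    return slugs


if __name__ == "__main__":
    versions = listed_versions(SAMPLE)
    print(len(versions), "versions listed:", ", ".join(versions))
```

Running the same function against the full sitemap text would list every version, including the ones hidden in the flyout.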

This issue has a high impact on SEO, given that:

  • robots.txt is not the way to hide pages from search results, according to Google [1] [2] [3]
  • It's not easy to serve a custom sitemap.xml that has proper <lastmod> dates for all past versions, and very difficult to do so for /latest.
  • Retrofitting noindex tags for all documentation versions, as suggested in Add meta tags "noindex, nofollow" for hidden version #10648, is very hard if old versions of the project weren't ready for it.
  • And even if one could do it, RTD generates a robots.txt that blocks hidden versions from bots, which therefore cannot see the noindex tag, hence forcing projects to generate their own manually crafted robots.txt.

Footnotes

  1. "If you do want to block this page from Google Search, robots.txt is not the correct mechanism to avoid being indexed. To avoid being indexed, remove the robots.txt block and use 'noindex'." (source)

  2. "Warning: Don't use a robots.txt file as a means to hide your web pages (including PDFs and other text-based formats supported by Google) from Google search results." (source)

  3. "If the page is blocked by a robots.txt file or the crawler can't access the page, the crawler will never see the noindex rule, and the page can still appear in search results, for example if other pages link to it." (source)

@humitos
Member

humitos commented Sep 5, 2024

> RTD should include only a handful of versions (which ones? maybe just /stable and /latest?), so that users don't arrive at old ones from search engines.

It's hard to make a decision that works for everybody here, since projects use versions in pretty different, and sometimes unexpected, ways. However...

> RTD generates a sitemap.xml with every version ever released, even versions that are hidden in the flyout.

I think a good compromise could be to not expose hidden versions in sitemap.xml. Since they are hidden, people shouldn't arrive at those. That seems to be the most accurate solution for this, IMO.

We can filter out these hidden versions in

public_versions = Version.internal.public(
    project=project,
    only_active=True,
)
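To make the idea concrete, here is a rough, framework-free sketch of the filtering. It does not use RTD's models: `VersionStub` and `sitemap_versions` are hypothetical stand-ins, and the `active` and `hidden` attributes are assumptions based on the discussion above. In the real code this would presumably be one more filter on the queryset rather than a list comprehension.

```python
from dataclasses import dataclass


@dataclass
class VersionStub:
    """Hypothetical stand-in for a documentation version (not RTD's model)."""

    slug: str
    active: bool = True
    hidden: bool = False


def sitemap_versions(versions):
    """Keep only versions that are active and not hidden."""
    return [v for v in versions if v.active and not v.hidden]


if __name__ == "__main__":
    versions = [
        VersionStub("stable"),
        VersionStub("latest"),
        VersionStub("0.19.8"),
        VersionStub("0.14.3", hidden=True),    # hidden in the flyout
        VersionStub("0.15.0", active=False),   # deactivated version
    ]
    print([v.slug for v in sitemap_versions(versions)])
```

With this filter only stable, latest and 0.19.8 would end up in the sitemap for the sample data.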

> It's not easy to serve a custom sitemap.xml that has proper <lastmod> dates for all past versions, and very difficult to do so for /latest.

Isn't this possible with something like https://github.com/jdillard/sphinx-sitemap?
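For reference, a minimal conf.py sketch for that extension could look like the following. It assumes sphinx-sitemap is installed in the docs build environment; the base URL is a placeholder, and whether the generated file covers the <lastmod> concern raised above is not something this sketch settles.

```python
# conf.py (fragment): a sketch, not a complete Sphinx configuration.

# Enable the sphinx-sitemap extension (assumes the package is installed).
extensions = [
    "sphinx_sitemap",
]

# Base URL of the published docs. Placeholder value: replace with the real
# canonical URL of the version that should be indexed.
html_baseurl = "https://docs.example.org/en/stable/"

# Build sitemap URLs as html_baseurl + page link, without adding extra
# language/version path segments (they are already part of html_baseurl here).
sitemap_url_scheme = "{link}"
```

This produces a per-build sitemap for the pages of one version, which is a different thing from the project-level sitemap.xml that RTD generates across versions.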

> robots.txt is not the way to hide pages from search results, according to Google

Should we remove this chunk of code from our robots.txt template, since that is what Google recommends?

{% for path in hidden_paths %}
Disallow: {{ path }} # Hidden version
{% empty %}
Disallow: # Allow everything
{% endfor %}
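The effect of that chunk can be reproduced with the standard library. The sketch below mimics the template in plain Python (the `User-agent: *` line, the `render_robots` and `can_crawl` helpers, and the example.org URLs are assumptions for illustration) and then asks `urllib.robotparser` whether a crawler may fetch a hidden version. If it may not, the crawler never gets to see a noindex tag on that page.

```python
from urllib.robotparser import RobotFileParser


def render_robots(hidden_paths):
    """Mimic the template chunk above: one Disallow line per hidden path."""
    lines = ["User-agent: *"]  # assumed header, not part of the quoted chunk
    if hidden_paths:
        lines += [f"Disallow: {path} # Hidden version" for path in hidden_paths]
    else:
        lines.append("Disallow: # Allow everything")
    return lines


def can_crawl(robots_lines, url):
    """Return True if a generic crawler is allowed to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch("*", url)


if __name__ == "__main__":
    robots = render_robots(["/en/0.14.3/"])
    for url in (
        "https://docs.example.org/en/0.14.3/index.html",
        "https://docs.example.org/en/stable/index.html",
    ):
        print(url, "->", "crawlable" if can_crawl(robots, url) else "blocked")
```

Dropping the Disallow lines for hidden versions would let crawlers reach those pages and honor a noindex rule, at the cost of the pages being crawled.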

@agjohnson
Contributor

I had a user comment on hidden versions as well. I think removing hidden versions would be a helpful change here.
