
Commit 005b010

Merge #1535

1535: Publish an official crawler policy r=sgrif a=sgrif

This is a formalization of a policy that we've been informally enforcing for some time now. The policy boils down to:

- Just use the index if you can.
- If you can't, contact us to see if we can help in a way that doesn't require crawling.
- If you do crawl, limit yourself to 1 request per second.
- You must also provide a user agent, which should actually identify your crawler and have contact information.
- We may still block you if you cause an impact on the integrity of the service.

I chose not to explicitly call out inflating a single crate's download numbers as something that's forbidden, as it felt like doing that would be an instance of ["Don't stuff beans up your nose"](https://en.wikipedia.org/wiki/Wikipedia:Don%27t_stuff_beans_up_your_nose). That behavior falls under the umbrella of "impacting the integrity of the service", though, so this policy does give us an explicit reason to block anyone engaging in it.

Co-authored-by: Sean Griffin <[email protected]>
2 parents b63a985 + a35fea9 commit 005b010

File tree

2 files changed (+44, -1 lines)


app/templates/policies.hbs

Lines changed: 41 additions & 0 deletions
@@ -101,3 +101,44 @@ more details.
  <p>
    Thank you for taking the time to responsibly disclose any issues you find.
  </p>
+
+ <h2 id='crawlers'><a href='#crawlers'>Crawlers</a></h2>
+
+ <p>
+   Before resorting to crawling crates.io, you should first see if you are able to
+   gather the information you need from the <a href='https://github.com/rust-lang/crates.io-index'>
+   crates.io index</a>, which is a public git repository containing the majority
+   of the information available through our API.
+
+   If the index does not have the information you need, we're also happy to
+   discuss solutions to your needs that don't require you to crawl the registry.
+   You can email us at <a href="mailto:[email protected]">[email protected]</a>.
+
+   We allow our API and website to be crawled by commercial crawlers such as
+   GoogleBot. At our discretion, we may choose to allow access to experimental
+   crawlers, as long as they limit their request rate to 1 request per second or
+   less.
+
+   We also require all crawlers to provide a user-agent header that allows us to
+   uniquely identify your bot. This allows us to more accurately monitor any
+   impact your bot may have on our service. Providing a user agent that only
+   identifies your HTTP client library (such as "request/0.9.1") increases the
+   likelihood that we will block your traffic.
+
+   It is recommended, but not required, to include contact information in your user
+   agent. This allows us to contact you if we would like a change in your bot's
+   behavior without having to block your traffic.
+
+   Bad:
+     User-Agent: reqwest/0.9.1
+
+   Better:
+     User-Agent: my_bot
+
+   Best:
+     User-Agent: my_bot (my_bot.com/info)
+     User-Agent: my_bot (help@my_bot.com)
+
+   We reserve the right to block traffic from any bot that we determine to be in
+   violation of this policy or causing an impact on the integrity of our service.
+ </p>
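The policy above asks crawlers to stay at or below 1 request per second. A minimal client-side throttle can enforce that interval between calls; the sketch below uses only the Rust standard library, and the commented-out `fetch` line is a hypothetical placeholder for whatever HTTP client you actually use (it is not part of the policy or the crates.io codebase):

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Enforces a minimum interval between successive requests.
struct Throttle {
    min_interval: Duration,
    last_request: Option<Instant>,
}

impl Throttle {
    fn new(min_interval: Duration) -> Self {
        Throttle { min_interval, last_request: None }
    }

    /// Blocks until at least `min_interval` has passed since the last call.
    fn wait(&mut self) {
        if let Some(last) = self.last_request {
            let elapsed = last.elapsed();
            if elapsed < self.min_interval {
                sleep(self.min_interval - elapsed);
            }
        }
        self.last_request = Some(Instant::now());
    }
}

fn main() {
    // 1 request per second, per the crawler policy.
    let mut throttle = Throttle::new(Duration::from_secs(1));
    let start = Instant::now();
    for _ in 0..3 {
        throttle.wait();
        // fetch("https://crates.io/api/v1/...") would go here.
    }
    // Three calls must span at least two full intervals.
    assert!(start.elapsed() >= Duration::from_secs(2));
    println!("throttled 3 requests over {:?}", start.elapsed());
}
```

Sleeping before each request is the simplest approach; a token-bucket limiter would also satisfy the policy while allowing short bursts, as long as the average rate stays at or below 1 RPS.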

src/middleware/block_ips.rs

Lines changed: 3 additions & 1 deletion
@@ -36,8 +36,10 @@ impl Handler for BlockIps {
     if has_blocked_ip {
         let body = format!(
             "We are unable to process your request at this time. \
+             This usually means that you are in violation of our crawler \
+             policy (https://crates.io/policies#crawlers). \
              Please open an issue at https://github.com/rust-lang/crates.io \
-             or email [email protected] \
+             or email [email protected] \
              and provide the request id {}",
             req.headers().find("X-Request-Id").unwrap()[0]
         );
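The policy warns that a user agent which only names a generic HTTP client library (like "reqwest/0.9.1") is likely to be blocked. A heuristic for spotting such agents could look like the sketch below; this is an illustrative example, not crates.io's actual blocking logic, and the prefix list is an assumption chosen to match the policy's examples:

```rust
/// Heuristic sketch (not crates.io's real implementation): flags user agents
/// that only identify a generic HTTP client library rather than a bot.
fn looks_like_bare_library(user_agent: &str) -> bool {
    const LIBRARY_PREFIXES: &[&str] = &[
        "reqwest/",
        "request/",
        "curl/",
        "python-requests/",
        "Go-http-client/",
    ];
    LIBRARY_PREFIXES
        .iter()
        .any(|prefix| user_agent.starts_with(prefix))
}

fn main() {
    // The policy's "Bad" example trips the heuristic...
    assert!(looks_like_bare_library("reqwest/0.9.1"));
    // ...while its "Best" examples, which identify the bot, do not.
    assert!(!looks_like_bare_library("my_bot (my_bot.com/info)"));
    assert!(!looks_like_bare_library("my_bot (help@my_bot.com)"));
    println!("ok");
}
```

A real service would likely combine a check like this with rate and IP signals rather than blocking on the user agent alone; the policy reserves blocking for bots that impact the integrity of the service.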
