
Commit 005b010

Merge #1535

1535: Publish an official crawler policy r=sgrif a=sgrif

This is a formalization of a policy that we've been informally enforcing for some time now. The policy boils down to:

- Just use the index if you can.
- If you can't, contact us to see if we can help in a way that doesn't require crawling.
- If you do crawl, limit yourself to 1 request per second.
- You must also provide a user agent, which should actually identify your crawler and have contact information.
- We may still block you if you cause an impact on the integrity of the service.

I chose not to explicitly call out inflating a single crate's download numbers as something that's forbidden, as it felt like doing that would be an instance of ["Don't stuff beans up your nose"](https://en.wikipedia.org/wiki/Wikipedia:Don%27t_stuff_beans_up_your_nose). That behavior falls under the umbrella of "impacting the integrity of the service", though, so this policy does give us an explicit reason to block anyone engaging in it.

Co-authored-by: Sean Griffin <[email protected]>
2 parents b63a985 + a35fea9 commit 005b010

File tree

2 files changed (+44, -1 lines)


app/templates/policies.hbs

Lines changed: 41 additions & 0 deletions
@@ -101,3 +101,44 @@ more details.
  <p>
    Thank you for taking the time to responsibly disclose any issues you find.
  </p>
+
+ <h2 id='crawlers'><a href='#crawlers'>Crawlers</a></h2>
+
+ <p>
+   Before resorting to crawling crates.io, you should first see if you are able to
+   gather the information you need from the <a href='https://github.com/rust-lang/crates.io-index'>
+   crates.io index</a>, which is a public git repository containing the majority
+   of the information available through our API.
+
+   If the index does not have the information you need, we're also happy to
+   discuss solutions to your needs that don't require you to crawl the registry.
+   You can email us at <a href="mailto:[email protected]">[email protected]</a>.
+
+   We allow our API and website to be crawled by commercial crawlers such as
+   GoogleBot. At our discretion, we may choose to allow access to experimental
+   crawlers, as long as they limit their request rate to 1 request per second or
+   less.
+
+   We also require all crawlers to provide a user-agent header that allows us to
+   uniquely identify your bot. This allows us to more accurately monitor any
+   impact your bot may have on our service. Providing a user agent that only
+   identifies your HTTP client library (such as "request/0.9.1") increases the
+   likelihood that we will block your traffic.
+
+   It is recommended, but not required, to include contact information in your user
+   agent. This allows us to contact you if we would like a change in your bot's
+   behavior without having to block your traffic.
+
+   Bad:
+     User-Agent: reqwest/0.9.1
+
+   Better:
+     User-Agent: my_bot
+
+   Best:
+     User-Agent: my_bot (my_bot.com/info)
+     User-Agent: my_bot (help@my_bot.com)
+
+   We reserve the right to block traffic from any bot that we determine to be in
+   violation of this policy or causing an impact on the integrity of our service.
+ </p>
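The policy above asks crawlers to stay at or below 1 request per second. A minimal client-side throttle can enforce that interval between calls; the sketch below uses only the Rust standard library, and the commented-out `fetch` line is a hypothetical placeholder for whatever HTTP client you actually use (it is not part of the policy or the crates.io codebase):

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Enforces a minimum interval between successive requests.
struct Throttle {
    min_interval: Duration,
    last_request: Option<Instant>,
}

impl Throttle {
    fn new(min_interval: Duration) -> Self {
        Throttle { min_interval, last_request: None }
    }

    /// Blocks until at least `min_interval` has passed since the last call.
    fn wait(&mut self) {
        if let Some(last) = self.last_request {
            let elapsed = last.elapsed();
            if elapsed < self.min_interval {
                sleep(self.min_interval - elapsed);
            }
        }
        self.last_request = Some(Instant::now());
    }
}

fn main() {
    // 1 request per second, per the crawler policy.
    let mut throttle = Throttle::new(Duration::from_secs(1));
    let start = Instant::now();
    for _ in 0..3 {
        throttle.wait();
        // fetch("https://crates.io/api/v1/...") would go here.
    }
    // Three calls must span at least two full intervals.
    assert!(start.elapsed() >= Duration::from_secs(2));
    println!("throttled 3 requests over {:?}", start.elapsed());
}
```

Sleeping before each request is the simplest approach; a token-bucket limiter would also satisfy the policy while allowing short bursts, as long as the average rate stays at or below 1 RPS.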

src/middleware/block_ips.rs

Lines changed: 3 additions & 1 deletion
@@ -36,8 +36,10 @@ impl Handler for BlockIps {
     if has_blocked_ip {
         let body = format!(
             "We are unable to process your request at this time. \
+             This usually means that you are in violation of our crawler \
+             policy (https://crates.io/policies#crawlers). \
              Please open an issue at https://github.com/rust-lang/crates.io \
-             or email [email protected] \
+             or email [email protected] \
              and provide the request id {}",
             req.headers().find("X-Request-Id").unwrap()[0]
         );
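The policy warns that a user agent which only names a generic HTTP client library (like "reqwest/0.9.1") is likely to be blocked. A heuristic for spotting such agents could look like the sketch below; this is an illustrative example, not crates.io's actual blocking logic, and the prefix list is an assumption chosen to match the policy's examples:

```rust
/// Heuristic sketch (not crates.io's real implementation): flags user agents
/// that only identify a generic HTTP client library rather than a bot.
fn looks_like_bare_library(user_agent: &str) -> bool {
    const LIBRARY_PREFIXES: &[&str] = &[
        "reqwest/",
        "request/",
        "curl/",
        "python-requests/",
        "Go-http-client/",
    ];
    LIBRARY_PREFIXES
        .iter()
        .any(|prefix| user_agent.starts_with(prefix))
}

fn main() {
    // The policy's "Bad" example trips the heuristic...
    assert!(looks_like_bare_library("reqwest/0.9.1"));
    // ...while its "Best" examples, which identify the bot, do not.
    assert!(!looks_like_bare_library("my_bot (my_bot.com/info)"));
    assert!(!looks_like_bare_library("my_bot (help@my_bot.com)"));
    println!("ok");
}
```

A real service would likely combine a check like this with rate and IP signals rather than blocking on the user agent alone; the policy reserves blocking for bots that impact the integrity of the service.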
