Commit 88b53bd

Auto merge of #2479 - jtgeibel:balance-db-pool-usage, r=pietroalbini
Add middleware to prioritize download traffic

In recent months we've had several incidents where bot traffic has sent hundreds of expensive requests per minute, starving database resources and resulting in timeouts on download requests. While cargo will retry download requests, builds are sometimes still affected. For instance, here is an example from our own CI: https://github.com/rust-lang/crates.io/runs/631489355.

This new middleware layer rejects some requests as the database pool approaches capacity. At 20% load, the `in_flight_requests` count is added to the log output. At 70% load, all safe requests (`GET`, `HEAD`, `OPTIONS`, `TRACE`) are rejected immediately. This will reject many legitimate frontend requests as well, but should catch all bot traffic (which is unlikely to send `PUT`, `POST`, or `DELETE` requests). This filter also helps avoid rejecting frontend requests that update the database, where we don't always provide good feedback for errors in the UI. Finally, at 80% load all non-download traffic is rejected.

In other words, at least 20% of database connections are reserved for handling download traffic. By choosing to drop other requests, there should be sufficient database connections available to avoid queuing and timeouts on download requests.

There is some overlap with the `LogConnectionPoolStatus` middleware. These may eventually be consolidated to avoid duplicate work and to make smarter decisions based on the instantaneous spare capacity of individual pools (primary vs read-only replica).

The current heuristics are very simple, but I believe they are sufficient to meet our current needs with a large margin for growth in traffic. The existing middleware does give us some insight into our current `in_flight_requests` counts on production. Looking through the logs, it is rare to have more than a few requests running at the same time, because download requests complete very quickly and other API traffic accounts for only about 10 requests per second.

This middleware layer is added as the last layer in the stack. Requests that are served (such as static `ember` HTML) or blocked by earlier layers will not be processed by this middleware, because they do not use a database connection and should not block server threads for long.

r? @pietroalbini
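A minimal sketch of the policy described above, with the threshold percentages hardcoded for illustration only; the function name and structure here are hypothetical, and the actual implementation (with configurable percentages) is in src/middleware/balance_capacity.rs further down in this diff:

// Hypothetical sketch of the commit's rejection policy; not part of the diff.
fn should_reject(load_pct: usize, is_download: bool, is_safe_method: bool) -> bool {
    if is_download {
        return false; // download requests are always served
    }
    if load_pct >= 70 && is_safe_method {
        return true; // GET/HEAD/OPTIONS/TRACE rejected first; bots rarely send anything else
    }
    if load_pct >= 80 {
        return true; // above 80% load, only download traffic is allowed through
    }
    false
}

fn main() {
    // With a pool of 20 connections, 15 in-flight requests put us at 75% load:
    let load_pct = 100 * 15 / 20;
    assert!(should_reject(load_pct, false, true));   // a bot's GET is rejected
    assert!(!should_reject(load_pct, false, false)); // a user's PUT still goes through
    assert!(!should_reject(load_pct, true, true));   // downloads are never rejected
}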
2 parents d8d74e6 + 3a8bb56 commit 88b53bd

5 files changed: +146, −6 lines

src/app.rs

Lines changed: 2 additions & 2 deletions
@@ -44,7 +44,7 @@ impl App {
     /// - GitHub OAuth
     /// - Database connection pools
     /// - A `git2::Repository` instance from the index repo checkout (that server.rs ensures exists)
-    pub fn new(config: &Config, http_client: Option<Client>) -> App {
+    pub fn new(config: Config, http_client: Option<Client>) -> App {
         use oauth2::prelude::*;
         use oauth2::{AuthUrl, ClientId, ClientSecret, Scope, TokenUrl};
         use url::Url;
@@ -126,7 +126,7 @@ impl App {
             read_only_replica_database,
             github,
             session_key: config.session_key.clone(),
-            config: config.clone(),
+            config,
             http_client,
         }
     }

src/bin/server.rs

Lines changed: 1 addition & 1 deletion
@@ -27,7 +27,7 @@ fn main() -> Result<(), Box<dyn std::error::Error>> {
     let config = cargo_registry::Config::default();
     let client = Client::new();

-    let app = App::new(&config, Some(client));
+    let app = App::new(config.clone(), Some(client));
     let app = cargo_registry::build_handler(Arc::new(app));

     // On every server restart, ensure the categories available in the database match

src/middleware.rs

Lines changed: 27 additions & 2 deletions
@@ -12,6 +12,7 @@ use self::log_connection_pool_status::LogConnectionPoolStatus;
 use self::static_or_continue::StaticOrContinue;

 pub mod app;
+mod balance_capacity;
 mod block_traffic;
 pub mod current_user;
 mod debug;
@@ -46,8 +47,6 @@ pub fn build_middleware(app: Arc<App>, endpoints: R404) -> MiddlewareBuilder {
     if env == Env::Development {
         // Print a log for each request.
         m.add(Debug);
-        // Locally serve crates and readmes
-        m.around(StaticOrContinue::new("local_uploads"));
     }

     if env::var_os("DEBUG_REQUESTS").is_some() {
@@ -74,13 +73,39 @@ pub fn build_middleware(app: Arc<App>, endpoints: R404) -> MiddlewareBuilder {

     // Note: The following `m.around()` middleware is run from bottom to top

+    // This is currently the final middleware to run. If a middleware layer requires a database
+    // connection, it should be run after this middleware so that the potential pool usage can be
+    // tracked here.
+    //
+    // In production we currently have 2 equally sized pools (primary and a read-only replica).
+    // Because such a large portion of production traffic is for download requests (which update
+    // download counts), we consider only the primary pool here.
+    if let Ok(capacity) = env::var("DB_POOL_SIZE") {
+        if let Ok(capacity) = capacity.parse() {
+            if capacity >= 10 {
+                println!(
+                    "Enabling BalanceCapacity middleware with {} pool capacity",
+                    capacity
+                );
+                m.around(balance_capacity::BalanceCapacity::new(capacity))
+            } else {
+                println!("BalanceCapacity middleware not enabled. DB_POOL_SIZE is too low.");
+            }
+        }
+    }
+
     // Serve the static files in the *dist* directory, which are the frontend assets.
     // Not needed for the backend tests.
     if env != Env::Test {
         m.around(EmberHtml::new("dist"));
         m.around(StaticOrContinue::new("dist"));
     }

+    if env == Env::Development {
+        // Locally serve crates and readmes
+        m.around(StaticOrContinue::new("local_uploads"));
+    }
+
     m.around(Head::default());

     for (header, blocked_values) in config.blocked_traffic {
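For reference, here is a standalone sketch of the DB_POOL_SIZE gate added above; this helper is not code from the commit and its name is made up. The middleware is enabled only when the variable is set, parses as an integer, and is at least 10:

// Hypothetical helper mirroring the gate in build_middleware above; not part of this commit.
fn balance_capacity_pool_size(db_pool_size: Option<&str>) -> Option<usize> {
    db_pool_size?
        .parse::<usize>()
        .ok()
        .filter(|&capacity| capacity >= 10)
}

fn main() {
    assert_eq!(balance_capacity_pool_size(None), None);           // unset: middleware disabled
    assert_eq!(balance_capacity_pool_size(Some("5")), None);      // below 10: disabled (the real code prints a notice)
    assert_eq!(balance_capacity_pool_size(Some("30")), Some(30)); // enabled with capacity 30
}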

src/middleware/balance_capacity.rs

Lines changed: 115 additions & 0 deletions
@@ -0,0 +1,115 @@
+//! Reject certain requests as instance load reaches capacity.
+//!
+//! The primary goal of this middleware is to avoid starving the download endpoint of resources.
+//! When bots send many parallel requests that run slow database queries, download requests may
+//! block and eventually timeout waiting for a database connection.
+//!
+//! Bots must continue to respect our crawler policy, but until we can manually block bad clients
+//! we should avoid dropping download requests even if that means rejecting some legitimate
+//! requests to other endpoints.
+
+use std::sync::atomic::{AtomicUsize, Ordering};
+
+use super::prelude::*;
+use conduit::{RequestExt, StatusCode};
+
+#[derive(Default)]
+pub(super) struct BalanceCapacity {
+    handler: Option<Box<dyn Handler>>,
+    capacity: usize,
+    in_flight_requests: AtomicUsize,
+    log_at_percentage: usize,
+    throttle_at_percentage: usize,
+    dl_only_at_percentage: usize,
+}
+
+impl BalanceCapacity {
+    pub fn new(capacity: usize) -> Self {
+        Self {
+            handler: None,
+            capacity,
+            in_flight_requests: AtomicUsize::new(0),
+            log_at_percentage: read_env_percentage("WEB_CAPACITY_LOG_PCT", 20),
+            throttle_at_percentage: read_env_percentage("WEB_CAPACITY_THROTTLE_PCT", 70),
+            dl_only_at_percentage: read_env_percentage("WEB_CAPACITY_DL_ONLY_PCT", 80),
+        }
+    }
+}
+
+impl AroundMiddleware for BalanceCapacity {
+    fn with_handler(&mut self, handler: Box<dyn Handler>) {
+        self.handler = Some(handler);
+    }
+}
+
+impl Handler for BalanceCapacity {
+    fn call(&self, request: &mut dyn RequestExt) -> AfterResult {
+        // The _drop_on_exit ensures the counter is decremented for all exit paths (including panics)
+        let (_drop_on_exit, count) = RequestCounter::add_one(&self.in_flight_requests);
+        let handler = self.handler.as_ref().unwrap();
+        let load = 100 * count / self.capacity;
+
+        // Begin logging request count so early stages of load increase can be located
+        if load >= self.log_at_percentage {
+            super::log_request::add_custom_metadata(request, "in_flight_requests", count);
+        }
+
+        // Download requests are always accepted
+        if request.path().starts_with("/api/v1/crates/") && request.path().ends_with("/download") {
+            return handler.call(request);
+        }
+
+        // Reject read-only requests as load nears capacity. Bots are likely to send only safe
+        // requests and this helps prioritize requests that users may be reluctant to retry.
+        if load >= self.throttle_at_percentage && request.method().is_safe() {
+            return over_capacity_response(request);
+        }
+
+        // As load reaches capacity, all non-download requests are rejected
+        if load >= self.dl_only_at_percentage {
+            return over_capacity_response(request);
+        }
+
+        handler.call(request)
+    }
+}
+
+fn over_capacity_response(request: &mut dyn RequestExt) -> AfterResult {
+    // TODO: Generate an alert so we can investigate
+    super::log_request::add_custom_metadata(request, "cause", "over capacity");
+    let body = "Service temporarily unavailable";
+    Response::builder()
+        .status(StatusCode::SERVICE_UNAVAILABLE)
+        .header(header::CONTENT_LENGTH, body.len())
+        .body(Body::from_static(body.as_bytes()))
+        .map_err(box_error)
+}
+
+fn read_env_percentage(name: &str, default: usize) -> usize {
+    if let Ok(value) = std::env::var(name) {
+        value.parse().unwrap_or(default)
+    } else {
+        default
+    }
+}
+
+// FIXME(JTG): I've copied the following from my `conduit-hyper` crate. Once we transition from
+// `civet`, we could pass the in_flight_request count from `conduit-hyper` via a request extension.
+
+/// A struct that stores a reference to an atomic counter so it can be decremented when dropped
+struct RequestCounter<'a> {
+    counter: &'a AtomicUsize,
+}
+
+impl<'a> RequestCounter<'a> {
+    fn add_one(counter: &'a AtomicUsize) -> (Self, usize) {
+        let previous = counter.fetch_add(1, Ordering::SeqCst);
+        (Self { counter }, previous + 1)
+    }
+}
+
+impl<'a> Drop for RequestCounter<'a> {
+    fn drop(&mut self) {
+        self.counter.fetch_sub(1, Ordering::SeqCst);
+    }
+}

src/tests/all.rs

Lines changed: 1 addition & 1 deletion
@@ -160,7 +160,7 @@ fn build_app(
         None
     };

-    let app = App::new(&config, client);
+    let app = App::new(config, client);
     t!(t!(app.primary_database.get()).begin_test_transaction());
     let app = Arc::new(app);
    let handler = cargo_registry::build_handler(Arc::clone(&app));
