Apr 3, 2026 · 5 min read

Practical Allocation Hunting in Rust: Patterns That Actually Matter

Real before-and-after patterns for eliminating unnecessary heap allocations in production Rust, from ByteCounter to LazyLock.


Rust gives you control over allocations. It does not give you awareness of them.

A .to_string() here, a .clone() there, a Vec::contains that silently allocates — these don't show up in code review because they look normal. They add up until your hot path is spending more time in the allocator than in your actual logic.

I recently audited my AI workflow engine and found twelve distinct allocation patterns worth eliminating. Here are the ones that mattered most.

ByteCounter: Measure Without Materializing

The workflow engine tracks aggregate output size across all nodes. The original approach serialized each output to a String, measured .len(), and immediately threw the string away.

let json_str = serde_json::to_string(&output)?;
*output_bytes += json_str.len();

That's a full heap allocation and serialization pass just to count bytes. The fix: a zero-allocation io::Write that absorbs bytes without storing them.

struct ByteCounter(usize);

impl std::io::Write for ByteCounter {
    fn write(&mut self, buf: &[u8]) -> std::io::Result<usize> {
        self.0 += buf.len();
        Ok(buf.len())
    }
    fn flush(&mut self) -> std::io::Result<()> {
        Ok(())
    }
}

let mut counter = ByteCounter(0);
serde_json::to_writer(&mut counter, &output)?;
*output_bytes += counter.0;

serde_json::to_writer streams through io::Write. The ByteCounter sums buf.len() for each chunk and returns success.

Same byte count, zero heap allocation. In a workflow with 50 nodes, that's 50 unnecessary String allocations eliminated per execution.
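The counter works with any producer that streams through io::Write. A self-contained, std-only demo, using write! in place of serde_json::to_writer:

```rust
use std::io::Write;

// Sums chunk lengths without storing any bytes.
struct ByteCounter(usize);

impl Write for ByteCounter {
    fn write(&mut self, buf: &[u8]) -> std::io::Result<usize> {
        self.0 += buf.len();
        Ok(buf.len())
    }
    fn flush(&mut self) -> std::io::Result<()> {
        Ok(())
    }
}

fn main() {
    let mut counter = ByteCounter(0);
    // write! streams formatted bytes through io::Write, the same way
    // serde_json::to_writer streams serialized output.
    write!(counter, "{{\"id\":{}}}", 42).unwrap();
    assert_eq!(counter.0, r#"{"id":42}"#.len()); // 9 bytes counted, zero stored
}
```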

LazyLock: Initialize Once, Use Forever

Five separate hot-path items were being reconstructed on every call.

Template compilation. The Tera template engine was recompiling templates on every render via Tera::one_off. Templates in loops or across executions hit the same compile step repeatedly.

// Before: recompile on every call
Tera::one_off(template, &ctx, false)

// After: compile once, cache forever
static TEMPLATE_CACHE: LazyLock<Mutex<HashMap<String, Tera>>> =
    LazyLock::new(|| Mutex::new(HashMap::new()));

let mut cache = TEMPLATE_CACHE.lock().unwrap_or_else(|e| e.into_inner());
let tera = cache.entry(template.to_string()).or_insert_with(|| {
    let mut t = Tera::default();
    let _ = t.add_raw_template("__tpl", template);
    t
});
tera.render("__tpl", &ctx).unwrap_or_else(|_| template.to_string())

The Mutex contention is near-zero because different nodes use different templates. A size cap (1000 entries) with simple eviction prevents unbounded growth.
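The size cap can be sketched in std alone. This is a hypothetical version of the eviction logic, with String standing in for the compiled Tera instance:

```rust
use std::collections::HashMap;
use std::sync::{LazyLock, Mutex};

const CACHE_CAP: usize = 1000;

static CACHE: LazyLock<Mutex<HashMap<String, String>>> =
    LazyLock::new(|| Mutex::new(HashMap::new()));

fn cache_insert(template: &str, compiled: String) {
    let mut cache = CACHE.lock().unwrap_or_else(|e| e.into_inner());
    // "Simple eviction": when full, drop one arbitrary entry. HashMap
    // iteration order is unspecified, so this is a bound, not LRU.
    if cache.len() >= CACHE_CAP && !cache.contains_key(template) {
        if let Some(victim) = cache.keys().next().cloned() {
            cache.remove(&victim);
        }
    }
    cache.insert(template.to_string(), compiled);
}

fn main() {
    for i in 0..1005 {
        cache_insert(&format!("template-{i}"), String::from("compiled"));
    }
    // The cap holds no matter how many distinct templates were seen.
    assert!(CACHE.lock().unwrap().len() <= CACHE_CAP);
}
```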

SSRF guard and HTTP client. SsrfGuard::strict() allocates a blocked-hosts list (8 strings) on construction. It was called per action node execution.

static STRICT_SSRF_GUARD: LazyLock<SsrfGuard> = LazyLock::new(SsrfGuard::strict);

static SHARED_HTTP_CLIENT: LazyLock<reqwest::Client> = LazyLock::new(|| {
    reqwest::Client::builder()
        .timeout(std::time::Duration::from_secs(30))
        .redirect(reqwest::redirect::Policy::none())
        .build()
        .unwrap_or_else(|_| reqwest::Client::new())
});

Redis Lua scripts. redis::Script::new computes a SHA1 hash of the Lua source. Three scripts were being re-hashed on every call.

static RATE_LIMIT_SCRIPT: LazyLock<redis::Script> = LazyLock::new(|| {
    redis::Script::new(r"
        local count = redis.call('INCR', KEYS[1])
        if count == 1 then redis.call('EXPIRE', KEYS[1], ARGV[1]) end
        return {count, redis.call('TTL', KEYS[1])}
    ")
});

The pattern is the same in every case: if the data is constant or process-scoped, LazyLock it.
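Reduced to std, the shape looks like this — a hypothetical lookup table stands in for the guard, client, and scripts:

```rust
use std::collections::HashMap;
use std::sync::LazyLock;

// Any constant, expensive-to-build value becomes a process-wide static.
static MIME_TYPES: LazyLock<HashMap<&'static str, &'static str>> =
    LazyLock::new(|| {
        HashMap::from([
            ("json", "application/json"),
            ("html", "text/html"),
            ("png", "image/png"),
        ])
    });

fn main() {
    // The first access runs the closure; every later access is a cheap deref.
    assert_eq!(MIME_TYPES.get("json"), Some(&"application/json"));
    assert_eq!(MIME_TYPES.len(), 3);
}
```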

Vec::contains vs iter().any(): The Hidden Allocation

This one is subtle. Vec<String>::contains takes a &String. If all you have is a &str, the path of least resistance is a to_string() to satisfy the type — an allocation on every check.

// Before: allocates a String for every check
if self.denied_tools.contains(&tool_name.to_string()) { ... }
if self.allowed_tools.contains(&"*".to_string()) { ... }

// After: zero allocation, uses PartialEq<&str>
if self.denied_tools.iter().any(|s| s == tool_name) { ... }
if self.allowed_tools.iter().any(|s| s == "*") { ... }

String implements PartialEq<&str>, so s == tool_name compares directly. This function was called on every tool invocation in the orchestrator loop — potentially dozens of times per execution.
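A std-only demonstration of the two forms, with a hypothetical denied_tools list:

```rust
// Membership check without allocating: each &String compares against the
// &str directly through String's PartialEq impls.
fn is_denied(denied_tools: &[String], tool_name: &str) -> bool {
    denied_tools.iter().any(|s| s == tool_name)
}

fn main() {
    let denied: Vec<String> = vec!["shell".into(), "eval".into()];

    // The allocating form: contains() wants &String, so the &str must be
    // promoted to an owned String first.
    let via_contains = denied.contains(&"shell".to_string());
    // The allocation-free form compares in place.
    let via_any = is_denied(&denied, "shell");

    assert_eq!(via_contains, via_any);
    assert!(!is_denied(&denied, "ls"));
}
```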

Arc<str> for Broadcast Strings

During SSE streaming, three identifier strings (workspace ID, execution ID, agent name) are cloned into every event. With String, each clone is a heap allocation plus memcpy.

let ws_workspace_id: Arc<str> = Arc::from(stream_workspace_id.as_str());
let ws_exec_id: Arc<str> = Arc::from(exec_id.as_str());

With Arc<str>, cloning is an atomic reference count increment. One allocation at stream start, pointer bumps for every event after.
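A minimal sketch of the difference, with a made-up execution id:

```rust
use std::sync::Arc;

fn main() {
    // One allocation at stream start: copy the id into a shared str.
    let exec_id = String::from("exec-7f3a");
    let shared: Arc<str> = Arc::from(exec_id.as_str());

    // Per-event "clones" are atomic refcount bumps, not heap copies.
    let event_a = Arc::clone(&shared);
    let event_b = Arc::clone(&shared);

    assert_eq!(Arc::strong_count(&shared), 3);
    assert_eq!(&*event_a, "exec-7f3a");

    drop(event_b);
    assert_eq!(Arc::strong_count(&shared), 2);
}
```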

Cow: Only Allocate When Something Changed

The redaction module runs regex replacements across tool outputs. Regex::replace_all returns Cow::Borrowed when there are zero matches — the common case. The old code called .to_string() unconditionally.

// Before: allocates even when nothing was replaced
result = re.replace_all(&result, REDACTED).to_string();

// After: only take ownership when Cow is Owned
let replaced = re.replace_all(&result, REDACTED);
if let std::borrow::Cow::Owned(s) = replaced {
    result = s;
}

Same function also had SENSITIVE_KEYS stored in mixed case, calling .to_lowercase() on every constant during comparison — 16 allocations per call. Storing constants in lowercase and comparing directly eliminated all of them.
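A sketch of that fix, with hypothetical key names (the real SENSITIVE_KEYS list is longer):

```rust
// Keys stored pre-lowercased: the constants never need to_lowercase()
// at comparison time.
const SENSITIVE_KEYS: &[&str] = &["password", "api_key", "secret", "token"];

fn is_sensitive(key: &str) -> bool {
    // One lowercase pass for the incoming key (which varies per call),
    // zero for the constants (which never change).
    let key = key.to_lowercase();
    SENSITIVE_KEYS.iter().any(|k| key == *k)
}

fn main() {
    assert!(is_sensitive("Api_Key"));
    assert!(!is_sensitive("username"));
}
```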

Derive Copy on Fieldless Enums

// Before
#[derive(Debug, Clone, Serialize, Deserialize)]
pub enum TransactionType { Deduction, Grant, Adjustment, ... }

// After
#[derive(Debug, Clone, Copy, Serialize, Deserialize)]
pub enum TransactionType { Deduction, Grant, Adjustment, ... }

Without Copy, .clone() goes through the Clone trait machinery. With Copy, the compiler copies the discriminant byte directly. Small win per call, but these enums were cloned in every credit transaction.
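The effect in miniature (serde derives omitted to keep it std-only):

```rust
// Fieldless enum: with Copy, duplication is a one-byte discriminant copy
// instead of a call through the Clone trait.
#[derive(Debug, Clone, Copy, PartialEq)]
enum TransactionType {
    Deduction,
    Grant,
    Adjustment,
}

fn main() {
    let t = TransactionType::Grant;
    let dup = t; // implicit copy; `t` stays usable afterwards
    assert_eq!(t, dup);
    // Three fieldless variants fit in a single byte.
    assert_eq!(std::mem::size_of::<TransactionType>(), 1);
}
```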

Match the Variant First

The is_unique_violation helper checked for specific error messages:

// Before: allocates for ALL error variants
let msg = err.to_string().to_lowercase();
msg.contains("unique") || msg.contains("already exists")

// After: match variant first, only allocate for Database errors
let msg = match err {
    AppError::Database(msg) => msg,
    _ => return false,
};
let msg = msg.to_lowercase();
msg.contains("unique") || msg.contains("already exists")

The old code called err.to_string() on NotFound, Unauthorized, and every other variant — allocating a String just to not find "unique" in it.

The Correctness Bonus

The allocation audit uncovered a correctness issue: casting NaN or Infinity to u64 produces silently wrong values in Rust — NaN becomes 0, Infinity becomes u64::MAX. The credit estimator had no guard:

// Before: silently wrong on NaN/Infinity
(estimated * SAFETY_MARGIN) as u64

// After: explicit guard at every credit service entry point
if !amount.is_finite() {
    return Err(AppError::BadRequest(format!("{field} must be a finite number")));
}

Added is_finite() checks at all seven credit service entry points. Not an allocation fix, but the kind of thing you only find when you read every line.
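The saturating-cast behavior and the guard are easy to see in std alone; estimate_credits here is a hypothetical stand-in for the real entry point:

```rust
// Hypothetical stand-in for the credit estimator's entry-point guard.
fn estimate_credits(estimated: f64, safety_margin: f64) -> Result<u64, String> {
    let amount = estimated * safety_margin;
    if !amount.is_finite() {
        return Err("amount must be a finite number".to_string());
    }
    Ok(amount as u64)
}

fn main() {
    // Rust's float-to-int casts saturate: NaN -> 0, Infinity -> u64::MAX.
    assert_eq!(f64::NAN as u64, 0);
    assert_eq!(f64::INFINITY as u64, u64::MAX);

    // The guard turns those silent extremes into explicit errors.
    assert_eq!(estimate_credits(100.0, 1.5).unwrap(), 150);
    assert!(estimate_credits(f64::NAN, 1.5).is_err());
    assert!(estimate_credits(f64::MAX, f64::MAX).is_err()); // overflows to inf
}
```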


The allocator doesn't bill you per call. It bills you in latency, cache misses, and heap fragmentation you never see in a profile because it's spread across everything. Hunt the allocations before they hunt your tail latency.
