Performance Internals
Dwaar is built to serve high-concurrency workloads with predictable tail latency. This page explains the implementation choices that make that possible, how to measure them, and how to go further with a PGO build.
Design Principles
Section titled “Design Principles”Every hot-path decision in Dwaar is justified by at least one of four pillars: Performance, Reliability, Security, or Competitive Parity. The table below covers the choices that directly affect throughput and latency.
| Technique | Crate / Module | Why it matters |
|---|---|---|
| jemalloc allocator | tikv-jemallocator, crates/dwaar-cli/src/main.rs | Eliminates heap fragmentation from per-request string alloc/free churn. Removes allocator lock contention that causes tail-latency spikes under high concurrency. Replaces the system allocator globally via #[global_allocator]. |
| ArcSwap lock-free reads | arc-swap, route table, config | Hot-reload swaps the entire route table atomically. Readers never block — a load takes ~1 ns and never contends with a concurrent reload. |
| CompactString | compact_str, RequestLog, context fields | Strings ≤24 bytes (method, status, short paths) are stored inline on the stack — no heap allocation, no pointer indirection. Reduces allocator pressure on the hot path. |
| sonic-rs JSON serialization | sonic-rs, dwaar-log | SIMD-accelerated JSON serialization. RequestLog serialization target is <1 µs per entry (see bench in crates/dwaar-log/benches/request_log.rs). |
| Bytes zero-copy | bytes, response body handling | Response bodies use Bytes reference-counted slices. Forwarding a response body from upstream to downstream copies zero bytes — only the reference counter increments. |
| Connection-owned buffers | BufferedConn, crates/dwaar-core/src/quic/pool.rs | Read buffers live with the pooled connection, not the request. Eliminates per-request 64 KB allocations in the H3 streaming path. Buffer starts at 8 KB and grows to 64 KB on demand. Zero-copy via BytesMut::split_to().freeze(). |
| H2 upstream multiplexing | H2ConnPool, crates/dwaar-core/src/quic/h2_pool.rs | When transport h2 is configured, H3 streams multiplex onto 1-2 shared H2 connections per upstream host. Reduces 5,000 TCP connections to ~20-40. GOAWAY-aware with automatic retry for idempotent methods. |
| Batch log writing | dwaar-log, BatchWriter | Log entries are queued in a channel and flushed in batches of up to 200 entries per syscall. Removes one write() syscall per request from the hot path. |
| Pingora work-stealing scheduler | pingora-core | Pingora uses Tokio with a work-stealing thread pool. Idle threads steal tasks from busy threads, keeping all cores saturated without manual thread pinning. |
| Multi-worker fork | crates/dwaar-cli/src/main.rs | dwaar run --workers N forks N worker processes before Pingora initializes. Each worker owns its own Tokio runtime and memory space. CPU-bound work scales linearly with cores. |
| Single ArcSwap load per request | RequestContext, crates/dwaar-core/src/context.rs | RequestContext loads the route table once at the start of request_filter and caches all derived values. Downstream callbacks read from RequestContext — no second ArcSwap load. |
Memory Profile
Section titled “Memory Profile”Dwaar’s memory footprint is small and predictable. It does not grow with traffic volume — only with configuration complexity.
| Configuration | Approximate RSS |
|---|---|
| Single site, analytics off | ~5 MB |
| 100 domains, analytics on | ~6 MB |
| Per active downstream connection (Pingora H1/H2) | ~135 KB |
| Per active upstream connection (H1, pooled) | ~8 KB (connection-owned buffer) |
| Per active upstream connection (H2, multiplexed) | ~150 KB (shared across all streams) |
| Per H3 request stream overhead | ~1 KB (no per-request buffer alloc) |
HTTP cache (depends on max_size setting) | bounded by config |
The dominant allocations at steady state are:
- Route table — one
Arc<RouteTable>per active generation (two during hot-reload). Each compiled route holds pre-compiled regex matchers. - Connection buffers — Pingora allocates read/write buffers per connection. Connections are released back to the OS on close.
- Log channel — bounded MPSC channel holds at most ~200 log entries between flushes (~200 KB peak).
- Analytics aggregations — HyperLogLog sketches and TDigest accumulators are fixed-size per domain (~4 KB per domain).
If RSS grows continuously over hours, the likely cause is an unbounded HTTP cache. Set an explicit limit:
cache { max_size 512mb}Benchmarks
Section titled “Benchmarks”The benchmarks/docker/ directory contains a fully reproducible Dwaar-vs-nginx benchmark.
Three Docker containers run on the same host, each limited to 2 CPU cores and 256 MB RAM:
| Container | Role | Port |
|---|---|---|
backend | nginx serving a static 200 OK | 9090 |
dwaar | Dwaar proxying to backend | 6188 |
nginx-proxy | nginx proxying to backend | 8080 |
The benchmark tool is wrk.
Running
Section titled “Running”cd benchmarks/docker./run-benchmark.shThe script runs three connection levels (100, 500, 1000) against each target for 30 seconds each and prints a comparison table:
Proxy Conns RPS p50 lat p99 lat Avg lat Transfer Mem(MB)────────── ────── ────────── ──────── ──────── ──────── ────────── ──────direct 100 ... ... ... ... ... n/aDWAAR 100 ... ... ... ... ... ...NGINX 100 ... ... ... ... ... ...To customize duration, threads, or connection counts:
BENCH_DURATION=60s BENCH_THREADS=8 BENCH_CONNECTIONS="100 1000 5000" ./run-benchmark.shWhat the numbers mean
Section titled “What the numbers mean”- RPS — requests per second. Dwaar targets >100K RPS on a single core for simple proxy workloads.
- p99 latency — Dwaar’s proxy overhead target is <1 ms added to upstream latency at p99.
- Memory — Dwaar’s base footprint should be lower than nginx under equivalent load because jemalloc eliminates fragmentation and the route table is a single ArcSwap read.
Profiling
Section titled “Profiling”CPU flamegraph (Linux)
Section titled “CPU flamegraph (Linux)”# Record with perfsudo perf record -F 99 -p $(pgrep dwaar) -g -- sleep 30sudo perf script | stackcollapse-perf.pl | flamegraph.pl > dwaar.svgOn macOS, use cargo flamegraph:
cargo install flamegraphsudo cargo flamegraph --bin dwaar -- run --config DwaarfileOpen flamegraph.svg in a browser. Look for wide frames in:
DwaarProxy::request_filter— route lookup and plugin chainDwaarProxy::logging— JSON serializationBatchWriter::flush— log I/O
Criterion microbenchmarks
Section titled “Criterion microbenchmarks”# Run all benchmarks and open the HTML reportcargo benchopen target/criterion/report/index.html
# Run only the log serialization benchcargo bench -p dwaar-logCriterion benches live in:
crates/dwaar-log/benches/request_log.rs—RequestLogJSON serialization (<1 µs target)
jemalloc allocation stats
Section titled “jemalloc allocation stats”To see allocator stats at runtime without restarting:
MALLOC_CONF=stats_print:true dwaar run --config Dwaarfile 2>&1 | grep -A 40 "jemalloc stats"This prints arena utilization, fragmentation ratio, and per-size-class counts. A fragmentation ratio above 1.3 is worth investigating.
PGO Build
Section titled “PGO Build”Profile-Guided Optimization feeds real traffic patterns back to LLVM so it can inline hot functions, reorder basic blocks, and eliminate cold-path checks from the fast path. Typical gain is 5–15% throughput improvement.
Running the PGO pipeline
Section titled “Running the PGO pipeline”./scripts/pgo-build.shThe script runs four steps automatically:
Step 1 — Instrumented build (RUSTFLAGS=-Cprofile-generate=...)Step 2 — Profile collection (30s of real or simulated traffic)Step 3 — Merge profiles (llvm-profdata merge)Step 4 — PGO-optimized build (RUSTFLAGS=-Cprofile-use=...)Using a custom workload
Section titled “Using a custom workload”The default workload is a simple curl loop. For a more representative profile, point the script at your stress suite:
PGO_WORKLOAD=/path/to/my-workload.sh PGO_DURATION=60 ./scripts/pgo-build.shThe workload script receives STRESS_TARGET=host:port in its environment.
Environment variables
Section titled “Environment variables”| Variable | Default | Description |
|---|---|---|
PROFILE_DIR | target/pgo-profiles | Where .profraw files are stored |
PGO_DURATION | 30 | Seconds to run the profiling workload |
PGO_WORKLOAD | — | Path to a custom workload script |
DWAAR_CONFIG | auto-generated | Dwaarfile to use during profiling |
Prerequisites
Section titled “Prerequisites”rustup component add llvm-tools # provides llvm-profdataOn macOS, llvm-profdata can also be found via xcrun. The script detects all three locations automatically.
Related
Section titled “Related”- Compression — gzip / brotli / zstd reduce bytes transferred, shifting the bottleneck from network to CPU.
- Caching —
pingora-cacheeliminates upstream round-trips for cacheable responses. - HTTP/3 — QUIC reduces connection setup latency and eliminates head-of-line blocking.
- Timeouts — Proper timeout configuration prevents slow upstreams from exhausting connection pool slots.