Travis routes each request using live GPU state, model residency, queue depth, and node health. It prefers warm models, avoids overloaded nodes, and fails over before clients see the damage.
Traditional proxies see only requests. Inference clusters also have to care about which model is already warm, how much VRAM is left, how many jobs are in flight, and whether a node is degrading right now.
Round-robin routing sends requests to nodes that do not have the model loaded, turning every miss into a cold model load the client has to wait through.
Standard load balancers do not account for VRAM pressure, queue depth, or the real cost of one more inference request.
By the time a generic proxy notices trouble, the client has often already paid for the error with latency or an outright failure.
Travis continuously polls the cluster, computes a readiness score for each node, and sends the request to the node most likely to answer quickly and cleanly.
Background polling tracks model residency, VRAM pressure, in-flight requests, health checks, and round-trip latency.
Those signals are weighted into a single routing decision, with preference for nodes that already have the requested model warm.
If a node goes bad, Travis bypasses it mid-retry so the client is less likely to see the failure at all.
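The weighting described above can be pictured as a scoring pass over per-node state. The sketch below is illustrative only: the struct fields, weights, and function names are assumptions for this example, not Travis's actual internals.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-node state tracked by the background poller. */
typedef struct {
    const char *name;
    bool healthy;      /* last health check passed */
    bool model_warm;   /* requested model already resident */
    double vram_free;  /* fraction of VRAM free, 0.0..1.0 */
    int inflight;      /* requests currently in flight */
    double rtt_ms;     /* recent round-trip latency */
} node_state;

/* Weighted readiness score: higher is better. The weights here
 * are made-up placeholders, not Travis's real coefficients. */
static double readiness(const node_state *n)
{
    double score = 0.0;
    score += n->model_warm ? 50.0 : 0.0;  /* strong warm-model preference */
    score += 20.0 * n->vram_free;         /* reward VRAM headroom */
    score -= 5.0 * (double)n->inflight;   /* queue-depth penalty */
    score -= 0.1 * n->rtt_ms;             /* latency penalty */
    return score;
}

/* Pick the highest-scoring healthy node, or -1 if none is usable. */
static int pick_node(const node_state *nodes, size_t count)
{
    int best = -1;
    double best_score = 0.0;
    for (size_t i = 0; i < count; i++) {
        if (!nodes[i].healthy)
            continue;  /* health-aware bypass */
        double s = readiness(&nodes[i]);
        if (best < 0 || s > best_score) {
            best = (int)i;
            best_score = s;
        }
    }
    return best;
}
```

A warm node with moderate load outscores a cold node with more free VRAM, and an unhealthy node is never considered, which matches the routing behavior described above.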
This is not a generic reverse proxy with AI copy attached to it. It is a small routing layer designed specifically for Ollama and GPU-backed inference traffic.
Two shell commands install Travis and register it with systemd. Your existing Ollama clients keep pointing at the same port, so no reconfiguration is needed.
```sh
curl -fsSL https://get.salloq.com/travis/install.sh | sh
sudo systemctl enable --now travis
```
```sh
curl http://127.0.0.1:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "prompt": "hello from Travis",
    "stream": false
  }'
```
```toml
# Address Travis listens on; clients use it exactly as they would Ollama.
listen = "0.0.0.0:11434"

# Polling cadence and timeouts. These are global settings, so in TOML
# they must appear before the first [node.*] table; placed after one,
# they would be scoped to that node instead.
poll_interval_ms = 500
connect_timeout_ms = 750
request_timeout_ms = 120000

[node.edge-sac-01]
url = "http://10.0.0.21:11434"

[node.phx-gpu-02]
url = "http://10.0.0.22:11434"

[node.chi-gpu-01]
url = "http://10.0.0.23:11434"
```
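Health checks driven by a poll interval this short usually need hysteresis, so that a single dropped probe does not eject a node and a single lucky probe does not reinstate one. A minimal C sketch of that idea, with thresholds and state that are assumptions for illustration, not Travis's actual policy:

```c
#include <stdbool.h>

/* Illustrative thresholds: unhealthy after 3 consecutive failed
 * probes, healthy again after 2 consecutive successes. */
enum { FAIL_LIMIT = 3, OK_LIMIT = 2 };

typedef struct {
    int fails;    /* consecutive failed probes */
    int oks;      /* consecutive successful probes */
    bool healthy; /* current routing eligibility */
} health;

/* Record one probe result and return the node's current health. */
static bool record_probe(health *h, bool probe_ok)
{
    if (probe_ok) {
        h->oks++;
        h->fails = 0;
        if (!h->healthy && h->oks >= OK_LIMIT)
            h->healthy = true;
    } else {
        h->fails++;
        h->oks = 0;
        if (h->healthy && h->fails >= FAIL_LIMIT)
            h->healthy = false;
    }
    return h->healthy;
}
```

With this shape, a node stays routable through transient blips but is bypassed quickly once failures become consecutive, matching the mid-retry bypass behavior described earlier.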
If you are running more than one inference node, you are usually picking a node by hand or relying on a strategy that does not understand model warmth, VRAM pressure, or real node readiness.
| Approach | Model awareness | Capacity awareness | Failure handling |
|---|---|---|---|
| Manual node selection | No | Operator guesswork | Manual intervention |
| Generic round-robin proxy | No | Weak | Generic retry only |
| Travis | Warm-model preference | VRAM + inflight + latency | Health-aware bypass and retry |
Built in C. Designed for Ollama. A practical way to turn a pile of inference nodes into something that behaves like infrastructure.