Inference infrastructure

Intelligent load balancing for Ollama, built for clusters that cannot afford guesswork

Travis routes each request using live GPU state, model residency, queue depth, and node health. It prefers warm models, avoids overloaded nodes, and fails over before clients see the damage.

Model-aware routing · Automatic failover · Single C binary · Drop-in for Ollama

Warm-first: prefers nodes that already hold the model in VRAM
Live score: routes on pressure, inflight count, and latency
No runtime: single binary, systemd-friendly deployment
Failover: unhealthy nodes are bypassed automatically
The problem

Most load balancers do not know what your GPUs are actually doing

Traditional proxies see requests. Inference clusters have to care about what model is already warm, how much VRAM is left, how many jobs are in flight, and whether a node is degrading right now.

01

Cold-load latency

Round-robin routing sends requests to nodes that do not have the model loaded, so every miss pays the full cost of pulling gigabytes of weights into VRAM before the first token.

02

Blind capacity decisions

Standard load balancers do not account for VRAM pressure, queue depth, or the real cost of one more inference request.

03

Fragile failover

By the time a generic proxy notices trouble, the client has often already paid for it with latency or an outright failure.

How it works

A scoring layer that understands inference instead of pretending every node is identical

Travis continuously polls the cluster, computes a readiness score for each node, and sends the request to the node most likely to answer quickly and cleanly.

Poll node state

Background polling tracks model residency, VRAM pressure, in-flight requests, health checks, and round-trip latency.

Score readiness

Those signals are weighted into a single routing decision, with preference for nodes that already have the requested model warm.

Route and retry

If a node goes bad mid-request, Travis retries against a healthy node, so the client is less likely to see the failure at all.
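The scoring step above can be sketched roughly as follows. The struct fields, weights, and thresholds here are illustrative assumptions for the sake of the example, not Travis's actual internals:

```c
#include <stdbool.h>

/* Illustrative node state, as a background poller might track it. */
typedef struct {
    bool   healthy;        /* passing health checks right now */
    bool   model_warm;     /* requested model resident in VRAM */
    double vram_pressure;  /* 0.0 (idle) .. 1.0 (full) */
    int    inflight;       /* requests currently being served */
    double latency_ema_ms; /* smoothed round-trip latency */
} node_state;

/* Higher score = more ready. Weights are made up for illustration. */
static double readiness_score(const node_state *n)
{
    if (!n->healthy)
        return -1.0;                    /* unhealthy nodes are bypassed */

    double score = 100.0;
    if (n->model_warm)
        score += 50.0;                  /* warm-first preference */
    score -= 40.0 * n->vram_pressure;   /* penalise memory pressure */
    score -= 5.0  * (double)n->inflight;/* penalise queued work */
    score -= 0.1  * n->latency_ema_ms;  /* penalise slow/distant nodes */
    return score;
}
```

The router then simply picks the node with the highest score; an unhealthy node can never win because its score is negative.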

Capabilities

Built for operators running real inference clusters

This is not a generic reverse proxy with AI copy attached to it. It is a small routing layer designed specifically for Ollama and GPU-backed inference traffic.

What Travis does

  • Prefers nodes with the requested model already loaded in VRAM
  • Combines VRAM pressure, inflight count, and latency into one routing score
  • Detects unhealthy nodes and routes around them automatically
  • Naturally penalises distant nodes through an exponential moving average (EMA) of round-trip latency
  • Runs as a single C binary with no runtime dependency chain
  • Listens like Ollama, so existing clients do not need reconfiguration
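The latency EMA mentioned above is the standard exponential moving average; a minimal sketch, where the smoothing factor is an assumption rather than Travis's actual value:

```c
/* Exponential moving average of round-trip latency.
 * alpha near 1.0 reacts quickly to change; near 0.0 smooths heavily.
 * The 0.2 here is an illustrative choice, not Travis's real constant. */
static double update_latency_ema(double ema_ms, double sample_ms)
{
    const double alpha = 0.2;
    return alpha * sample_ms + (1.0 - alpha) * ema_ms;
}
```

Because each new sample only shifts the average partway, a distant node's consistently higher round trips accumulate into a lasting penalty, while a single slow poll does not condemn a nearby node.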

Why it matters

  • Fewer cold starts and better warm-hit rate
  • Lower tail latency under uneven load
  • Better effective utilisation of expensive GPU capacity
  • Cleaner multi-node deployments across regions or sites
  • Less operator babysitting when a node goes sideways
  • A practical clustering answer for Ollama today

Install

Small enough to deploy quickly, transparent enough to trust

A single shell command installs Travis and registers it with systemd. Your existing Ollama clients point at the same port — no reconfiguration needed.

Install
curl -fsSL https://get.salloq.com/travis/install.sh | sh
sudo systemctl enable --now travis
Generate a request through Travis
curl http://127.0.0.1:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "prompt": "hello from Travis",
    "stream": false
  }'
Example config
listen = "0.0.0.0:11434"

[node.edge-sac-01]
url = "http://10.0.0.21:11434"

[node.phx-gpu-02]
url = "http://10.0.0.22:11434"

[node.chi-gpu-01]
url = "http://10.0.0.23:11434"

poll_interval_ms    = 500
connect_timeout_ms  = 750
request_timeout_ms  = 120000
The honest version

There is no usable clustering layer for Ollama, so people route blind

If you are running more than one inference node, you are usually picking a node by hand or relying on a strategy that does not understand model warmth, VRAM pressure, or real node readiness.

Approach                   Model awareness        Capacity awareness         Failure handling
Manual node selection      No                     Operator guesswork         Manual intervention
Generic round-robin proxy  No                     Weak                       Generic retry only
Travis                     Warm-model preference  VRAM + inflight + latency  Health-aware bypass and retry

Run your inference cluster like a system, not a guess

Built in C. Designed for Ollama. A practical way to turn a pile of inference nodes into something that behaves like infrastructure.