Inference infrastructure

Intelligent load balancing for Ollama, built for clusters that cannot afford guesswork

Travis routes each request using live GPU state, model residency, queue depth, and node health. It prefers warm models, avoids overloaded nodes, and fails over before clients see the damage.

Model-aware routing · Automatic failover · Single C binary · Drop-in for Ollama

Warm-first: prefers nodes that already hold the model in VRAM
Live score: routes on pressure, inflight count, and latency
No runtime: single binary, systemd-friendly deployment
Failover: unhealthy nodes are bypassed automatically
The problem

Most load balancers do not know what your GPUs are actually doing

Traditional proxies see requests. Inference clusters have to care about what model is already warm, how much VRAM is left, how many jobs are in flight, and whether a node is degrading right now.

01

Cold-load latency

Round-robin routing sends requests to nodes that do not have the model loaded, so every miss pays the full cost of pulling gigabytes of weights into VRAM before the first token.

02

Blind capacity decisions

Standard load balancers do not account for VRAM pressure, queue depth, or the real cost of one more inference request.

03

Fragile failover

By the time a generic proxy notices trouble, the client has often already paid for it with latency or an outright failure.

How it works

A scoring layer that understands inference instead of pretending every node is identical

Travis continuously polls the cluster, computes a readiness score for each node, and sends the request to the node most likely to answer quickly and cleanly.

Poll node state

Background polling tracks model residency, VRAM pressure, in-flight requests, health checks, and round-trip latency.

Score readiness

Those signals are weighted into a single routing decision, with preference for nodes that already have the requested model warm.

Route and retry

If a node goes bad mid-request, Travis retries against a healthy node, so the client is less likely to see the failure at all.
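The scoring step above can be sketched roughly as follows. The struct fields, weights, and thresholds here are illustrative assumptions for the sake of the example, not Travis's actual internals:

```c
#include <stdbool.h>

/* Illustrative node state, as a background poller might track it. */
typedef struct {
    bool   healthy;        /* passing health checks right now */
    bool   model_warm;     /* requested model resident in VRAM */
    double vram_pressure;  /* 0.0 (idle) .. 1.0 (full) */
    int    inflight;       /* requests currently being served */
    double latency_ema_ms; /* smoothed round-trip latency */
} node_state;

/* Higher score = more ready. Weights are made up for illustration. */
static double readiness_score(const node_state *n)
{
    if (!n->healthy)
        return -1.0;                    /* unhealthy nodes are bypassed */

    double score = 100.0;
    if (n->model_warm)
        score += 50.0;                  /* warm-first preference */
    score -= 40.0 * n->vram_pressure;   /* penalise memory pressure */
    score -= 5.0  * (double)n->inflight;/* penalise queued work */
    score -= 0.1  * n->latency_ema_ms;  /* penalise slow/distant nodes */
    return score;
}
```

The router then simply picks the node with the highest score; an unhealthy node can never win because its score is negative.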

Capabilities

Built for operators running real inference clusters

This is not a generic reverse proxy with AI copy attached to it. It is a small routing layer designed specifically for Ollama and GPU-backed inference traffic.

What Travis does

  • Prefers nodes with the requested model already loaded in VRAM
  • Combines VRAM pressure, inflight count, and latency into one routing score
  • Detects unhealthy nodes and routes around them automatically
  • Naturally penalises distant nodes through an exponential moving average (EMA) of round-trip latency
  • Runs as a single C binary with no runtime dependency chain
  • Listens like Ollama, so existing clients do not need reconfiguration
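The latency EMA mentioned above is the standard exponential moving average; a minimal sketch, where the smoothing factor is an assumption rather than Travis's actual value:

```c
/* Exponential moving average of round-trip latency.
 * alpha near 1.0 reacts quickly to change; near 0.0 smooths heavily.
 * The 0.2 here is an illustrative choice, not Travis's real constant. */
static double update_latency_ema(double ema_ms, double sample_ms)
{
    const double alpha = 0.2;
    return alpha * sample_ms + (1.0 - alpha) * ema_ms;
}
```

Because each new sample only shifts the average partway, a distant node's consistently higher round trips accumulate into a lasting penalty, while a single slow poll does not condemn a nearby node.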

Why it matters

  • Fewer cold starts and better warm-hit rate
  • Lower tail latency under uneven load
  • Better effective utilisation of expensive GPU capacity
  • Cleaner multi-node deployments across regions or sites
  • Less operator babysitting when a node goes sideways
  • A practical clustering answer for Ollama today

Install

Small enough to deploy quickly, transparent enough to trust

A single shell command installs Travis and registers it with systemd. Your existing Ollama clients point at the same port — no reconfiguration needed.

Install
curl -fsSL https://get.salloq.com/travis/install.sh | sh
sudo systemctl enable --now travis
Generate a request through Travis
curl http://127.0.0.1:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "prompt": "hello from Travis",
    "stream": false
  }'
Example config
listen = "0.0.0.0:11434"

[node.edge-sac-01]
url = "http://10.0.0.21:11434"

[node.phx-gpu-02]
url = "http://10.0.0.22:11434"

[node.chi-gpu-01]
url = "http://10.0.0.23:11434"

poll_interval_ms    = 500
connect_timeout_ms  = 750
request_timeout_ms  = 120000
The honest version

There is no usable clustering layer for Ollama, so people route blind

If you are running more than one inference node, you are usually picking a node by hand or relying on a strategy that does not understand model warmth, VRAM pressure, or real node readiness.

Approach                   Model awareness        Capacity awareness         Failure handling
Manual node selection      No                     Operator guesswork         Manual intervention
Generic round-robin proxy  No                     Weak                       Generic retry only
Travis                     Warm-model preference  VRAM + inflight + latency  Health-aware bypass and retry

Run your inference cluster like a system, not a guess

Built in C. Designed for Ollama. A practical way to turn a pile of inference nodes into something that behaves like infrastructure.