Option A — Ollama (Easiest)
Ollama provides a one-command local model server with an OpenAI-compatible API.
# Install Ollama (Linux)
curl -fsSL https://ollama.com/install.sh | sh
# Pull a model — Qwen2.5 14B is excellent for most tasks
ollama pull qwen2.5:14b
# Or smaller models for low-RAM servers:
# ollama pull qwen2.5:7b
# ollama pull llama3.2:3b
# ollama pull phi4-mini
# Verify it's running
curl http://127.0.0.1:11434/api/tags{
"providers": {
"ollama-local": {
"type": "openai-compatible",
"base_url": "http://127.0.0.1:11434/v1",
"api_key": "ollama",
"models": {
"qwen2.5:14b": {
"alias": "local-smart",
"max_tokens": 8192
},
"qwen2.5:7b": {
"alias": "local-fast",
"max_tokens": 4096
}
}
}
}
}Option B — llama.cpp Server
For maximum control over quantization and hardware utilization.
# Build (requires cmake, gcc)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON # remove CUDA flag if no GPU
cmake --build build -j$(nproc) --config Release
# Download a GGUF model (example: Qwen2.5 7B Q4)
wget https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GGUF/resolve/main/qwen2.5-7b-instruct-q4_k_m.gguf
# Start server (OpenAI-compatible endpoint)
./build/bin/llama-server \
--model qwen2.5-7b-instruct-q4_k_m.gguf \
--host 127.0.0.1 \
--port 8080 \
--ctx-size 8192 \
--n-gpu-layers 35 # adjust for your GPU VRAM{
"providers": {
"llamacpp": {
"type": "openai-compatible",
"base_url": "http://127.0.0.1:8080/v1",
"api_key": "sk-local",
"models": {
"qwen2.5-7b": {
"alias": "local",
"max_tokens": 4096,
"temperature": 0.7
}
}
}
}
}Option C — Any OpenAI-Compatible Endpoint
LM Studio, Jan, vLLM, TabbyAPI, and dozens of other tools expose an OpenAI-compatible API. The same config pattern applies:
{
"providers": {
"my-local-server": {
"type": "openai-compatible",
"base_url": "http://127.0.0.1:1234/v1",
"api_key": "not-needed",
"models": {
"local-model": {
"alias": "local"
}
}
}
}
}Performance Tuning
| RAM | Recommended Model | Quality |
|---|---|---|
| 4 GB | phi4-mini, llama3.2:3b | Basic |
| 8 GB | qwen2.5:7b, mistral:7b | Good |
| 16 GB | qwen2.5:14b, deepseek-r1:14b | Very Good |
| 32 GB+ | qwen2.5:32b, deepseek-r1:32b | Excellent |
Hybrid Setup: Local + Cloud Fallback
Use local models for most requests, fall back to cloud APIs for complex tasks:
{
"routing": {
"strategy": "fallback",
"models": [
{ "id": "ollama-local/qwen2.5:14b", "timeout_ms": 60000, "on_error": "next" },
{ "id": "openai/gpt-4o-mini", "on_error": "fail" }
]
}
}What's Next?
- Non-OpenAI Cloud Models — cost-effective API alternatives
- Browse Provider Templates — ready-made provider configs
- Security Hardening — lock down your local endpoint