Inference

Deploy models for real-time and batch inference with auto-scaling.

Auto-scaling inference endpoints and batch inference are now generally available. Together they let you scale serving capacity dynamically with demand and process large offline workloads efficiently.

Inference Endpoints

Inference endpoints are managed HTTP services that serve your models in production. Each endpoint is backed by one or more vLLM worker replicas running on dedicated GPU instances. The platform handles load balancing, health checks, and automatic failover.

Endpoints expose an OpenAI-compatible API, so you can use existing client libraries and tools without modification. The platform adds authentication, rate limiting, and usage tracking on top.
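When a client hits its rate limit, the error typically surfaces as an HTTP 429-style exception, which the usual exponential-backoff pattern handles well. A minimal sketch of that pattern — the `RateLimited` exception and the `flaky` call below are illustrative stand-ins, not part of any Riven SDK:

```python
import time

class RateLimited(Exception):
    """Stand-in for a rate-limit (HTTP 429) error from the endpoint."""

def with_retries(call, max_attempts=4, base_delay=0.1):
    """Invoke call(); on rate-limit errors, retry with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            time.sleep(base_delay * 2 ** attempt)

# Fake call that is throttled twice, then succeeds
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RateLimited()
    return "ok"

print(with_retries(flaky))  # ok
```

The same wrapper works around any SDK call, including the chat completions example below.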

OpenAI-Compatible API

All inference endpoints are compatible with the OpenAI chat completions API format. You can use the official OpenAI Python/Node.js SDKs by pointing them at your Riven endpoint:

openai-client.py
python
from openai import OpenAI
 
client = OpenAI(
    base_url="https://api.riven-ai.dev/v1",
    api_key="<your-riven-api-key>",
)
 
response = client.chat.completions.create(
    model="qwen3-8b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain Kubernetes in one sentence."},
    ],
    max_tokens=100,
    temperature=0.7,
)
 
print(response.choices[0].message.content)

Response Format

Response
json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1711648200,
  "model": "qwen3-8b",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Kubernetes is an open-source container orchestration platform that automates deploying, scaling, and managing containerized applications across clusters of machines."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 24,
    "completion_tokens": 28,
    "total_tokens": 52
  }
}
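A response body in this shape can be unpacked with a few lines of standard-library code. A small sketch — the `parse_completion` helper is illustrative, not part of any SDK:

```python
import json

# Example response body, matching the format shown above
raw = """{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1711648200,
  "model": "qwen3-8b",
  "choices": [
    {"index": 0,
     "message": {"role": "assistant",
                 "content": "Kubernetes is an open-source container orchestration platform."},
     "finish_reason": "stop"}
  ],
  "usage": {"prompt_tokens": 24, "completion_tokens": 28, "total_tokens": 52}
}"""

def parse_completion(body: str) -> tuple[str, int]:
    """Return (assistant_text, total_tokens) from a chat completion response."""
    data = json.loads(body)
    text = data["choices"][0]["message"]["content"]
    return text, data["usage"]["total_tokens"]

text, total = parse_completion(raw)
print(total)  # 52
```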

Deploying an Endpoint

Deploy a registered model as an inference endpoint using the CLI:

Terminal
bash
# Deploy a model as an inference endpoint
riven inference deploy \
  --model my-model \
  --version 1.0.0 \
  --replicas 2 \
  --gpu-type a10g
 
# Check endpoint status
riven inference status my-model
 
# Test the endpoint
curl -X POST https://api.riven-ai.dev/v1/inference/my-model/generate \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello, world!", "max_tokens": 100}'

The deployment process pulls model artifacts from the registry, provisions GPU nodes if needed, and starts the vLLM serving containers. The endpoint becomes available once at least one replica passes its health check.
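That readiness condition can be scripted: poll status until at least one replica reports healthy. A sketch with the status source injected as a callable, since the exact shape of the real status API is an assumption here:

```python
import time

def wait_until_ready(get_status, timeout_s=300, poll_s=1.0):
    """Poll an endpoint-status callable until at least one replica is healthy.

    get_status is any callable returning a dict shaped like
    {"replicas": [{"healthy": True}, ...]} -- an assumed shape for
    illustration, not the documented status API.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = get_status()
        if any(r.get("healthy") for r in status.get("replicas", [])):
            return True
        time.sleep(poll_s)
    return False

# Fake status source standing in for `riven inference status`:
# the second replica comes up healthy on the second poll
states = iter([
    {"replicas": [{"healthy": False}]},
    {"replicas": [{"healthy": False}, {"healthy": True}]},
])
print(wait_until_ready(lambda: next(states), timeout_s=5, poll_s=0.01))  # True
```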

Auto-scaling

Inference endpoints support horizontal auto-scaling based on request queue depth. When the pending request queue exceeds a configurable threshold, the platform automatically adds replicas. When demand drops, it scales back down to save resources.

autoscale-config.yaml
yaml
endpoint: my-model
autoscaling:
  enabled: true
  min_replicas: 1
  max_replicas: 8
  target_queue_depth: 10        # scale up when queue exceeds this
  scale_up_cooldown: 60s
  scale_down_cooldown: 300s
  scale_down_delay: 600s        # wait before removing idle replicas
Terminal
bash
# Enable auto-scaling on an existing endpoint
riven inference autoscale my-model --config autoscale-config.yaml
 
# View current scaling status
riven inference autoscale status my-model
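Conceptually, each evaluation cycle compares the pending queue depth against target_queue_depth and moves the replica count one step toward demand. A simplified sketch of that decision — cooldowns and the scale-down delay are omitted, and the real controller logic is internal to the platform:

```python
def desired_replicas(current, queue_depth, min_replicas=1, max_replicas=8,
                     target_queue_depth=10):
    """Replica count a queue-depth autoscaler would aim for next cycle.

    Scale up while the pending queue exceeds the target; scale down when
    it falls well below it; otherwise hold steady.
    """
    if queue_depth > target_queue_depth and current < max_replicas:
        return current + 1
    if queue_depth < target_queue_depth // 2 and current > min_replicas:
        return current - 1
    return current

print(desired_replicas(2, queue_depth=25))  # 3  (queue over target: add a replica)
print(desired_replicas(8, queue_depth=25))  # 8  (already at max_replicas)
print(desired_replicas(3, queue_depth=1))   # 2  (queue nearly empty: remove one)
```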

Batch Inference

For offline processing of large datasets, use batch inference. Batch jobs read input from S3, process all records through the model, and write results back to S3. This is significantly more cost-effective than real-time endpoints for bulk workloads.

Terminal
bash
# Submit a batch inference job
riven inference batch \
  --model my-model \
  --version 1.0.0 \
  --input s3://data/input.jsonl \
  --output s3://data/output.jsonl \
  --concurrency 4
 
# Check batch job status
riven inference batch status <job-id>

Batch Input/Output Format

Input and output files use JSONL format:

input.jsonl
json
{"id": "req-001", "prompt": "Summarize this code: def add(a, b): return a + b", "max_tokens": 100}
{"id": "req-002", "prompt": "Summarize this code: class Stack: ...", "max_tokens": 100}
output.jsonl
json
{"id": "req-001", "output": "A simple function that returns the sum of two numbers.", "tokens_used": 12, "status": "success"}
{"id": "req-002", "output": "A stack data structure implementation with push and pop operations.", "tokens_used": 14, "status": "success"}
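Because each output record carries the id of its input, results can be joined back to their prompts without relying on line order, which may differ from the input under concurrency. A small sketch with the example records inlined:

```python
inputs = [
    {"id": "req-001", "prompt": "Summarize this code: def add(a, b): return a + b"},
    {"id": "req-002", "prompt": "Summarize this code: class Stack: ..."},
]
outputs = [  # note: arrived out of order
    {"id": "req-002", "output": "A stack data structure implementation.", "status": "success"},
    {"id": "req-001", "output": "A simple function that adds two numbers.", "status": "success"},
]

def join_results(inputs, outputs):
    """Match output records back to their input prompts by id."""
    by_id = {rec["id"]: rec for rec in outputs}
    return [
        {"prompt": inp["prompt"], **by_id[inp["id"]]}
        for inp in inputs
        if inp["id"] in by_id  # skip inputs with no result yet
    ]

for row in join_results(inputs, outputs):
    print(row["id"], row["status"])
```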

Batch jobs support automatic checkpointing — if a job is interrupted, it resumes from the last completed batch rather than reprocessing the entire dataset.
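The resume behavior can be pictured as skipping whole completed batches. A sketch assuming a simple completed-batch counter as the checkpoint — the real checkpoint format is internal:

```python
def resume_index(completed_batches: int, batch_size: int, total_records: int) -> int:
    """Index of the first record to process when resuming an interrupted job.

    Records in fully completed batches are skipped; a partially processed
    batch is reprocessed from its start.
    """
    return min(completed_batches * batch_size, total_records)

# 1000 records, batch size 64, interrupted after 7 complete batches:
print(resume_index(7, 64, 1000))  # 448 -- resume at record 448, not 0
```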

Performance Tuning

Several techniques can improve inference throughput and reduce latency:

  • Quantization — Reduce model precision from FP16 to INT8 or INT4. This cuts memory usage and increases throughput with minimal quality loss for most models.
  • Speculative Decoding — Use a smaller draft model to predict multiple tokens, then verify them with the full model. This can improve throughput by 2-3x for compatible architectures.
  • Continuous Batching — vLLM schedules requests continuously, admitting new ones as soon as slots free up rather than waiting for the current batch to complete; its PagedAttention KV-cache management keeps this memory-efficient.
  • KV Cache Optimization — Tune the GPU memory utilization parameter to balance between KV cache size and model weights.
Terminal
bash
# Deploy with INT8 quantization
riven inference deploy \
  --model my-model \
  --version 1.0.0 \
  --quantization int8
 
# Enable speculative decoding with a draft model
riven inference deploy \
  --model my-model \
  --version 1.0.0 \
  --speculative-model my-model-draft \
  --num-speculative-tokens 5
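As a back-of-envelope for the quantization bullet above, weight-only memory scales with bytes per parameter: roughly 2 for FP16, 1 for INT8, 0.5 for INT4. This ignores the KV cache, activations, and any quantization overhead, so treat it as a lower bound:

```python
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(num_params_b: float, dtype: str) -> float:
    """Rough weight-only footprint in GB for a model of num_params_b
    billion parameters (KV cache and activations not included)."""
    return num_params_b * 1e9 * BYTES_PER_PARAM[dtype] / 1e9

# An 8B-parameter model, weights only:
print(weight_memory_gb(8, "fp16"))  # 16.0 GB
print(weight_memory_gb(8, "int8"))  # 8.0 GB
print(weight_memory_gb(8, "int4"))  # 4.0 GB
```

This is one way to gauge whether a quantized model fits a given GPU's memory alongside its KV cache.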

Run riven inference benchmark my-model to measure latency and throughput across different configurations before deploying to production.