Observability
Full-stack observability with metrics, logs, and traces.
Observability Stack
The Riven platform uses a fully integrated observability stack deployed in the monitoring namespace. The stack provides three pillars of observability — metrics, logs, and traces — unified through Grafana as the single pane of glass.
┌─────────────────────────────────────────┐
│ Grafana (Dashboards + Explore) │
├────────────┬────────────┬───────────────┤
│ Prometheus │ Loki │ Jaeger │
│ (metrics) │ (logs) │ (traces) │
├────────────┴────────────┴───────────────┤
│ Alloy (unified collection agent) │
├─────────────────────────────────────────┤
│ ServiceMonitors · Pod Logs · OTLP │
└─────────────────────────────────────────┘
All components are deployed via the observability Helm charts and are pre-configured to integrate with each other. Grafana data sources for Prometheus, Loki, and Jaeger are provisioned automatically.
Grafana Dashboards
Grafana serves as the unified dashboard for all observability data. The platform ships with a set of pre-built dashboards that cover the most common monitoring scenarios:
- Service Health — Request rates, error rates, latency percentiles (p50/p95/p99), and saturation metrics for each service.
- Kubernetes Cluster — Node utilization, pod status, resource consumption, and scheduling metrics across all namespaces.
- CI/CD Pipeline — Build durations, success rates, deployment frequency, and lead time for changes.
- AI Platform — Model inference latency, throughput, GPU utilization, and queue depths.
Dashboards are provisioned as code via ConfigMaps in the Helm chart. To add a custom dashboard, export it as JSON from Grafana and add it to the chart's dashboards/ directory.
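As a sketch, the resulting ConfigMap might look like the following. The resource name, namespace, and the grafana_dashboard discovery label are assumptions based on the common Grafana sidecar convention — check the chart for the actual labels it uses:

```yaml
# Hypothetical ConfigMap wrapping an exported dashboard JSON file.
# The grafana_dashboard label is the conventional label the Grafana
# sidecar watches for; the platform chart may use a different one.
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-service-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"   # assumed sidecar discovery label
data:
  my-service.json: |-
    {{ .Files.Get "dashboards/my-service.json" | indent 4 }}
```

With this pattern, dropping the exported JSON into the chart's dashboards/ directory is enough for Grafana to pick up the dashboard on the next sync, with no manual import step.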
Prometheus Metrics
Prometheus collects metrics from all services using the ServiceMonitor CRD. Each service deployed with the base Helm charts automatically gets a ServiceMonitor that targets the /metrics endpoint at a 15-second scrape interval.
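The generated ServiceMonitor looks roughly like this; the selector labels and port name are illustrative assumptions, while the /metrics path and 15-second interval match the platform defaults described above:

```yaml
# Sketch of the ServiceMonitor the base chart generates per service.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-service
  namespace: dev-center       # assumed workload namespace
spec:
  selector:
    matchLabels:
      app: my-service         # assumed service label
  endpoints:
    - port: http              # assumed name of the port serving /metrics
      path: /metrics
      interval: 15s           # platform scrape interval
```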
# Request rate per service
sum(rate(http_requests_total[5m])) by (service)
# p99 latency
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
# Error rate percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m])) * 100
# CPU usage by namespace
sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace)
Prometheus retention is set to 15 days for raw metrics. For longer-term storage, metrics are downsampled and shipped to an S3-backed Thanos instance.
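The error-rate query above is the natural basis for an alert. A hedged PrometheusRule sketch — the 5% threshold, severity label, and resource names are illustrative, not platform defaults:

```yaml
# Hypothetical alerting rule built on the error-rate query above.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: service-error-rate
  namespace: monitoring
spec:
  groups:
    - name: service-health
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
              / sum(rate(http_requests_total[5m])) by (service) * 100 > 5
          for: 10m
          labels:
            severity: warning   # illustrative severity
          annotations:
            summary: "{{ $labels.service }} 5xx rate above 5% for 10 minutes"
```

Note the added by (service) grouping, which turns the aggregate percentage from the example query into a per-service alert.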
Loki Logs
Loki provides centralized log aggregation with a label-based indexing approach that keeps costs low while maintaining fast query performance. Alloy runs as a DaemonSet on every node, collecting container logs and forwarding them to Loki.
# All logs from a service
{namespace="dev-center", app="my-service"}
# Filter for errors with JSON parsing
{namespace="dev-center"} | json | level="error"
# Count errors per service over time
sum by (app) (count_over_time(
{namespace="dev-center"} |= "error" [1h]
))
# Tail logs with regex filter
{app="my-service"} |~ "timeout|connection refused"
Loki stores logs for 30 days by default. Logs older than 30 days are automatically purged. For compliance-sensitive services, extend retention via the loki.retention Helm value.
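Extending retention might look like the following values override; the exact key layout under loki.retention is an assumption about this chart's schema, so verify against the chart's values file:

```yaml
# Hypothetical values.yaml override for a compliance-sensitive service.
loki:
  retention: 90d   # assumed value format; platform default is 30 days
```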
Jaeger Traces
Jaeger provides distributed tracing across all services in the platform. Services instrumented with OpenTelemetry export traces via the OTLP protocol, and the Connect RPC interceptors automatically propagate trace context across service boundaries.
Traces are accessible through both the Jaeger UI and the Grafana Explore interface, where you can correlate traces with metrics and logs for comprehensive debugging.
# Environment variables for OTLP export
OTEL_SERVICE_NAME=my-service
OTEL_EXPORTER_OTLP_ENDPOINT=http://alloy.monitoring:4317
OTEL_EXPORTER_OTLP_PROTOCOL=grpc
OTEL_TRACES_SAMPLER=parentbased_traceidratio
OTEL_TRACES_SAMPLER_ARG=0.1
By default, traces are sampled at 10% to balance observability with storage costs. Critical paths can override the sampling rate using the OTEL_TRACES_SAMPLER_ARG environment variable or by setting sampling decisions at the application level.
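In a Kubernetes Deployment, the per-service override is just the same variables in the container's env block. The service name and the 100% sample rate below are examples for a hypothetical critical path, not platform defaults:

```yaml
# Hypothetical container env override raising the sample rate
# for a latency-critical service.
env:
  - name: OTEL_SERVICE_NAME
    value: checkout-service             # illustrative service name
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: http://alloy.monitoring:4317
  - name: OTEL_TRACES_SAMPLER
    value: parentbased_traceidratio
  - name: OTEL_TRACES_SAMPLER_ARG
    value: "1.0"                        # sample 100% instead of the 10% default
```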
Use the Grafana "Traces to Logs" and "Traces to Metrics" features to jump directly from a slow trace span to the corresponding logs and resource utilization metrics.
Next Steps
- Kubernetes — Cluster architecture and workload management.
- CI/CD — Continuous integration and deployment pipelines.
- Infrastructure Overview — Cloud architecture and AWS services.