Monitoring
Monitor service health, performance, and logs in real-time.
Monitoring Stack
The Platform's monitoring is built on an integrated observability stack consisting of Grafana, Prometheus, Loki, and Jaeger. Together these tools provide complete visibility into service health, performance metrics, logs, and distributed traces.
┌──────────────────────────────────────┐
│             Grafana (UI)             │
├──────────┬───────────┬───────────────┤
│Prometheus│   Loki    │    Jaeger     │
│ Metrics  │   Logs    │    Traces     │
├──────────┴───────────┴───────────────┤
│       Alloy (Collection Agent)       │
├──────────────────────────────────────┤
│         Kubernetes Workloads         │
└──────────────────────────────────────┘

All components are deployed as Helm-managed workloads in the monitoring namespace and are pre-configured to work together out of the box.
Metrics
Metrics are collected by Prometheus using ServiceMonitor custom resources. Each service deployed with the base Helm charts automatically gets a ServiceMonitor that scrapes the standard /metrics endpoint.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-service
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app: my-service
  endpoints:
    - port: http
      path: /metrics
      interval: 15s

Standard metrics include HTTP request rates, response latencies (p50, p95, p99), error rates, and resource utilization. Custom application metrics can be added using the Prometheus client library.
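To illustrate what a client library exposes, here is a minimal hand-rolled /metrics endpoint emitting the Prometheus text exposition format. This is a standard-library sketch for illustration only; a real service should use an official Prometheus client library rather than this `Metrics` class, which is hypothetical.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from threading import Lock

class Metrics:
    """Tiny hand-rolled counter registry (illustrative; use a real
    Prometheus client library in production)."""
    def __init__(self):
        self._lock = Lock()
        self._counters = {}  # (name, sorted label pairs) -> float value

    def inc(self, name, labels=(), amount=1.0):
        with self._lock:
            key = (name, tuple(sorted(labels)))
            self._counters[key] = self._counters.get(key, 0.0) + amount

    def render(self):
        # Prometheus text exposition format: one sample per line,
        # e.g. http_requests_total{status="200"} 1.0
        lines = []
        for (name, labels), value in sorted(self._counters.items()):
            label_str = ",".join(f'{k}="{v}"' for k, v in labels)
            if label_str:
                lines.append(f"{name}{{{label_str}}} {value}")
            else:
                lines.append(f"{name} {value}")
        return "\n".join(lines) + "\n"

metrics = Metrics()

class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the registry at /metrics for Prometheus to scrape."""
    def do_GET(self):
        if self.path != "/metrics":
            self.send_response(404)
            self.end_headers()
            return
        body = metrics.render().encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)
```

Incrementing `metrics.inc("http_requests_total", labels=(("status", "200"),))` then scraping `/metrics` yields the sample line Prometheus expects.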
Use the riven.dev/metrics: "true" label to enable automatic ServiceMonitor creation via the base Helm chart.
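For example, the label can be attached to the Service's metadata. Whether the base chart reads it from the Service or from pod labels depends on your chart's values, so treat this fragment as a sketch:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-service
  labels:
    riven.dev/metrics: "true"   # enables automatic ServiceMonitor creation
```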
Logs
Centralized logging is powered by Loki with Alloy as the collection agent. Alloy runs as a DaemonSet on every node, tailing container logs and shipping them to Loki with structured labels for fast querying.
Logs are queryable through the Grafana Explore interface using LogQL. You can filter by namespace, service, pod, container, and any custom labels attached to the log stream.
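Queries that use the `| json` parser assume the service emits structured JSON log lines on stdout. A minimal sketch of such a logger, using only the Python standard library (the field names are illustrative, not a Platform convention):

```python
import json
import sys
import time

def log(level, message, **fields):
    """Emit one JSON log line to stdout; Alloy tails container stdout
    and ships each line to Loki."""
    record = {"ts": time.time(), "level": level, "msg": message, **fields}
    line = json.dumps(record)
    sys.stdout.write(line + "\n")
    return line  # returned only for convenience in examples/tests

# A 5xx response logged with a numeric status field, so LogQL's
# `| json | status >= 500` filter can match it.
log("error", "upstream timed out", status=504, app="my-service")
```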
# All error logs for a specific service
{namespace="dev-center", app="my-service"} |= "error"
# JSON-structured logs with status filtering
{namespace="dev-center"} | json | status >= 500
# Rate of error logs over time
rate({app="my-service"} |= "error" [5m])

Traces
Distributed tracing is provided by Jaeger, which captures end-to-end request traces across service boundaries. Traces help you understand latency bottlenecks, dependency chains, and error propagation paths.
Services instrumented with OpenTelemetry automatically export traces to Jaeger. The Connect RPC interceptors include built-in trace context propagation, so cross-service traces work without additional configuration.
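Under the hood, trace context propagation means forwarding a W3C `traceparent` header on outgoing requests so downstream spans join the same trace. The OpenTelemetry SDK and the interceptors handle this for you; the sketch below only illustrates the header format, with function names of my own invention:

```python
import re
import secrets

def new_traceparent():
    """Build a W3C traceparent header: version-traceid-spanid-flags."""
    trace_id = secrets.token_hex(16)  # 16 random bytes, hex-encoded
    span_id = secrets.token_hex(8)    # 8 random bytes, hex-encoded
    return f"00-{trace_id}-{span_id}-01"

_TRACEPARENT_RE = re.compile(
    r"^(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<span_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$"
)

def parse_traceparent(header):
    """Return the traceparent fields as a dict, or None if malformed."""
    m = _TRACEPARENT_RE.match(header)
    return m.groupdict() if m else None

def child_header(parent_header):
    """Continue the parent's trace with a fresh span id, as an
    interceptor would before making a downstream call."""
    parent = parse_traceparent(parent_header)
    if parent is None:
        return new_traceparent()  # no valid parent: start a new trace
    return f"00-{parent['trace_id']}-{secrets.token_hex(8)}-{parent['flags']}"
```

Every hop keeps the same `trace_id` but mints a new `span_id`, which is what lets Jaeger stitch the spans into one end-to-end trace.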
Traces are integrated with Grafana, allowing you to jump from a metric spike directly to the relevant traces and then to the associated logs — all in one workflow.
Alerting
Alerting is configured through the Prometheus Alertmanager, with rules defined as PrometheusRule custom resources. Alerts can be routed to multiple notification channels, including Slack, PagerDuty, and email.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-service-alerts
spec:
  groups:
    - name: my-service
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m])) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High error rate on {{ $labels.service }}"
            description: "Error rate is above 5% for the last 5 minutes."

The Platform dashboard surfaces active alerts alongside service health, making it easy to correlate alerts with specific deployments or configuration changes.
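Routing alerts to those channels is configured on the Alertmanager side. A sketch of a route that pages Slack for critical alerts and emails everything else — the receiver names, webhook URL, and addresses are placeholders, not Platform defaults:

```yaml
route:
  receiver: default-email          # fallback for unmatched alerts
  routes:
    - matchers:
        - severity = "critical"
      receiver: slack-oncall

receivers:
  - name: slack-oncall
    slack_configs:
      - api_url: https://hooks.slack.com/services/REPLACE_ME
        channel: "#oncall"
  - name: default-email
    email_configs:
      - to: team@example.com
```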