Monitoring
Monitor service health, performance, and logs in real-time.
Monitoring Stack
The Platform's monitoring is built on an integrated observability stack consisting of Grafana, Prometheus, Loki, and Jaeger. Together these tools provide complete visibility into service health, performance metrics, logs, and distributed traces.
┌──────────────────────────────────────┐
│             Grafana (UI)             │
├──────────┬───────────┬───────────────┤
│Prometheus│   Loki    │    Jaeger     │
│ Metrics  │   Logs    │    Traces     │
├──────────┴───────────┴───────────────┤
│       Alloy (Collection Agent)       │
├──────────────────────────────────────┤
│         Kubernetes Workloads         │
└──────────────────────────────────────┘

All components are deployed as Helm-managed workloads in the monitoring namespace and are pre-configured to work together out of the box.
Metrics
Metrics are collected by Prometheus using ServiceMonitor custom resources. Each service deployed with the base Helm charts automatically gets a ServiceMonitor that scrapes the standard /metrics endpoint.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-service
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app: my-service
  endpoints:
    - port: http
      path: /metrics
      interval: 15s

Standard metrics include HTTP request rates, response latencies (p50, p95, p99), error rates, and resource utilization. Custom application metrics can be added using the Prometheus client library.
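To illustrate what a client library exposes, here is a minimal hand-rolled /metrics endpoint emitting the Prometheus text exposition format. This is a standard-library sketch for illustration only; a real service should use an official Prometheus client library rather than this `Metrics` class, which is hypothetical.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from threading import Lock

class Metrics:
    """Tiny hand-rolled counter registry (illustrative; use a real
    Prometheus client library in production)."""
    def __init__(self):
        self._lock = Lock()
        self._counters = {}  # (name, sorted label pairs) -> float value

    def inc(self, name, labels=(), amount=1.0):
        with self._lock:
            key = (name, tuple(sorted(labels)))
            self._counters[key] = self._counters.get(key, 0.0) + amount

    def render(self):
        # Prometheus text exposition format: one sample per line,
        # e.g. http_requests_total{status="200"} 1.0
        lines = []
        for (name, labels), value in sorted(self._counters.items()):
            label_str = ",".join(f'{k}="{v}"' for k, v in labels)
            if label_str:
                lines.append(f"{name}{{{label_str}}} {value}")
            else:
                lines.append(f"{name} {value}")
        return "\n".join(lines) + "\n"

metrics = Metrics()

class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the registry at /metrics for Prometheus to scrape."""
    def do_GET(self):
        if self.path != "/metrics":
            self.send_response(404)
            self.end_headers()
            return
        body = metrics.render().encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)
```

Incrementing `metrics.inc("http_requests_total", labels=(("status", "200"),))` then scraping `/metrics` yields the sample line Prometheus expects.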
Use the riven.dev/metrics: "true" label to enable automatic ServiceMonitor creation via the base Helm chart.
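For example, the label can be attached to the Service's metadata. Whether the base chart reads it from the Service or from pod labels depends on your chart's values, so treat this fragment as a sketch:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-service
  labels:
    riven.dev/metrics: "true"   # enables automatic ServiceMonitor creation
```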
Logs
Centralized logging is powered by Loki with Alloy as the collection agent. Alloy runs as a DaemonSet on every node, tailing container logs and shipping them to Loki with structured labels for fast querying.
Logs are queryable through the Grafana Explore interface using LogQL. You can filter by namespace, service, pod, container, and any custom labels attached to the log stream.
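Queries that use the `| json` parser assume the service emits structured JSON log lines on stdout. A minimal sketch of such a logger, using only the Python standard library (the field names are illustrative, not a Platform convention):

```python
import json
import sys
import time

def log(level, message, **fields):
    """Emit one JSON log line to stdout; Alloy tails container stdout
    and ships each line to Loki."""
    record = {"ts": time.time(), "level": level, "msg": message, **fields}
    line = json.dumps(record)
    sys.stdout.write(line + "\n")
    return line  # returned only for convenience in examples/tests

# A 5xx response logged with a numeric status field, so LogQL's
# `| json | status >= 500` filter can match it.
log("error", "upstream timed out", status=504, app="my-service")
```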
# All error logs for a specific service
{namespace="dev-center", app="my-service"} |= "error"
# JSON-structured logs with status filtering
{namespace="dev-center"} | json | status >= 500
# Rate of error logs over time
rate({app="my-service"} |= "error" [5m])

Traces
Distributed tracing is provided by Jaeger, which captures end-to-end request traces across service boundaries. Traces help you understand latency bottlenecks, dependency chains, and error propagation paths.
Services instrumented with OpenTelemetry automatically export traces to Jaeger. The Connect RPC interceptors include built-in trace context propagation, so cross-service traces work without additional configuration.
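Under the hood, trace context propagation means forwarding a W3C `traceparent` header on outgoing requests so downstream spans join the same trace. The OpenTelemetry SDK and the interceptors handle this for you; the sketch below only illustrates the header format, with function names of my own invention:

```python
import re
import secrets

def new_traceparent():
    """Build a W3C traceparent header: version-traceid-spanid-flags."""
    trace_id = secrets.token_hex(16)  # 16 random bytes, hex-encoded
    span_id = secrets.token_hex(8)    # 8 random bytes, hex-encoded
    return f"00-{trace_id}-{span_id}-01"

_TRACEPARENT_RE = re.compile(
    r"^(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<span_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$"
)

def parse_traceparent(header):
    """Return the traceparent fields as a dict, or None if malformed."""
    m = _TRACEPARENT_RE.match(header)
    return m.groupdict() if m else None

def child_header(parent_header):
    """Continue the parent's trace with a fresh span id, as an
    interceptor would before making a downstream call."""
    parent = parse_traceparent(parent_header)
    if parent is None:
        return new_traceparent()  # no valid parent: start a new trace
    return f"00-{parent['trace_id']}-{secrets.token_hex(8)}-{parent['flags']}"
```

Every hop keeps the same `trace_id` but mints a new `span_id`, which is what lets Jaeger stitch the spans into one end-to-end trace.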
Traces are integrated with Grafana, allowing you to jump from a metric spike directly to the relevant traces and then to the associated logs — all in one workflow.
Alerting
Alerting is configured through the Prometheus Alertmanager, with rules defined as PrometheusRule custom resources. Alerts can be routed to multiple notification channels, including Slack, PagerDuty, and email.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-service-alerts
spec:
  groups:
    - name: my-service
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m])) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High error rate on {{ $labels.service }}"
            description: "Error rate is above 5% for the last 5 minutes."

The Platform dashboard surfaces active alerts alongside service health, making it easy to correlate alerts with specific deployments or configuration changes.
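Routing alerts to those channels is configured on the Alertmanager side. A sketch of a route that pages Slack for critical alerts and emails everything else — the receiver names, webhook URL, and addresses are placeholders, not Platform defaults:

```yaml
route:
  receiver: default-email          # fallback for unmatched alerts
  routes:
    - matchers:
        - severity = "critical"
      receiver: slack-oncall

receivers:
  - name: slack-oncall
    slack_configs:
      - api_url: https://hooks.slack.com/services/REPLACE_ME
        channel: "#oncall"
  - name: default-email
    email_configs:
      - to: team@example.com
```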