Ray Observability: How Ray, Prometheus & Grafana Fit Together
These notes are based on Ray 2.52.1. Code links point to that tag. Things may have changed in newer versions.
Contents:
- The 3 Pieces (TLDR)
- Key Files & Ports
- Where This Breaks: Shared Clusters
- A Fix: Decouple Prometheus from Ray’s Temp Dir in NeMo Curator
The 3 Pieces (TLDR)
Ray exports metrics. Prometheus scrapes and stores them. Grafana visualizes them.
That’s it. The plumbing between them:
- The Ray head node writes PROMETHEUS_SERVICE_DISCOVERY_FILE — a list of all nodes' metrics endpoints.
- prometheus.yml references this file so Prometheus knows where to scrape.
- Grafana uses Prometheus as a data source and displays the metrics.
flowchart LR
RC["Ray Cluster (head node)"] -->|"writes every 5s"| SD["PROMETHEUS_SERVICE_DISCOVERY_FILE<br>.json"] <-->|"referenced in"| YML["prometheus.yml"]
YML -->|"configures"| P["Prometheus :9090"]
P -->|"data source"| G["Grafana :3000"]
Ray: Metrics Export
Every Ray node runs a metrics agent (part of the dashboard agent) that serves metrics over HTTP in Prometheus format.
- The port is metrics_export_port (default 8080)
- One agent per node, one port per agent
- Ray components (raylet, workers, GCS) push metrics to this agent via gRPC
- The agent translates OpenCensus/OpenTelemetry data into Prometheus gauges, counters, and histograms
4 nodes = 4 HTTP endpoints serving metrics.
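To see what one of these endpoints serves, you can curl it or fetch it from Python. A minimal sketch, assuming a reachable node IP, the default metrics_export_port of 8080, and the standard /metrics path:

# Fetch the raw Prometheus exposition text from one node's metrics agent.
# The node IP is hypothetical; 8080 is the default metrics_export_port.
import urllib.request

NODE_METRICS_URL = "http://10.0.0.1:8080/metrics"

with urllib.request.urlopen(NODE_METRICS_URL) as resp:
    text = resp.read().decode()

# Show only Ray's own metrics (the exporter uses the "ray" namespace prefix).
for line in text.splitlines():
    if line.startswith("ray_"):
        print(line)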
Code
# https://github.com/ray-project/ray/blob/ray-2.52.1/python/ray/dashboard/modules/reporter/reporter_agent.py#L448-L465
stats_exporter = prometheus_exporter.new_stats_exporter(
prometheus_exporter.Options(
namespace="ray",
port=dashboard_agent.metrics_export_port,
address="127.0.0.1" if self._ip == "127.0.0.1" else "",
)
)
- MetricsAgent — records and proxies metrics
- ReporterAgent — initializes the Prometheus exporter and MetricsAgent
- start_raylet — passes --metrics-export-port to the dashboard agent
Prometheus: Service Discovery
Prometheus needs to know where to scrape. Ray handles this via file-based service discovery.
The head node runs a PrometheusServiceDiscoveryWriter thread that:
- Calls ray.nodes() to get all alive nodes
- Collects each node's MetricsExportPort
- Writes the list to a JSON file
The file: {RAY_TEMP_DIR}/prom_metrics_service_discovery.json
Default: /tmp/ray/prom_metrics_service_discovery.json (constant)
Format:
[{
"labels": {"job": "ray"},
"targets": ["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"]
}]
Rewritten every 5 seconds. New nodes appear automatically, dead nodes get dropped.
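If metrics aren't showing up, a quick sanity check is to read the discovery file yourself and confirm the targets look right. A minimal sketch, assuming the default temp dir:

# List the scrape targets Ray is currently advertising to Prometheus.
# Assumes the default temp dir (/tmp/ray); adjust the path if you use --temp-dir.
import json
from pathlib import Path

sd_file = Path("/tmp/ray/prom_metrics_service_discovery.json")

for entry in json.loads(sd_file.read_text()):
    print(entry["labels"], "->", entry["targets"])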
Code
# https://github.com/ray-project/ray/blob/ray-2.52.1/python/ray/_private/metrics_agent.py#L792-L808
nodes = ray.nodes()
metrics_export_addresses = [
build_address(node["NodeManagerAddress"], node["MetricsExportPort"])
for node in nodes
if node["alive"] is True
]
# some code skipped
content = [{"labels": {"job": "ray"}, "targets": metrics_export_addresses}]
with self._content_lock:
self.latest_service_discovery_content = content
return json.dumps(content)
- get_file_discovery_content — builds the targets list from alive nodes
- write — atomic write to disk
- ReportHead — instantiates the writer and starts it as a daemon thread
Prometheus: Configuration
Prometheus needs a prometheus.yml that defines:
- Where to find the discovery file
- How often to scrape (default: every 10s)
global:
scrape_interval: 10s
evaluation_interval: 10s
scrape_configs:
- job_name: 'ray'
file_sd_configs:
- files:
- '/tmp/ray/prom_metrics_service_discovery.json'
Ray generates this from a template and writes it to {SESSION_DIR}/metrics/prometheus/prometheus.yml in _create_default_prometheus_configs.
Prometheus watches the discovery file for changes — no restart needed when nodes join or leave.
Watch out: Ray also ships a static prometheus.yml that hardcodes /tmp/ray/prom_metrics_service_discovery.json. If you use a custom temp dir, make sure you’re using the generated config, not the static one.
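One way to avoid grabbing the wrong file is to resolve the generated config under the current session before launching Prometheus. A minimal sketch, assuming the usual session_latest symlink inside the temp dir:

# Locate the prometheus.yml Ray generated for the current session.
# Assumes the session_latest symlink exists; change TEMP_DIR if the cluster
# was started with a custom --temp-dir.
from pathlib import Path

TEMP_DIR = Path("/tmp/ray")
generated = TEMP_DIR / "session_latest" / "metrics" / "prometheus" / "prometheus.yml"

if generated.exists():
    print(f"prometheus --config.file={generated}")
else:
    print("No generated config found; is the Ray head node running?")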
Grafana: Dashboards
Grafana points to Prometheus as a data source (http://localhost:9090) and loads pre-built dashboards.
Ray bundles several dashboard generators:
- Default — cluster-wide metrics (CPU, memory, object store, etc.)
- Data — Ray Data specific metrics
- Serve — Ray Serve metrics
- Serve Deployment — per-deployment metrics
These are JSON files that Grafana auto-provisions from a dashboards directory.
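Before debugging a blank dashboard, it's worth confirming that Prometheus is actually ingesting Ray series. A minimal sketch against the Prometheus query API, assuming the default port 9090 (the metric name is just one example of the ray_ prefix):

# Ask Prometheus whether it has any data for a Ray metric.
# Assumes Prometheus on localhost:9090; ray_node_cpu_utilization is one
# example metric name, not an exhaustive check.
import json
import urllib.parse
import urllib.request

query = urllib.parse.quote("ray_node_cpu_utilization")
url = f"http://localhost:9090/api/v1/query?query={query}"

with urllib.request.urlopen(url) as resp:
    result = json.load(resp)

print("status:", result["status"])
print("series returned:", len(result["data"]["result"]))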
Key Files & Ports
| Port / File | Who uses it | What it is |
| --- | --- | --- |
| metrics_export_port (default 8080) | Ray → Prometheus | Each node's HTTP endpoint serving metrics |
| prom_metrics_service_discovery.json ({RAY_TEMP_DIR}/prom_metrics_service_discovery.json) | Ray → Prometheus | List of all nodes' metrics endpoints, updated every 5s |
| prometheus.yml ({SESSION_DIR}/metrics/prometheus/prometheus.yml) | Prometheus | Tells Prometheus where the discovery file is |
| Prometheus web port (default 9090) | Prometheus → Grafana | Where Prometheus serves its query API |
| Grafana web port (default 3000) | Grafana → You | Where you see the dashboards |
Where This Breaks: Shared / Multi-User Clusters
The whole setup assumes one head node, one temp dir, one Prometheus instance. That falls apart on shared machines.
Problem 1: Temp Dir Collision
Default RAY_TEMP_DIR is /tmp/ray. Multiple users starting Ray clusters on the same machine means every head node writes to the same prom_metrics_service_discovery.json every 5 seconds. Last writer wins. Metrics endpoints keep appearing and disappearing.
Problem 2: Custom Temp Dir Breaks Prometheus
Setting --temp-dir /tmp/ray_alice makes Ray write the discovery file to /tmp/ray_alice/prom_metrics_service_discovery.json.
But the static prometheus.yml that ships with Ray hardcodes:
- files:
- '/tmp/ray/prom_metrics_service_discovery.json'
Prometheus reads from /tmp/ray/, Ray writes to /tmp/ray_alice/. Nothing matches. No metrics.
Ray does generate a dynamic prometheus.yml with the correct temp dir at {session_dir}/metrics/prometheus/prometheus.yml — but you need to know to point Prometheus at that one instead.
Problem 3: Port Conflicts
Default Prometheus port is 9090, Grafana is 3000. Two users starting monitoring on the same machine will collide.
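One workaround is to give each user their own ports via the standard Prometheus flag and Grafana environment override. A minimal sketch (binary names, paths, and ports are illustrative; verify the options against your installed versions):

# Launch Prometheus and Grafana on per-user ports to avoid the 9090/3000 defaults.
# Assumes prometheus and grafana-server are on PATH; grafana-server may also
# need --homepath or a config file depending on how it was installed.
import os
import subprocess

PROM_PORT = 19090
GRAFANA_PORT = 13000
PROM_CONFIG = "/tmp/ray_alice/session_latest/metrics/prometheus/prometheus.yml"

subprocess.Popen([
    "prometheus",
    f"--config.file={PROM_CONFIG}",
    f"--web.listen-address=:{PROM_PORT}",
])

subprocess.Popen(
    ["grafana-server"],
    env={**os.environ, "GF_SERVER_HTTP_PORT": str(GRAFANA_PORT)},
)

print(f"Prometheus: http://localhost:{PROM_PORT}  Grafana: http://localhost:{GRAFANA_PORT}")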
A Fix: Decouple Prometheus from Ray’s Temp Dir in NeMo Curator
The approach we take in NeMo Curator #1523:
- Start Prometheus with an empty file_sd_configs: [] and --web.enable-lifecycle
- When a Ray cluster starts, dynamically append its discovery file path to the Prometheus config
- Hot-reload Prometheus via POST /-/reload
- On shutdown, remove the entry and reload again
This avoids hardcoded paths — Prometheus always points at the right discovery file regardless of the temp dir.
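A minimal sketch of that flow (config path, helper name, and port are illustrative, not the actual NeMo Curator code):

# Register a Ray cluster's discovery file with a running Prometheus, then hot-reload.
# Assumes Prometheus was started with --web.enable-lifecycle and a 'ray' scrape job;
# requires PyYAML.
import urllib.request
import yaml

PROM_CONFIG = "/etc/prometheus/prometheus.yml"      # hypothetical config path
PROM_RELOAD_URL = "http://localhost:9090/-/reload"  # assumes default port

def register_discovery_file(discovery_file: str) -> None:
    with open(PROM_CONFIG) as f:
        config = yaml.safe_load(f)

    # Append this cluster's discovery file to the 'ray' scrape job.
    for job in config["scrape_configs"]:
        if job["job_name"] == "ray":
            sd = job.get("file_sd_configs") or []
            sd.append({"files": [discovery_file]})
            job["file_sd_configs"] = sd

    with open(PROM_CONFIG, "w") as f:
        yaml.safe_dump(config, f)

    # Tell Prometheus to re-read its config (needs --web.enable-lifecycle).
    urllib.request.urlopen(PROM_RELOAD_URL, data=b"")  # POST /-/reload

register_discovery_file("/tmp/ray_alice/prom_metrics_service_discovery.json")

On shutdown, the same idea runs in reverse: remove the path from file_sd_configs and POST /-/reload again.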