Building a robust monitoring infrastructure is crucial for maintaining healthy systems and applications. Prometheus and Grafana form a powerful combination for metrics collection, storage, and visualization. This comprehensive guide walks you through deploying a complete monitoring stack on Rocky Linux, from basic setup to advanced configurations and custom dashboards.
Table of Contents
- Understanding Prometheus and Grafana
- Architecture Overview
- Installing Prometheus
- Installing Grafana
- Configuring Prometheus
- Setting Up Exporters
- Integrating Prometheus with Grafana
- Creating Dashboards
- Alerting Configuration
- Service Discovery
- Security Hardening
- Performance Optimization
- High Availability Setup
- Troubleshooting
- Best Practices
Understanding Prometheus and Grafana
What is Prometheus?
Prometheus is an open-source monitoring and alerting toolkit designed for reliability and scalability. Key features include:
- Pull-based metrics collection: Prometheus scrapes metrics from configured targets
- Time-series database: Efficient storage of metrics with timestamps
- Powerful query language: PromQL for data retrieval and analysis
- Service discovery: Automatic discovery of monitoring targets
- Built-in alerting: Alert rules and integration with Alertmanager
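To make the "powerful query language" bullet concrete: PromQL's `rate()` turns monotonically increasing counters into per-second rates. The sketch below is a simplified illustration of that computation (real Prometheus also extrapolates to the window boundaries and handles counter resets), not the actual implementation:

```python
# Simplified illustration of what PromQL's rate() computes over a counter.
# Real Prometheus extrapolates to the window edges and handles counter
# resets; this sketch only shows the core per-second-rate idea.

def simple_rate(samples):
    """samples: list of (timestamp_seconds, counter_value), oldest first."""
    (t0, v0), (tn, vn) = samples[0], samples[-1]
    increase = vn - v0           # assumes no counter reset in the window
    return increase / (tn - t0)  # per-second rate

# Three scrapes at the default 15s interval; the counter grew by 60 in 30s.
samples = [(0, 100), (15, 130), (30, 160)]
print(simple_rate(samples))  # 2.0 per second
```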
What is Grafana?
Grafana is an open-source, multi-platform analytics and visualization application that works with many data sources:
- Beautiful dashboards: Create stunning visualizations of your metrics
- Multiple data sources: Supports Prometheus, InfluxDB, Elasticsearch, and more
- Alerting: Visual alert rules with multiple notification channels
- User management: Role-based access control and teams
- Plugin ecosystem: Extend functionality with community plugins
Why Use Them Together?
| Feature | Prometheus | Grafana | Combined Benefit | 
|---|---|---|---|
| Data Collection | ✓ | ✗ | Reliable metrics gathering | 
| Storage | ✓ | ✗ | Efficient time-series storage | 
| Visualization | Basic | ✓ | Professional dashboards | 
| Alerting | Rule-based | Visual | Comprehensive alerting | 
| User Interface | Minimal | Rich | User-friendly monitoring | 
Architecture Overview
Component Architecture
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   Applications  │────▶│   Exporters     │◀────│   Prometheus    │
└─────────────────┘     └─────────────────┘     └─────────────────┘
                                                          │
                                                          ▼
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   Alertmanager  │◀────│   Alert Rules   │     │     Grafana     │
└─────────────────┘     └─────────────────┘     └─────────────────┘
Network Ports
- Prometheus: 9090 (web UI and API)
- Grafana: 3000 (web UI)
- Node Exporter: 9100
- Alertmanager: 9093
- Pushgateway: 9091
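Once the stack is running, each component's port can be verified with a quick TCP probe. This sketch assumes a single-host install reachable on 127.0.0.1; the port map mirrors the list above:

```python
# Quick reachability check for the stack's default ports (sketch; assumes
# everything runs on one host and listens on 127.0.0.1).
import socket

PORTS = {
    "prometheus": 9090,
    "grafana": 3000,
    "node_exporter": 9100,
    "alertmanager": 9093,
    "pushgateway": 9091,
}

def is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for name, port in PORTS.items():
    state = "up" if is_listening("127.0.0.1", port) else "down"
    print(f"{name:>13} :{port} -> {state}")
```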
Installing Prometheus
Prerequisites
# Update system
sudo dnf update -y
# Install dependencies
sudo dnf install -y wget curl tar
# Create prometheus user
sudo useradd --no-create-home --shell /bin/false prometheus
# Create directories
sudo mkdir -p /etc/prometheus
sudo mkdir -p /var/lib/prometheus
sudo mkdir -p /var/log/prometheus
Download and Install Prometheus
# Set version
PROMETHEUS_VERSION="2.45.0"
# Download Prometheus
cd /tmp
wget https://github.com/prometheus/prometheus/releases/download/v${PROMETHEUS_VERSION}/prometheus-${PROMETHEUS_VERSION}.linux-amd64.tar.gz
# Extract archive
tar xvf prometheus-${PROMETHEUS_VERSION}.linux-amd64.tar.gz
# Copy binaries
sudo cp prometheus-${PROMETHEUS_VERSION}.linux-amd64/prometheus /usr/local/bin/
sudo cp prometheus-${PROMETHEUS_VERSION}.linux-amd64/promtool /usr/local/bin/
# Copy configuration files
sudo cp -r prometheus-${PROMETHEUS_VERSION}.linux-amd64/consoles /etc/prometheus
sudo cp -r prometheus-${PROMETHEUS_VERSION}.linux-amd64/console_libraries /etc/prometheus
# Set ownership
sudo chown -R prometheus:prometheus /etc/prometheus
sudo chown -R prometheus:prometheus /var/lib/prometheus
sudo chown -R prometheus:prometheus /var/log/prometheus
sudo chown prometheus:prometheus /usr/local/bin/prometheus
sudo chown prometheus:prometheus /usr/local/bin/promtool
Create Prometheus Configuration
# Create basic configuration
sudo nano /etc/prometheus/prometheus.yml
# Global configuration
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  scrape_timeout: 10s
  external_labels:
    monitor: 'prometheus-stack'
    environment: 'production'
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - localhost:9093
# Load rules once and periodically evaluate them
rule_files:
  - "rules/*.yml"
# Scrape configurations
scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
        labels:
          instance: 'prometheus-server'
  # Node Exporter
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']
        labels:
          instance: 'prometheus-node'
  # Grafana
  - job_name: 'grafana'
    static_configs:
      - targets: ['localhost:3000']
Create Systemd Service
sudo nano /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus Monitoring System
Documentation=https://prometheus.io/docs/introduction/overview/
Wants=network-online.target
After=network-online.target
[Service]
Type=notify
User=prometheus
Group=prometheus
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus/ \
  --storage.tsdb.retention.time=30d \
  --storage.tsdb.retention.size=10GB \
  --web.console.templates=/etc/prometheus/consoles \
  --web.console.libraries=/etc/prometheus/console_libraries \
  --web.enable-lifecycle \
  --web.enable-admin-api \
  --log.level=info \
  --log.format=logfmt
Restart=always
RestartSec=5
StandardOutput=journal
StandardError=journal
SyslogIdentifier=prometheus
KillMode=mixed
KillSignal=SIGTERM
[Install]
WantedBy=multi-user.target
Start Prometheus
# Reload systemd
sudo systemctl daemon-reload
# Enable and start Prometheus
sudo systemctl enable prometheus
sudo systemctl start prometheus
# Check status
sudo systemctl status prometheus
# Check logs
sudo journalctl -u prometheus -f
# Verify installation
curl http://localhost:9090/metrics
Installing Grafana
Method 1: Install from Repository
# Add Grafana repository
sudo nano /etc/yum.repos.d/grafana.repo
[grafana]
name=grafana
baseurl=https://rpm.grafana.com
repo_gpgcheck=1
enabled=1
gpgcheck=1
gpgkey=https://rpm.grafana.com/gpg.key
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt
# Install Grafana
sudo dnf install -y grafana
# Enable and start Grafana
sudo systemctl enable grafana-server
sudo systemctl start grafana-server
# Check status
sudo systemctl status grafana-server
Method 2: Install from Binary
# Set version
GRAFANA_VERSION="10.0.3"
# Download Grafana
cd /tmp
wget https://dl.grafana.com/oss/release/grafana-${GRAFANA_VERSION}.linux-amd64.tar.gz
# Extract and install
tar -zxvf grafana-${GRAFANA_VERSION}.linux-amd64.tar.gz
sudo mv grafana-${GRAFANA_VERSION} /opt/grafana
# Create user
sudo useradd --no-create-home --shell /bin/false grafana
# Set permissions
sudo chown -R grafana:grafana /opt/grafana
# Create systemd service
sudo nano /etc/systemd/system/grafana.service
[Unit]
Description=Grafana
Documentation=http://docs.grafana.org
Wants=network-online.target
After=network-online.target
[Service]
Type=notify
User=grafana
Group=grafana
ExecStart=/opt/grafana/bin/grafana-server \
  --config=/opt/grafana/conf/defaults.ini \
  --homepath=/opt/grafana
Restart=on-failure
RestartSec=10
[Install]
WantedBy=multi-user.target
Configure Grafana
# Edit Grafana configuration
sudo nano /etc/grafana/grafana.ini
[server]
protocol = http
http_addr = 0.0.0.0
http_port = 3000
domain = monitoring.example.com
root_url = %(protocol)s://%(domain)s:%(http_port)s/
serve_from_sub_path = false
[security]
admin_user = admin
admin_password = StrongAdminPassword123!
secret_key = SW2YcwTIb9zpOOhoPsMm
disable_gravatar = true
cookie_secure = false
cookie_samesite = lax
allow_embedding = false
[users]
allow_sign_up = false
allow_org_create = false
auto_assign_org = true
auto_assign_org_role = Viewer
[auth.anonymous]
enabled = false
[auth.basic]
enabled = true
[database]
type = sqlite3
path = grafana.db
[session]
provider = file
provider_config = sessions
[analytics]
reporting_enabled = false
check_for_updates = false
[log]
mode = console file
level = info
filters = 
[alerting]
enabled = true
execute_alerts = true
Configuring Prometheus
Advanced Configuration
# /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  query_log_file: /var/log/prometheus/query.log
  external_labels:
    cluster: 'production'
    replica: '1'
# Remote write configuration (optional)
remote_write:
  - url: 'http://remote-storage:9201/write'
    queue_config:
      capacity: 10000
      max_shards: 5
      max_samples_per_send: 1000
# Alerting configuration
alerting:
  alertmanagers:
    - scheme: http
      static_configs:
        - targets:
          - 'alertmanager:9093'
      timeout: 10s
# Rule files
rule_files:
  - "/etc/prometheus/rules/alerts.yml"
  - "/etc/prometheus/rules/recording.yml"
# Scrape configurations
scrape_configs:
  # Service discovery for node exporters
  - job_name: 'node-exporter'
    consul_sd_configs:
      - server: 'consul:8500'
        services: ['node-exporter']
    relabel_configs:
      - source_labels: [__meta_consul_service]
        target_label: job
      - source_labels: [__meta_consul_node]
        target_label: instance
      - source_labels: [__meta_consul_tags]
        regex: '.*,production,.*'
        action: keep
  # Kubernetes service discovery
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
  # File-based service discovery
  - job_name: 'file-sd'
    file_sd_configs:
      - files:
        - '/etc/prometheus/file_sd/*.json'
        refresh_interval: 5m
  # Static targets with different intervals
  - job_name: 'high-frequency'
    scrape_interval: 5s
    static_configs:
      - targets: ['app1:8080', 'app2:8080']
        labels:
          env: 'production'
          team: 'backend'
  # Blackbox exporter for endpoint monitoring
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
        - https://example.com
        - https://api.example.com
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
Create Alert Rules
# Create rules directory
sudo mkdir -p /etc/prometheus/rules
sudo chown -R prometheus:prometheus /etc/prometheus/rules
# Create alert rules
sudo nano /etc/prometheus/rules/alerts.yml
groups:
  - name: node_alerts
    interval: 30s
    rules:
      # Node down
      - alert: NodeDown
        expr: up{job="node"} == 0
        for: 5m
        labels:
          severity: critical
          team: infrastructure
        annotations:
          summary: "Node {{ $labels.instance }} is down"
          description: "Node {{ $labels.instance }} has been down for more than 5 minutes."
      # High CPU usage
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 80% (current value: {{ $value }}%)"
      # High memory usage
      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 85% (current value: {{ $value }}%)"
      # Disk space low
      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 15
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk space is below 15% (current value: {{ $value }}%)"
      # High load average
      - alert: HighLoadAverage
        expr: node_load1 > (count by(instance)(node_cpu_seconds_total{mode="idle"})) * 2
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "High load average on {{ $labels.instance }}"
          description: "Load average is high (current value: {{ $value }})"
  - name: prometheus_alerts
    rules:
      # Prometheus target down
      - alert: PrometheusTargetDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Prometheus target {{ $labels.job }}/{{ $labels.instance }} is down"
          description: "Target has been down for more than 5 minutes."
      # Too many scrape errors
      - alert: PrometheusScrapingError
        expr: rate(prometheus_target_scrapes_sample_duplicate_timestamp_total[5m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus scraping error"
          description: "Prometheus has scraping errors for {{ $labels.job }}/{{ $labels.instance }}"
      # Prometheus config reload failed
      - alert: PrometheusConfigReloadFailed
        expr: prometheus_config_last_reload_successful != 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Prometheus configuration reload failed"
          description: "Prometheus configuration reload has failed"
Create Recording Rules
sudo nano /etc/prometheus/rules/recording.yml
groups:
  - name: node_recording
    interval: 30s
    rules:
      # CPU usage percentage
      - record: instance:node_cpu_utilisation:rate5m
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
      # Memory usage percentage
      - record: instance:node_memory_utilisation:percentage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
      # Disk usage percentage
      - record: instance:node_filesystem_utilisation:percentage
        expr: (1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100
      # Network receive bandwidth
      - record: instance:node_network_receive_bytes:rate5m
        expr: sum by(instance) (rate(node_network_receive_bytes_total[5m]))
      # Network transmit bandwidth
      - record: instance:node_network_transmit_bytes:rate5m
        expr: sum by(instance) (rate(node_network_transmit_bytes_total[5m]))
  - name: aggregated_metrics
    interval: 60s
    rules:
      # Average CPU across all nodes
      - record: job:node_cpu_utilisation:avg
        expr: avg(instance:node_cpu_utilisation:rate5m)
      # Total memory usage
      - record: job:node_memory_bytes:sum
        expr: sum(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
Setting Up Exporters
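All of the exporters below follow the same pattern: a small HTTP server exposing `/metrics` in the Prometheus text exposition format. A minimal hand-rolled version (stdlib-only sketch; the metric names and port are illustrative, not a real exporter's) shows what Prometheus actually scrapes:

```python
# Minimal hand-rolled "exporter" (sketch). Every exporter in this section is
# an HTTP server returning metrics in the Prometheus text exposition format;
# the metric names and port 9200 here are illustrative only.
from http.server import BaseHTTPRequestHandler, HTTPServer
import time

START = time.time()

def render_metrics():
    """Render current values in the text exposition format."""
    uptime = time.time() - START
    return (
        "# HELP demo_uptime_seconds Seconds since the exporter started.\n"
        "# TYPE demo_uptime_seconds gauge\n"
        f"demo_uptime_seconds {uptime:.3f}\n"
        "# HELP demo_requests_total Scrapes served.\n"
        "# TYPE demo_requests_total counter\n"
        f"demo_requests_total {Handler.hits}\n"
    )

class Handler(BaseHTTPRequestHandler):
    hits = 0
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        Handler.hits += 1
        body = render_metrics().encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Scrape at http://localhost:9200/metrics
    HTTPServer(("", 9200), Handler).serve_forever()
```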
Node Exporter
# Download Node Exporter
NODE_EXPORTER_VERSION="1.6.1"
cd /tmp
wget https://github.com/prometheus/node_exporter/releases/download/v${NODE_EXPORTER_VERSION}/node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz
# Extract and install
tar xvf node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz
sudo cp node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64/node_exporter /usr/local/bin/
# Create user
sudo useradd --no-create-home --shell /bin/false node_exporter
# Set permissions
sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter
# Create systemd service
sudo nano /etc/systemd/system/node_exporter.service
[Unit]
Description=Node Exporter
Documentation=https://github.com/prometheus/node_exporter
Wants=network-online.target
After=network-online.target
[Service]
Type=simple
User=node_exporter
Group=node_exporter
ExecStart=/usr/local/bin/node_exporter \
  --collector.systemd \
  --collector.processes \
  --collector.tcpstat \
  --collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/) \
  --collector.netclass.ignored-devices=^(veth.*)$$ \
  --web.listen-address=:9100 \
  --web.telemetry-path=/metrics
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target
# Enable and start
sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter
Blackbox Exporter
# Download Blackbox Exporter
BLACKBOX_VERSION="0.24.0"
cd /tmp
wget https://github.com/prometheus/blackbox_exporter/releases/download/v${BLACKBOX_VERSION}/blackbox_exporter-${BLACKBOX_VERSION}.linux-amd64.tar.gz
# Extract and install
tar xvf blackbox_exporter-${BLACKBOX_VERSION}.linux-amd64.tar.gz
sudo cp blackbox_exporter-${BLACKBOX_VERSION}.linux-amd64/blackbox_exporter /usr/local/bin/
# Create configuration
sudo mkdir -p /etc/blackbox_exporter
sudo nano /etc/blackbox_exporter/blackbox.yml
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: []  # Defaults to 2xx
      method: GET
      follow_redirects: true
      preferred_ip_protocol: "ip4"
  http_post_2xx:
    prober: http
    timeout: 5s
    http:
      method: POST
      headers:
        Content-Type: application/json
      body: '{"test": "data"}'
  tcp_connect:
    prober: tcp
    timeout: 5s
  icmp:
    prober: icmp
    timeout: 5s
    icmp:
      preferred_ip_protocol: "ip4"
  dns_tcp:
    prober: dns
    timeout: 5s
    dns:
      query_name: "example.com"
      query_type: "A"
      transport_protocol: "tcp"
MySQL/MariaDB Exporter
# Download MySQL Exporter
MYSQL_EXPORTER_VERSION="0.15.0"
cd /tmp
wget https://github.com/prometheus/mysqld_exporter/releases/download/v${MYSQL_EXPORTER_VERSION}/mysqld_exporter-${MYSQL_EXPORTER_VERSION}.linux-amd64.tar.gz
# Extract and install
tar xvf mysqld_exporter-${MYSQL_EXPORTER_VERSION}.linux-amd64.tar.gz
sudo cp mysqld_exporter-${MYSQL_EXPORTER_VERSION}.linux-amd64/mysqld_exporter /usr/local/bin/
# Create MySQL user for exporter
mysql -u root -p << EOF
CREATE USER 'exporter'@'localhost' IDENTIFIED BY 'ExporterPassword123!' WITH MAX_USER_CONNECTIONS 3;
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'localhost';
FLUSH PRIVILEGES;
EOF
# Create credentials file
sudo mkdir -p /etc/mysqld_exporter
sudo nano /etc/mysqld_exporter/.my.cnf
[client]
host=localhost
port=3306
user=exporter
password=ExporterPassword123!
# Set permissions
sudo chmod 600 /etc/mysqld_exporter/.my.cnf
sudo chown prometheus:prometheus /etc/mysqld_exporter/.my.cnf
Custom Application Metrics
# Example Python application with Prometheus metrics
from prometheus_client import Counter, Histogram, Gauge, start_http_server
import time
import random
# Define metrics
request_count = Counter('app_requests_total', 'Total number of requests', ['method', 'endpoint'])
request_duration = Histogram('app_request_duration_seconds', 'Request duration', ['method', 'endpoint'])
active_users = Gauge('app_active_users', 'Number of active users')
# Expose metrics
start_http_server(8000)
# Application logic
while True:
    # Simulate requests
    method = random.choice(['GET', 'POST'])
    endpoint = random.choice(['/api/users', '/api/products', '/api/orders'])
    
    with request_duration.labels(method=method, endpoint=endpoint).time():
        # Simulate processing time
        time.sleep(random.random())
        
    request_count.labels(method=method, endpoint=endpoint).inc()
    active_users.set(random.randint(50, 200))
    
    time.sleep(1)
Integrating Prometheus with Grafana
Add Prometheus Data Source
# Using Grafana API
curl -X POST http://admin:StrongAdminPassword123!@localhost:3000/api/datasources \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Prometheus",
    "type": "prometheus",
    "url": "http://localhost:9090",
    "access": "proxy",
    "isDefault": true,
    "jsonData": {
      "timeInterval": "15s",
      "queryTimeout": "60s",
      "httpMethod": "POST"
    }
  }'
Configure Data Source in UI
- Navigate to Configuration → Data Sources
- Click “Add data source”
- Select “Prometheus”
- Configure:
  - URL: http://localhost:9090
  - Access: Server (default)
  - Scrape interval: 15s
  - Query timeout: 60s
  - HTTP Method: POST
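Both the API call and the UI form produce the same underlying data source definition; keeping it as code makes it reproducible. A sketch, using the placeholder URL from this guide:

```python
# Build the Grafana data source payload programmatically (sketch; the URL is
# the single-host placeholder used throughout this guide).
import json

def prometheus_datasource(url="http://localhost:9090", default=True):
    """Payload for POST /api/datasources on Grafana."""
    return {
        "name": "Prometheus",
        "type": "prometheus",
        "url": url,
        "access": "proxy",  # the Grafana server proxies queries to Prometheus
        "isDefault": default,
        "jsonData": {
            "timeInterval": "15s",  # should match Prometheus' scrape_interval
            "queryTimeout": "60s",
            "httpMethod": "POST",
        },
    }

print(json.dumps(prometheus_datasource(), indent=2))
```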
Creating Dashboards
Import Community Dashboards
# Popular dashboard IDs:
# 1860 - Node Exporter Full
# 7362 - MySQL Overview
# 3662 - Prometheus 2.0 Overview
# 11074 - Node Exporter for Prometheus
# Import via API
curl -X POST http://admin:StrongAdminPassword123!@localhost:3000/api/dashboards/import \
  -H "Content-Type: application/json" \
  -d '{
    "dashboard": {
      "id": 1860,
      "uid": null,
      "title": "Node Exporter Full"
    },
    "overwrite": true,
    "inputs": [{
      "name": "DS_PROMETHEUS",
      "type": "datasource",
      "pluginId": "prometheus",
      "value": "Prometheus"
    }]
  }'
Create Custom Dashboard
{
  "dashboard": {
    "title": "System Metrics Overview",
    "panels": [
      {
        "title": "CPU Usage",
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
        "type": "graph",
        "targets": [{
          "expr": "100 - (avg by(instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
          "legendFormat": "{{instance}}",
          "refId": "A"
        }],
        "yaxes": [{
          "format": "percent",
          "min": 0,
          "max": 100
        }]
      },
      {
        "title": "Memory Usage",
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
        "type": "graph",
        "targets": [{
          "expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100",
          "legendFormat": "{{instance}}",
          "refId": "A"
        }],
        "yaxes": [{
          "format": "percent",
          "min": 0,
          "max": 100
        }]
      },
      {
        "title": "Disk I/O",
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 8},
        "type": "graph",
        "targets": [
          {
            "expr": "rate(node_disk_read_bytes_total[5m])",
            "legendFormat": "{{instance}} - Read",
            "refId": "A"
          },
          {
            "expr": "rate(node_disk_written_bytes_total[5m])",
            "legendFormat": "{{instance}} - Write",
            "refId": "B"
          }
        ],
        "yaxes": [{
          "format": "Bps"
        }]
      },
      {
        "title": "Network Traffic",
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 8},
        "type": "graph",
        "targets": [
          {
            "expr": "rate(node_network_receive_bytes_total{device!~\"lo\"}[5m])",
            "legendFormat": "{{instance}} - {{device}} RX",
            "refId": "A"
          },
          {
            "expr": "rate(node_network_transmit_bytes_total{device!~\"lo\"}[5m])",
            "legendFormat": "{{instance}} - {{device}} TX",
            "refId": "B"
          }
        ],
        "yaxes": [{
          "format": "Bps"
        }]
      }
    ],
    "time": {"from": "now-1h", "to": "now"},
    "refresh": "10s"
  }
}
Dashboard Best Practices
- Organization
  - Use folders for different environments
  - Consistent naming conventions
  - Version control dashboard JSON
- Design
  - Group related metrics
  - Use appropriate visualization types
  - Consistent color schemes
  - Meaningful panel titles
- Performance
  - Limit time ranges
  - Use recording rules for complex queries
  - Avoid too many panels per dashboard
  - Set appropriate refresh intervals
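Following the version-control advice, the repetitive per-panel JSON can be generated rather than hand-edited. A sketch of a small panel factory (the PromQL expressions are the same ones used in the custom dashboard above):

```python
# Dashboards-as-code (sketch): generate the repetitive panel JSON from a
# factory, then POST the result to the Grafana dashboard API.
import json

def graph_panel(title, exprs, x, y, unit="percent", w=12, h=8):
    """One graph panel; exprs is a list of (promql, legend) pairs."""
    return {
        "title": title,
        "type": "graph",
        "gridPos": {"h": h, "w": w, "x": x, "y": y},
        "targets": [
            {"expr": expr, "legendFormat": legend, "refId": chr(ord("A") + i)}
            for i, (expr, legend) in enumerate(exprs)
        ],
        "yaxes": [{"format": unit}],
    }

dashboard = {
    "dashboard": {
        "title": "System Metrics Overview",
        "panels": [
            graph_panel(
                "CPU Usage",
                [('100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)',
                  "{{instance}}")],
                x=0, y=0),
            graph_panel(
                "Network Traffic",
                [('rate(node_network_receive_bytes_total{device!~"lo"}[5m])', "{{instance}} RX"),
                 ('rate(node_network_transmit_bytes_total{device!~"lo"}[5m])', "{{instance}} TX")],
                x=12, y=0, unit="Bps"),
        ],
        "refresh": "10s",
    },
    "overwrite": True,
}
print(json.dumps(dashboard, indent=2))
```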
Alerting Configuration
Alertmanager Setup
# Download Alertmanager
ALERTMANAGER_VERSION="0.26.0"
cd /tmp
wget https://github.com/prometheus/alertmanager/releases/download/v${ALERTMANAGER_VERSION}/alertmanager-${ALERTMANAGER_VERSION}.linux-amd64.tar.gz
# Extract and install
tar xvf alertmanager-${ALERTMANAGER_VERSION}.linux-amd64.tar.gz
sudo cp alertmanager-${ALERTMANAGER_VERSION}.linux-amd64/alertmanager /usr/local/bin/
sudo cp alertmanager-${ALERTMANAGER_VERSION}.linux-amd64/amtool /usr/local/bin/
# Create directories
sudo mkdir -p /etc/alertmanager
sudo mkdir -p /var/lib/alertmanager
# Create configuration
sudo nano /etc/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_from: '[email protected]'
  smtp_smarthost: 'smtp.example.com:587'
  smtp_auth_username: '[email protected]'
  smtp_auth_password: 'password'
  smtp_require_tls: true
templates:
  - '/etc/alertmanager/templates/*.tmpl'
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'team-ops'
  
  routes:
    - match:
        severity: critical
      receiver: 'team-ops-critical'
      continue: true
      
    - match:
        team: database
      receiver: 'team-database'
      
    - match_re:
        service: ^(frontend|backend)$
      receiver: 'team-dev'
receivers:
  - name: 'team-ops'
    email_configs:
      - to: '[email protected]'
        headers:
          Subject: 'Prometheus Alert: {{ .GroupLabels.alertname }}'
    
  - name: 'team-ops-critical'
    email_configs:
      - to: '[email protected]'
    pagerduty_configs:
      - service_key: 'your-pagerduty-service-key'
        
  - name: 'team-database'
    email_configs:
      - to: '[email protected]'
    slack_configs:
      - api_url: 'YOUR_SLACK_WEBHOOK_URL'
        channel: '#database-alerts'
        
  - name: 'team-dev'
    webhook_configs:
      - url: 'http://webhook.example.com/prometheus'
        send_resolved: true
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster', 'service']
Create Alert Templates
sudo mkdir -p /etc/alertmanager/templates
sudo nano /etc/alertmanager/templates/custom.tmpl
{{ define "custom.title" }}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .GroupLabels.SortedPairs.Values | join " " }}
{{ end }}
{{ define "custom.text" }}
{{ range .Alerts }}
*Alert:* {{ .Annotations.summary }}
*Description:* {{ .Annotations.description }}
*Severity:* {{ .Labels.severity }}
*Instance:* {{ .Labels.instance }}
*Value:* {{ .Value }}
*Started:* {{ .StartsAt.Format "2006-01-02 15:04:05" }}
{{ end }}
{{ end }}
{{ define "custom.slack.text" }}
{{ range .Alerts }}
:{{ if eq .Status "firing" }}red_circle{{ else }}green_circle{{ end }}: *{{ .Annotations.summary }}*
{{ .Annotations.description }}
*Severity:* `{{ .Labels.severity }}`
*Instance:* `{{ .Labels.instance }}`
*Value:* `{{ .Value }}`
{{ end }}
{{ end }}
Grafana Alerting
# Configure Grafana alerting
sudo nano /etc/grafana/grafana.ini
[unified_alerting]
enabled = true
execute_alerts = true
evaluation_timeout = 30s
notification_timeout = 30s
max_attempts = 3
min_interval = 10s
[unified_alerting.screenshots]
capture = true
capture_timeout = 10s
max_concurrent_screenshots = 5
upload_external_image_storage = false
Create Grafana Alert Rules
{
  "uid": "cpu-alert",
  "title": "High CPU Usage Alert",
  "condition": "A",
  "data": [
    {
      "refId": "A",
      "queryType": "",
      "model": {
        "expr": "100 - (avg by(instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
        "refId": "A"
      },
      "datasourceUid": "prometheus-uid",
      "conditions": [
        {
          "evaluator": {
            "params": [80],
            "type": "gt"
          },
          "operator": {
            "type": "and"
          },
          "query": {
            "params": ["A"]
          },
          "reducer": {
            "params": [],
            "type": "avg"
          },
          "type": "query"
        }
      ],
      "reducer": "last",
      "expression": "A"
    }
  ],
  "noDataState": "NoData",
  "execErrState": "Alerting",
  "for": "5m",
  "annotations": {
    "description": "CPU usage is above 80% on {{ $labels.instance }}",
    "runbook_url": "https://wiki.example.com/runbooks/cpu-high",
    "summary": "High CPU usage detected"
  },
  "labels": {
    "severity": "warning",
    "team": "ops"
  }
}
Service Discovery
Consul Integration
# Prometheus configuration for Consul
scrape_configs:
  - job_name: 'consul-services'
    consul_sd_configs:
      - server: 'consul.example.com:8500'
        token: 'your-consul-token'
        datacenter: 'dc1'
        tag_separator: ','
        scheme: 'http'
        services: []  # All services
        
    relabel_configs:
      # Keep only services with 'prometheus' tag
      - source_labels: [__meta_consul_tags]
        regex: '.*,prometheus,.*'
        action: keep
        
      # Use service name as job label
      - source_labels: [__meta_consul_service]
        target_label: job
        
      # Use node name as instance label
      - source_labels: [__meta_consul_node]
        target_label: instance
        
      # Extract custom metrics path from tags
      - source_labels: [__meta_consul_tags]
        regex: '.*,metrics_path=([^,]+),.*'
        target_label: __metrics_path__
        replacement: '${1}'
Kubernetes Service Discovery
# Kubernetes pods discovery
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - default
            - production
            
    relabel_configs:
      # Only scrape pods with prometheus annotations
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
        
      # Use custom port if specified
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        
      # Use custom path if specified
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
        
      # Add kubernetes labels
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
        
      # Add namespace
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
        
      # Add pod name
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name
File-based Service Discovery
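File-based discovery pairs well with configuration management: rather than hand-editing target files, you can generate them from an inventory. A sketch (the hostnames, labels, and output path are placeholders):

```python
# Generate a Prometheus file_sd target file from an inventory (sketch; the
# hosts, labels, and /tmp output path are placeholders).
import json

INVENTORY = [
    {"host": "web1.example.com", "env": "production", "dc": "us-east-1"},
    {"host": "web2.example.com", "env": "production", "dc": "us-east-1"},
    {"host": "web3.example.com", "env": "staging",    "dc": "us-west-2"},
]

def file_sd(inventory, port=9100, role="webserver"):
    """Group hosts by (env, dc) into file_sd target groups."""
    groups = {}
    for h in inventory:
        groups.setdefault((h["env"], h["dc"]), []).append(f'{h["host"]}:{port}')
    return [
        {"targets": targets,
         "labels": {"env": env, "role": role, "datacenter": dc}}
        for (env, dc), targets in sorted(groups.items())
    ]

# Prometheus re-reads the file on its own schedule (refresh_interval).
with open("/tmp/webservers.json", "w") as f:
    json.dump(file_sd(INVENTORY), f, indent=2)
```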
# Create file SD directory
sudo mkdir -p /etc/prometheus/file_sd
# Example targets file
sudo nano /etc/prometheus/file_sd/webservers.json
[
  {
    "targets": ["web1.example.com:9100", "web2.example.com:9100"],
    "labels": {
      "env": "production",
      "role": "webserver",
      "datacenter": "us-east-1"
    }
  },
  {
    "targets": ["web3.example.com:9100"],
    "labels": {
      "env": "staging",
      "role": "webserver",
      "datacenter": "us-west-2"
    }
  }
]
DNS Service Discovery
scrape_configs:
  - job_name: 'dns-srv-records'
    dns_sd_configs:
      - names:
          - '_prometheus._tcp.example.com'
        type: 'SRV'
        refresh_interval: 30s
        
    relabel_configs:
      - source_labels: [__meta_dns_name]
        target_label: instance
        regex: '([^.]+)\..*'
        replacement: '${1}'
Security Hardening
Prometheus Security
# Enable basic authentication
sudo dnf install -y httpd-tools
# Generate password hash
htpasswd -nBC 10 "" | tr -d ':\n'
# Configure Prometheus
sudo nano /etc/prometheus/web.yml
basic_auth_users:
  admin: $2y$10$V2RmZ2wKC7cPiE3o/h3gLuUBUPGIM2Qm0x0W8X0gAB3sLNkVE3tEq
  prometheus: $2y$10$93m/Gk5HzNxwGqDG3zSJxuYCKNneOU5W.AXFyiKJhDRIAHsQBGtFa
tls_server_config:
  cert_file: /etc/prometheus/prometheus.crt
  key_file: /etc/prometheus/prometheus.key
  client_auth_type: RequireAndVerifyClientCert
  client_ca_file: /etc/prometheus/ca.crt
# Validate the web config
promtool check web-config /etc/prometheus/web.yml
# Update Prometheus service
sudo nano /etc/systemd/system/prometheus.service
# Add to ExecStart:
--web.config.file=/etc/prometheus/web.yml
# Restart Prometheus
sudo systemctl daemon-reload
sudo systemctl restart prometheus
Grafana Security
# /etc/grafana/grafana.ini
[security]
admin_user = admin
admin_password = StrongAdminPassword123!
secret_key = SW2YcwTIb9zpOOhoPsMm
disable_gravatar = true
cookie_secure = true
cookie_samesite = strict
strict_transport_security = true
strict_transport_security_max_age_seconds = 86400
strict_transport_security_preload = true
strict_transport_security_subdomains = true
x_content_type_options = true
x_xss_protection = true
content_security_policy = true
[auth]
disable_login_form = false
disable_signout_menu = false
oauth_auto_login = false
[auth.anonymous]
enabled = false
[auth.ldap]
enabled = true
config_file = /etc/grafana/ldap.toml
allow_sign_up = true
[auth.proxy]
enabled = false
[users]
allow_sign_up = false
allow_org_create = false
auto_assign_org = true
auto_assign_org_role = Viewer
viewers_can_edit = false
editors_can_admin = false
[database]
ssl_mode = require
ca_cert_path = /etc/grafana/ca.crt
client_key_path = /etc/grafana/client.key
client_cert_path = /etc/grafana/client.crt
server_cert_name = grafana.example.com
Network Security
# Configure firewall
sudo firewall-cmd --permanent --add-service=http
sudo firewall-cmd --permanent --add-service=https
sudo firewall-cmd --permanent --add-port=9090/tcp
sudo firewall-cmd --permanent --add-port=3000/tcp
sudo firewall-cmd --permanent --add-port=9093/tcp
sudo firewall-cmd --permanent --add-port=9100/tcp
sudo firewall-cmd --reload
# Restrict access by source
sudo firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="10.0.0.0/8" port port="9090" protocol="tcp" accept'
sudo firewall-cmd --reload
SSL/TLS Configuration
# Generate certificates
openssl req -new -newkey rsa:4096 -days 365 -nodes -x509 \
  -keyout prometheus.key -out prometheus.crt \
  -subj "/C=US/ST=State/L=City/O=Organization/CN=prometheus.example.com"
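Before wiring a certificate into a proxy it is worth confirming its subject and expiry. The sketch below generates a throwaway certificate in a temporary directory (so it does not touch the files above) and inspects it:

```shell
# Generate a disposable self-signed cert in a temp dir (illustrative only)
tmp=$(mktemp -d)
openssl req -new -newkey rsa:2048 -days 365 -nodes -x509 \
  -keyout "$tmp/test.key" -out "$tmp/test.crt" \
  -subj "/CN=prometheus.example.com" 2>/dev/null

# Print the subject and expiry date before deploying
openssl x509 -in "$tmp/test.crt" -noout -subject -enddate
rm -rf "$tmp"
```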
# Configure Nginx reverse proxy
sudo dnf install -y nginx
sudo nano /etc/nginx/conf.d/monitoring.conf
# Prometheus
server {
    listen 443 ssl http2;
    server_name prometheus.example.com;
    
    ssl_certificate /etc/nginx/ssl/prometheus.crt;
    ssl_certificate_key /etc/nginx/ssl/prometheus.key;
    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_ciphers HIGH:!aNULL:!MD5;
    
    location / {
        proxy_pass http://localhost:9090;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        
        auth_basic "Prometheus";
        auth_basic_user_file /etc/nginx/.htpasswd;
    }
}
# Grafana
server {
    listen 443 ssl http2;
    server_name grafana.example.com;
    
    ssl_certificate /etc/nginx/ssl/grafana.crt;
    ssl_certificate_key /etc/nginx/ssl/grafana.key;
    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_ciphers HIGH:!aNULL:!MD5;
    
    location / {
        proxy_pass http://localhost:3000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}
Performance Optimization
Prometheus Optimization
# Storage optimization (these are command-line flags, not prometheus.yml keys)
--storage.tsdb.retention.time=30d
--storage.tsdb.retention.size=100GB

# WAL compression (enabled by default since Prometheus 2.20)
--storage.tsdb.wal-compression

# Block durations (advanced flags; the defaults are usually best)
--storage.tsdb.min-block-duration=2h
--storage.tsdb.max-block-duration=48h

# Query optimization flags
# Concurrent queries
--query.max-concurrency=20
# Query timeout
--query.timeout=2m
# Max samples per query
--query.max-samples=50000000
# Lookback delta
--query.lookback-delta=5m
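Because these are flags, they belong on the ExecStart line of the systemd unit rather than in the configuration file. An excerpt might look like this (paths assumed from the installation section, values illustrative):

```ini
# Excerpt from /etc/systemd/system/prometheus.service
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --storage.tsdb.retention.time=30d \
  --storage.tsdb.retention.size=100GB \
  --query.max-concurrency=20 \
  --query.timeout=2m
```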
# Scrape optimization
scrape_configs:
  - job_name: 'optimized'
    # Increase scrape interval for less critical metrics
    scrape_interval: 60s
    
    # Reduce scrape timeout
    scrape_timeout: 10s
    
    # Limit sample size
    sample_limit: 10000
    
    # Limit label count
    label_limit: 30
    
    # Limit label name length
    label_name_length_limit: 200
    
    # Limit label value length
    label_value_length_limit: 200
Recording Rules for Performance
groups:
  - name: performance_rules
    interval: 30s
    rules:
      # Pre-calculate expensive queries
      - record: job:node_cpu:avg_rate5m
        expr: avg by(job) (rate(node_cpu_seconds_total[5m]))
        
      - record: job:node_memory:usage_percentage
        expr: |
          100 * (1 - (
            sum by(job) (node_memory_MemAvailable_bytes)
            /
            sum by(job) (node_memory_MemTotal_bytes)
          ))
          
      - record: instance:node_filesystem:usage_percentage
        expr: |
          100 - (
            100 * node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.lxcfs"}
            / node_filesystem_size_bytes{fstype!~"tmpfs|fuse.lxcfs"}
          )
Grafana Performance
# Database optimization
[database]
max_open_conn = 100
max_idle_conn = 100
conn_max_lifetime = 14400
# Caching
[caching]
enabled = true
# Data proxy
[dataproxy]
timeout = 30
keep_alive_seconds = 30
tls_handshake_timeout_seconds = 10
expect_continue_timeout_seconds = 1
max_idle_connections = 100
idle_conn_timeout_seconds = 90
# Rendering
[rendering]
concurrent_render_limit = 5
# Query caching
[feature_toggles]
enable = queryCaching
Query Optimization Tips
- Use Recording Rules
  - Pre-calculate expensive queries
  - Aggregate data at collection time
  - Reduce query complexity
- Optimize PromQL
  # Discouraged: grouping clause trailing the aggregation
  avg(rate(http_requests_total[5m])) by (job)
  # Preferred: grouping clause next to the operator, easier to read
  avg by (job) (rate(http_requests_total[5m]))
- Limit Time Ranges
  - Use appropriate time ranges
  - Avoid querying old data unnecessarily
  - Use downsampling for historical data
- Index Labels Properly
  - Keep cardinality in check
  - Use meaningful label names
  - Avoid high-cardinality labels
High Availability Setup
Prometheus HA Configuration
# prometheus-1.yml
global:
  scrape_interval: 15s
  external_labels:
    replica: '1'
    cluster: 'prod'
# prometheus-2.yml  
global:
  scrape_interval: 15s
  external_labels:
    replica: '2'
    cluster: 'prod'
Using Thanos for HA
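The `replica` external label above is what lets a query layer merge the two instances without double counting. With Thanos (installed in this section), a Query component pointed at both sidecars deduplicates on that label; a sketch of the unit file (addresses and ports illustrative):

```ini
# /etc/systemd/system/thanos-query.service (excerpt)
[Service]
ExecStart=/usr/local/bin/thanos query \
  --http-address=0.0.0.0:10904 \
  --grpc-address=0.0.0.0:10903 \
  --endpoint=prometheus-1.example.com:10901 \
  --endpoint=prometheus-2.example.com:10901 \
  --query.replica-label=replica
```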
# Install Thanos
THANOS_VERSION="0.32.0"
wget https://github.com/thanos-io/thanos/releases/download/v${THANOS_VERSION}/thanos-${THANOS_VERSION}.linux-amd64.tar.gz
tar xvf thanos-${THANOS_VERSION}.linux-amd64.tar.gz
sudo cp thanos-${THANOS_VERSION}.linux-amd64/thanos /usr/local/bin/
# Configure Thanos Sidecar
sudo nano /etc/systemd/system/thanos-sidecar.service
[Unit]
Description=Thanos Sidecar
After=prometheus.service
[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/bin/thanos sidecar \
  --tsdb.path=/var/lib/prometheus \
  --prometheus.url=http://localhost:9090 \
  --grpc-address=0.0.0.0:10901 \
  --http-address=0.0.0.0:10902
[Install]
WantedBy=multi-user.target
Grafana HA with Database
# Use external database for HA
[database]
type = postgres
host = postgres.example.com:5432
name = grafana
user = grafana
password = SecurePassword123!
ssl_mode = require
ca_cert_path = /etc/grafana/ca.pem
# Session state: the legacy [session] block was removed in Grafana 7+;
# use a shared remote cache instead so any instance can serve any user
[remote_cache]
type = redis
connstr = addr=redis.example.com:6379
Troubleshooting
Common Prometheus Issues
# Check Prometheus configuration
promtool check config /etc/prometheus/prometheus.yml
# Check rule files
promtool check rules /etc/prometheus/rules/*.yml
# Test service discovery
curl http://localhost:9090/api/v1/targets
# Check metrics ingestion
curl http://localhost:9090/api/v1/query?query=up
# Debug scraping issues
curl http://localhost:9090/api/v1/targets/metadata
# Check storage
du -sh /var/lib/prometheus/*
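High cardinality is the usual root cause of storage and query trouble. The real numbers come from the `/api/v1/status/tsdb` endpoint; as an offline illustration (the payload below is invented, though the field names follow the API), ranking metrics by series count looks like this:

```python
import json
from collections import Counter

# Hypothetical excerpt of /api/v1/status/tsdb output; field names follow
# the real API, but the numbers are invented for illustration.
status = json.loads("""
{
  "data": {
    "seriesCountByMetricName": [
      {"name": "http_requests_total", "value": 48210},
      {"name": "node_cpu_seconds_total", "value": 1280},
      {"name": "up", "value": 312}
    ]
  }
}
""")

# Rank metrics by how many series each contributes
series_counts = Counter({entry["name"]: entry["value"]
                         for entry in status["data"]["seriesCountByMetricName"]})
for name, count in series_counts.most_common(2):
    print(f"{name}: {count} series")
```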
# Analyze cardinality
curl -s http://localhost:9090/api/v1/label/__name__/values | jq '.data | length'
Common Grafana Issues
# Check Grafana logs
sudo journalctl -u grafana-server -f
# Test data source
curl -u admin:password http://localhost:3000/api/datasources
# Check plugin installation
grafana-cli plugins ls
# Re-encrypt data source secrets after a secret_key change
grafana-cli admin data-migration encrypt-datasource-passwords
# Reset admin password
grafana-cli admin reset-admin-password newpassword
Performance Diagnostics
# Prometheus metrics about itself
curl http://localhost:9090/metrics | grep prometheus_
# Key metrics to check:
# - prometheus_tsdb_head_series
# - prometheus_tsdb_head_samples_appended_total
# - prometheus_tsdb_symbol_table_size_bytes
# - prometheus_tsdb_head_chunks
# - prometheus_engine_query_duration_seconds
# - prometheus_http_request_duration_seconds
# Grafana performance metrics
curl http://localhost:3000/metrics | grep grafana_
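Both endpoints return plain text in the Prometheus exposition format, so a value can be pulled out with a few lines of Python when `jq` or `grep` is not enough. This sketch runs against a hard-coded sample rather than a live endpoint:

```python
import re

# Small sample standing in for a live /metrics response
sample = """\
# HELP prometheus_tsdb_head_series Number of series in the head block.
# TYPE prometheus_tsdb_head_series gauge
prometheus_tsdb_head_series 184523
prometheus_engine_query_duration_seconds{quantile="0.9",slice="inner_eval"} 0.00042
"""

def metric_value(text: str, name: str) -> float:
    """Return the first sample value for an exactly matching metric name."""
    pattern = re.compile(rf'^{re.escape(name)}(?:{{[^}}]*}})?\s+(\S+)$', re.M)
    match = pattern.search(text)
    if match is None:
        raise KeyError(name)
    return float(match.group(1))

print(metric_value(sample, "prometheus_tsdb_head_series"))  # 184523.0
```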
# System resource usage
top -p $(pgrep prometheus)
top -p $(pgrep grafana)
Best Practices
Monitoring Best Practices
- Label Management
  - Keep label cardinality under control
  - Use consistent label naming
  - Avoid dynamic label values
  - Document label meanings
- Query Optimization
  - Use recording rules for dashboards
  - Limit query time ranges
  - Avoid regex where possible
  - Cache frequently used queries
- Alert Design
  - Alert on symptoms, not causes
  - Include runbook links
  - Set appropriate thresholds
  - Test alerts regularly
- Dashboard Design
  - Group related metrics
  - Use consistent layouts
  - Include documentation
  - Version control dashboards
Operational Best Practices
- Backup Strategy
  # Back up Prometheus data (take a TSDB snapshot via the admin API,
  # or stop Prometheus first, so the copy is consistent)
  tar -czf prometheus-backup-$(date +%Y%m%d).tar.gz /var/lib/prometheus
  # Back up the Grafana SQLite database
  cp /var/lib/grafana/grafana.db grafana-backup-$(date +%Y%m%d).db
  # Export dashboards through the Grafana HTTP API
- Monitoring the Monitors
  - Monitor Prometheus with another instance
  - Set up alerts for the monitoring stack
  - Track resource usage
  - Monitor scrape performance
- Capacity Planning
  - Monitor storage growth
  - Track cardinality increases
  - Plan for retention needs
  - Scale before hitting limits
- Documentation
  - Document architecture
  - Maintain runbooks
  - Record configuration decisions
  - Keep dashboard documentation
Security Best Practices
- Access Control
  - Use strong authentication
  - Implement RBAC
  - Audit access logs
  - Review permissions regularly
- Network Security
  - Use TLS everywhere
  - Restrict network access
  - Implement firewall rules
  - Use VPN for remote access
- Data Protection
  - Encrypt data at rest
  - Secure backups
  - Limit data retention
  - Anonymize sensitive data
Conclusion
Deploying Prometheus and Grafana on Rocky Linux provides a powerful, scalable monitoring solution for modern infrastructure. This guide has covered:
- Complete installation and configuration
- Integration and dashboard creation
- Advanced features like service discovery
- Security hardening and best practices
- Performance optimization techniques
- High availability configurations
Remember that monitoring is an iterative process. Start with basic metrics, gradually add more sophisticated monitoring, and continuously refine based on your needs.