Building a Monitoring Stack with Prometheus and Grafana on AlmaLinux
almalinux prometheus grafana

Published Jul 15, 2025

Deploy a complete monitoring solution using Prometheus and Grafana on AlmaLinux. Learn metrics collection, visualization, alerting, and best practices for production environments.

Monitoring is what tells you whether your systems are healthy, fast, and reliable. This guide walks through building a monitoring stack with Prometheus and Grafana on AlmaLinux 9, covering metrics collection, dashboards, alerting, and the hardening and backup steps to complete before running it in production.

Understanding Modern Monitoring

Modern monitoring goes beyond simple up/down checks. It encompasses:

  • Metrics Collection: Time-series data about system and application performance
  • Visualization: Interactive dashboards for data exploration
  • Alerting: Proactive notification of issues
  • Service Discovery: Automatic detection of monitoring targets
  • Long-term Storage: Historical data for trend analysis

Prerequisites

Before building your monitoring stack, ensure you have:

  • AlmaLinux 9 server (minimum 4 CPU cores, 8GB RAM)
  • Root or sudo access
  • Basic understanding of Linux system administration
  • Firewall configured (we’ll open necessary ports)
  • DNS configured (optional but recommended)

Architecture Overview

Our monitoring stack consists of:

  1. Prometheus: Metrics collection and storage
  2. Grafana: Visualization and dashboards
  3. Node Exporter: System metrics collection
  4. Alertmanager: Alert routing and management
  5. Additional Exporters: Application-specific metrics

Installing Prometheus

Setting Up Prometheus User and Directories

First, create a dedicated user and directory structure:

# Create prometheus user
sudo useradd --no-create-home --shell /bin/false prometheus

# Create directories
sudo mkdir /etc/prometheus
sudo mkdir /var/lib/prometheus

# Set ownership
sudo chown prometheus:prometheus /etc/prometheus
sudo chown prometheus:prometheus /var/lib/prometheus

Downloading and Installing Prometheus

# Download Prometheus
cd /tmp
PROMETHEUS_VERSION="2.45.0"
wget https://github.com/prometheus/prometheus/releases/download/v${PROMETHEUS_VERSION}/prometheus-${PROMETHEUS_VERSION}.linux-amd64.tar.gz

# Extract archive
tar xvf prometheus-${PROMETHEUS_VERSION}.linux-amd64.tar.gz

# Copy binaries
sudo cp prometheus-${PROMETHEUS_VERSION}.linux-amd64/prometheus /usr/local/bin/
sudo cp prometheus-${PROMETHEUS_VERSION}.linux-amd64/promtool /usr/local/bin/

# Copy console files
sudo cp -r prometheus-${PROMETHEUS_VERSION}.linux-amd64/consoles /etc/prometheus
sudo cp -r prometheus-${PROMETHEUS_VERSION}.linux-amd64/console_libraries /etc/prometheus

# Set ownership
sudo chown prometheus:prometheus /usr/local/bin/prometheus
sudo chown prometheus:prometheus /usr/local/bin/promtool
sudo chown -R prometheus:prometheus /etc/prometheus/consoles
sudo chown -R prometheus:prometheus /etc/prometheus/console_libraries

# Clean up
rm -rf prometheus-${PROMETHEUS_VERSION}.linux-amd64*
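
Before extracting and installing a downloaded archive, it is worth checking it against the checksums published with the release. The following is a minimal sketch, not part of the original procedure: `verify_release` is a hypothetical helper name, and the `sha256sums.txt` file name in the usage comment is an assumption about how the release assets are laid out, so confirm it on the release page.

```shell
# verify_release TARBALL SUMS_FILE  (hypothetical helper)
# Succeeds only if TARBALL's SHA-256 hash matches the entry for its
# file name in SUMS_FILE (lines in "hash  filename" format).
verify_release() {
    local tarball="$1" sums_file="$2" expected actual
    # Look up the expected hash by exact file name (second column).
    expected="$(awk -v f="$(basename "$tarball")" '$2 == f { print $1 }' "$sums_file")"
    if [ -z "$expected" ]; then
        echo "no checksum listed for $tarball" >&2
        return 1
    fi
    # Hash the file on disk and compare.
    actual="$(sha256sum "$tarball" | awk '{ print $1 }')"
    [ "$expected" = "$actual" ]
}

# Usage (assumed asset name; run after the tarball download and before extracting):
#   wget https://github.com/prometheus/prometheus/releases/download/v${PROMETHEUS_VERSION}/sha256sums.txt
#   verify_release prometheus-${PROMETHEUS_VERSION}.linux-amd64.tar.gz sha256sums.txt || echo "checksum mismatch"
```

The same helper can be reused for the Node Exporter and Alertmanager archives later in this guide.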

Configuring Prometheus

Create the main configuration file:

# /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    monitor: 'prometheus-stack'
    environment: 'production'

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093

# Load rules once and periodically evaluate them
rule_files:
  - "alerts/*.yml"

# Scrape configurations
scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
        labels:
          instance: 'prometheus-server'

  # Node Exporter
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']
        labels:
          instance: 'monitoring-server'

  # Service discovery for node exporters
  - job_name: 'node-discovery'
    file_sd_configs:
      - files:
          - '/etc/prometheus/targets/nodes/*.yml'
        refresh_interval: 30s

  # Blackbox exporter for endpoint monitoring
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://example.com
          - https://api.example.com
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: localhost:9115

Set permissions:

sudo chown prometheus:prometheus /etc/prometheus/prometheus.yml

Creating Systemd Service

Create a systemd service file:

# /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus Monitoring System
Documentation=https://prometheus.io/docs/introduction/overview/
Wants=network-online.target
After=network-online.target

[Service]
Type=simple
User=prometheus
Group=prometheus
ExecStart=/usr/local/bin/prometheus \
    --config.file /etc/prometheus/prometheus.yml \
    --storage.tsdb.path /var/lib/prometheus/ \
    --storage.tsdb.retention.time=30d \
    --storage.tsdb.retention.size=10GB \
    --web.console.templates=/etc/prometheus/consoles \
    --web.console.libraries=/etc/prometheus/console_libraries \
    --web.enable-lifecycle \
    --web.enable-admin-api

Restart=on-failure
RestartSec=5s

[Install]
WantedBy=multi-user.target

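Start and enable Prometheus. The --web.enable-lifecycle and --web.enable-admin-api flags expose unauthenticated endpoints that can reload or shut down the server and delete data, so omit them or keep port 9090 restricted to trusted hosts: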

sudo systemctl daemon-reload
sudo systemctl enable prometheus
sudo systemctl start prometheus
sudo systemctl status prometheus
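
`systemctl status` only shows that the process started, not that Prometheus is ready to serve queries. As a rough sketch, a small retry loop can poll the readiness endpoint instead. `wait_until` is a hypothetical helper introduced here for illustration; the `/-/ready` path is Prometheus's readiness endpoint.

```shell
# wait_until ATTEMPTS DELAY COMMAND...  (hypothetical helper)
# Runs COMMAND up to ATTEMPTS times, sleeping DELAY seconds between
# tries, and succeeds as soon as COMMAND succeeds.
wait_until() {
    local attempts="$1" delay="$2" i=1
    shift 2
    while [ "$i" -le "$attempts" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        # Do not sleep after the final failed attempt.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}

# Usage: wait up to about a minute for Prometheus to report ready.
#   wait_until 30 2 curl -fsS http://localhost:9090/-/ready && echo "Prometheus is ready"
```

Passing the probe as a command keeps the helper reusable for the other services in this stack, for example with `http://localhost:9100/metrics` for Node Exporter.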

Installing Node Exporter

Setting Up Node Exporter

Node Exporter provides system-level metrics:

# Create user
sudo useradd --no-create-home --shell /bin/false node_exporter

# Download Node Exporter
cd /tmp
NODE_EXPORTER_VERSION="1.6.0"
wget https://github.com/prometheus/node_exporter/releases/download/v${NODE_EXPORTER_VERSION}/node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz

# Extract and install
tar xvf node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz
sudo cp node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64/node_exporter /usr/local/bin/
sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter

# Clean up
rm -rf node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64*

Node Exporter Systemd Service

Create the service file:

# /etc/systemd/system/node_exporter.service
[Unit]
Description=Node Exporter
Documentation=https://prometheus.io/docs/guides/node-exporter/
Wants=network-online.target
After=network-online.target

[Service]
Type=simple
User=node_exporter
Group=node_exporter
ExecStart=/usr/local/bin/node_exporter \
    --collector.systemd \
    --collector.processes \
    --collector.mountstats \
    --collector.filesystem.mount-points-exclude=^/(dev|proc|sys|run|snap|var/lib/docker/.+)($|/) \
    --collector.filesystem.fs-types-exclude=^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|iso9660|mqueue|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs)$

Restart=on-failure
RestartSec=5s

[Install]
WantedBy=multi-user.target

Enable and start:

sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter
sudo systemctl status node_exporter

Installing Grafana

Adding Grafana Repository

# Add Grafana repository
cat <<EOF | sudo tee /etc/yum.repos.d/grafana.repo
[grafana]
name=grafana
baseurl=https://rpm.grafana.com
repo_gpgcheck=1
enabled=1
gpgcheck=1
gpgkey=https://rpm.grafana.com/gpg.key
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt
EOF

# Install Grafana
sudo dnf install -y grafana

Configuring Grafana

Edit the configuration file:

# /etc/grafana/grafana.ini
[server]
protocol = http
http_addr = 0.0.0.0
http_port = 3000
domain = grafana.example.com
root_url = %(protocol)s://%(domain)s:%(http_port)s/
enable_gzip = true

[database]
type = sqlite3
path = grafana.db

[security]
admin_user = admin
admin_password = ChangeMeNow!
secret_key = your-secret-key-here
disable_gravatar = false
cookie_secure = false
cookie_samesite = lax
allow_embedding = false

[users]
allow_sign_up = false
allow_org_create = false
auto_assign_org = true
auto_assign_org_role = Viewer

[auth.anonymous]
enabled = false

[auth.basic]
enabled = true

[smtp]
enabled = true
host = localhost:25
user = 
password = 
cert_file = 
key_file = 
skip_verify = false
# Placeholder sender address; replace with your own
from_address = grafana@example.com
from_name = Grafana

[unified_alerting]
enabled = true

[analytics]
reporting_enabled = false
check_for_updates = true

[log]
mode = console file
level = info

[metrics]
enabled = true
interval_seconds = 10
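
The `admin_password` and `secret_key` values above are placeholders and should be replaced with random values before Grafana is first started. One possible way to generate them is sketched below; `gen_secret` is a hypothetical helper, and it simply hex-encodes bytes read from `/dev/urandom`.

```shell
# gen_secret [BYTES]  (hypothetical helper)
# Prints BYTES random bytes (default 32) as a lowercase hex string,
# so the output is 2 * BYTES characters long.
gen_secret() {
    local bytes="${1:-32}"
    # od reads exactly BYTES bytes and prints them as hex pairs;
    # tr strips the spaces and newlines od adds between pairs.
    od -An -N"$bytes" -tx1 /dev/urandom | tr -d ' \n'
    echo
}

# Usage: generate values to paste into grafana.ini.
#   gen_secret 32   # for secret_key
#   gen_secret 16   # for admin_password
```

Hex output avoids characters such as `#` and `;` that would need quoting in an INI file.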

Starting Grafana

sudo systemctl daemon-reload
sudo systemctl enable grafana-server
sudo systemctl start grafana-server
sudo systemctl status grafana-server

Configuring Firewall

Open necessary ports:

# Open ports
sudo firewall-cmd --permanent --add-port=9090/tcp  # Prometheus
sudo firewall-cmd --permanent --add-port=9100/tcp  # Node Exporter
sudo firewall-cmd --permanent --add-port=3000/tcp  # Grafana
sudo firewall-cmd --permanent --add-port=9093/tcp  # Alertmanager

# Reload firewall
sudo firewall-cmd --reload

# Verify
sudo firewall-cmd --list-all

Setting Up Alerting

Installing Alertmanager

# Create user
sudo useradd --no-create-home --shell /bin/false alertmanager

# Create directories
sudo mkdir /etc/alertmanager
sudo mkdir /var/lib/alertmanager

# Download Alertmanager
cd /tmp
ALERTMANAGER_VERSION="0.25.0"
wget https://github.com/prometheus/alertmanager/releases/download/v${ALERTMANAGER_VERSION}/alertmanager-${ALERTMANAGER_VERSION}.linux-amd64.tar.gz

# Extract and install
tar xvf alertmanager-${ALERTMANAGER_VERSION}.linux-amd64.tar.gz
sudo cp alertmanager-${ALERTMANAGER_VERSION}.linux-amd64/alertmanager /usr/local/bin/
sudo cp alertmanager-${ALERTMANAGER_VERSION}.linux-amd64/amtool /usr/local/bin/

# Set ownership
sudo chown alertmanager:alertmanager /usr/local/bin/alertmanager
sudo chown alertmanager:alertmanager /usr/local/bin/amtool
sudo chown -R alertmanager:alertmanager /etc/alertmanager
sudo chown -R alertmanager:alertmanager /var/lib/alertmanager

# Clean up
rm -rf alertmanager-${ALERTMANAGER_VERSION}.linux-amd64*

Configuring Alertmanager

Create the configuration:

# /etc/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_from: 'alertmanager@example.com'  # placeholder sender address
  smtp_smarthost: 'localhost:25'
  smtp_require_tls: false

templates:
  - '/etc/alertmanager/templates/*.tmpl'

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'critical'
      continue: true
    - match:
        severity: warning
      receiver: 'warning'
    - match_re:
        service: database|cache
      receiver: 'database-team'

# Recipient addresses below are placeholders; replace them with your own
receivers:
  - name: 'default'
    email_configs:
      - to: 'ops@example.com'
        headers:
          Subject: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}'

  - name: 'critical'
    email_configs:
      - to: 'oncall@example.com'
    pagerduty_configs:
      - service_key: 'your-pagerduty-service-key'

  - name: 'warning'
    email_configs:
      - to: 'ops@example.com'
        send_resolved: false

  - name: 'database-team'
    email_configs:
      - to: 'dba@example.com'
    webhook_configs:
      - url: 'http://localhost:5001/webhook'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']

Alertmanager Service

# /etc/systemd/system/alertmanager.service
[Unit]
Description=Prometheus Alertmanager
Documentation=https://prometheus.io/docs/alerting/alertmanager/
Wants=network-online.target
After=network-online.target

[Service]
Type=simple
User=alertmanager
Group=alertmanager
ExecStart=/usr/local/bin/alertmanager \
    --config.file=/etc/alertmanager/alertmanager.yml \
    --storage.path=/var/lib/alertmanager/ \
    --cluster.listen-address=

Restart=on-failure
RestartSec=5s

[Install]
WantedBy=multi-user.target

Enable and start:

sudo chown alertmanager:alertmanager /etc/alertmanager/alertmanager.yml
sudo systemctl daemon-reload
sudo systemctl enable alertmanager
sudo systemctl start alertmanager

Creating Alert Rules

System Alert Rules

Create the alert rules directory:

sudo mkdir /etc/prometheus/alerts
sudo chown prometheus:prometheus /etc/prometheus/alerts

Create system alerts:

# /etc/prometheus/alerts/system_alerts.yml
groups:
  - name: system_alerts
    interval: 30s
    rules:
      # High CPU usage
      - alert: HighCPUUsage
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 80% (current value: {{ $value }}%)"

      # High memory usage
      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 85% (current value: {{ $value }}%)"

      # Disk space low
      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 20
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk space is below 20% (current value: {{ $value }}%)"

      # Node down
      - alert: NodeDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} has been down for more than 5 minutes"

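      # High load average (on (instance) is required because node_load5 carries
      # labels, such as job, that the count by (instance) result does not)
      - alert: HighLoadAverage
        expr: node_load5 > on (instance) (count by (instance) (node_cpu_seconds_total{mode="idle"}) * 0.8)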
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "High load average on {{ $labels.instance }}"
          description: "5-minute load average is high (current value: {{ $value }})"

Application Alert Rules

# /etc/prometheus/alerts/application_alerts.yml
groups:
  - name: application_alerts
    interval: 30s
    rules:
      # Service down
      - alert: ServiceDown
        expr: probe_success == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} has been unreachable for 5 minutes"

      # High response time (probe_http_duration_seconds is reported per phase,
      # so alert on the total probe duration instead)
      - alert: HighResponseTime
        expr: probe_duration_seconds > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High response time for {{ $labels.instance }}"
          description: "Response time is above 2 seconds (current value: {{ $value }}s)"

      # SSL certificate expiry
      - alert: SSLCertificateExpiringSoon
        expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "SSL certificate expiring soon for {{ $labels.instance }}"
          description: "SSL certificate expires in {{ $value | humanizeDuration }}"

Configuring Grafana Dashboards

Adding Prometheus Data Source

  1. Log into Grafana (http://your-server:3000)
  2. Navigate to Connections → Data sources (Configuration → Data Sources in older Grafana versions)
  3. Click “Add data source”
  4. Select Prometheus
  5. Set the URL to http://localhost:9090 (Prometheus runs on the same host in this guide) and click “Save & test”

Creating System Dashboard

Create a comprehensive system monitoring dashboard:

{
  "dashboard": {
    "title": "System Monitoring",
    "panels": [
      {
        "title": "CPU Usage",
        "targets": [
          {
            "expr": "100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
            "legendFormat": "{{ instance }}"
          }
        ],
        "type": "timeseries"
      },
      {
        "title": "Memory Usage",
        "targets": [
          {
            "expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100",
            "legendFormat": "{{ instance }}"
          }
        ],
        "type": "timeseries"
      },
      {
        "title": "Disk Usage",
        "targets": [
          {
            "expr": "(1 - (node_filesystem_avail_bytes{mountpoint=\"/\"} / node_filesystem_size_bytes{mountpoint=\"/\"})) * 100",
            "legendFormat": "{{ instance }} - {{ mountpoint }}"
          }
        ],
        "type": "timeseries"
      },
      {
        "title": "Network Traffic",
        "targets": [
          {
            "expr": "rate(node_network_receive_bytes_total[5m])",
            "legendFormat": "{{ instance }} - {{ device }} - RX"
          },
          {
            "expr": "rate(node_network_transmit_bytes_total[5m])",
            "legendFormat": "{{ instance }} - {{ device }} - TX"
          }
        ],
        "type": "timeseries"
      }
    ]
  }
}

Importing Community Dashboards

Import popular dashboards:

  1. Node Exporter Full: Dashboard ID 1860
  2. Prometheus Stats: Dashboard ID 2
  3. Alertmanager: Dashboard ID 9578

To import:

  1. Go to Dashboards → New → Import (Create → Import in older Grafana versions)
  2. Enter dashboard ID
  3. Select Prometheus data source
  4. Click Import

Advanced Monitoring Configuration

Service Discovery

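Configure file-based service discovery. Create the /etc/prometheus/targets/nodes directory (owned by the prometheus user), then add target files that match the file_sd_configs glob defined in prometheus.yml: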

# /etc/prometheus/targets/nodes/webservers.yml
- targets:
    - 'web01.example.com:9100'
    - 'web02.example.com:9100'
    - 'web03.example.com:9100'
  labels:
    env: 'production'
    role: 'webserver'
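
Target files like the one above are often generated from an inventory rather than written by hand. The following sketch shows one way to do that; `gen_node_targets` is a hypothetical helper, and it assumes every host runs Node Exporter on port 9100 as in this guide.

```shell
# gen_node_targets ENVIRONMENT ROLE HOST...  (hypothetical helper)
# Prints a file_sd target group for the given hosts on stdout.
gen_node_targets() {
    local environment="$1" role="$2" host
    shift 2
    printf '%s\n' '- targets:'
    for host in "$@"; do
        printf "    - '%s:9100'\n" "$host"
    done
    printf '%s\n' '  labels:'
    printf "    env: '%s'\n" "$environment"
    printf "    role: '%s'\n" "$role"
}

# Usage: write to a temporary file first, then move it into place so
# Prometheus does not pick up a partially written file.
#   gen_node_targets production webserver web01.example.com web02.example.com > /tmp/webservers.yml
#   sudo mv /tmp/webservers.yml /etc/prometheus/targets/nodes/webservers.yml
```

Prometheus re-reads files matched by file_sd_configs on its own, so no reload is needed after the move.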

Custom Exporters

Install additional exporters for specific services:

# MySQL Exporter
wget https://github.com/prometheus/mysqld_exporter/releases/download/v0.14.0/mysqld_exporter-0.14.0.linux-amd64.tar.gz
tar xvf mysqld_exporter-0.14.0.linux-amd64.tar.gz
sudo cp mysqld_exporter-0.14.0.linux-amd64/mysqld_exporter /usr/local/bin/

# PostgreSQL Exporter
wget https://github.com/prometheus-community/postgres_exporter/releases/download/v0.13.0/postgres_exporter-0.13.0.linux-amd64.tar.gz
tar xvf postgres_exporter-0.13.0.linux-amd64.tar.gz
sudo cp postgres_exporter-0.13.0.linux-amd64/postgres_exporter /usr/local/bin/

# Redis Exporter
wget https://github.com/oliver006/redis_exporter/releases/download/v1.52.0/redis_exporter-v1.52.0.linux-amd64.tar.gz
tar xvf redis_exporter-v1.52.0.linux-amd64.tar.gz
sudo cp redis_exporter-v1.52.0.linux-amd64/redis_exporter /usr/local/bin/

Prometheus Federation

For large-scale deployments, configure Prometheus federation:

# /etc/prometheus/prometheus.yml - Global Prometheus
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job=~"node|mysql|postgresql"}'
        - 'up'
    static_configs:
      - targets:
          - 'prometheus-dc1.example.com:9090'
          - 'prometheus-dc2.example.com:9090'

Performance Optimization

Prometheus Storage Configuration

Optimize storage for your workload:

# Adjust in prometheus.service
--storage.tsdb.retention.time=90d
--storage.tsdb.retention.size=100GB
--storage.tsdb.wal-compression
--query.max-samples=50000000
--query.timeout=2m
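
When choosing retention values, a rough disk estimate helps. The Prometheus documentation gives the approximation needed_disk_space = retention_time_seconds * ingested_samples_per_second * bytes_per_sample, with roughly 1 to 2 bytes per sample on average. The sketch below applies that formula; `estimate_tsdb_bytes` is a hypothetical helper and the ingestion rate is an example figure, not a measurement.

```shell
# estimate_tsdb_bytes RETENTION_DAYS SAMPLES_PER_SECOND BYTES_PER_SAMPLE
# (hypothetical helper) Prints the approximate TSDB size in bytes.
estimate_tsdb_bytes() {
    awk -v d="$1" -v s="$2" -v b="$3" \
        'BEGIN { printf "%.0f\n", d * 86400 * s * b }'
}

# Example: 90 days of retention, 10,000 samples/s, 2 bytes per sample.
estimate_tsdb_bytes 90 10000 2
# prints 155520000000 (about 155 GB)
```

At that example rate, a 100GB size limit would be reached before the 90-day time limit, so the size limit would be the effective one. The actual ingestion rate can be read from a running server with the query `rate(prometheus_tsdb_head_samples_appended_total[5m])`.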

Grafana Performance

Optimize Grafana for large deployments:

# /etc/grafana/grafana.ini
[database]
type = postgres
host = localhost:5432
name = grafana
user = grafana
password = secure_password
ssl_mode = require
max_open_conn = 100
max_idle_conn = 50
conn_max_lifetime = 14400

[caching]
enabled = true

[dataproxy]
timeout = 30
keep_alive_seconds = 30

Security Hardening

Implementing Authentication

Prometheus Basic Auth

# Install nginx and httpd-tools (for htpasswd)
sudo dnf install -y nginx httpd-tools

# Create password file
sudo htpasswd -c /etc/prometheus/.htpasswd admin

# Configure nginx reverse proxy
cat <<EOF | sudo tee /etc/nginx/conf.d/prometheus.conf
server {
    listen 443 ssl http2;
    server_name prometheus.example.com;

    ssl_certificate /etc/ssl/certs/prometheus.crt;
    ssl_certificate_key /etc/ssl/private/prometheus.key;

    location / {
        auth_basic "Prometheus";
        auth_basic_user_file /etc/prometheus/.htpasswd;
        proxy_pass http://localhost:9090;
        proxy_set_header Host \$host;
        proxy_set_header X-Real-IP \$remote_addr;
    }
}
EOF

# Allow nginx to connect to Prometheus under SELinux, then start nginx
sudo setsebool -P httpd_can_network_connect 1
sudo systemctl enable --now nginx

Grafana LDAP Integration

# /etc/grafana/ldap.toml
[[servers]]
host = "ldap.example.com"
port = 389
use_ssl = false
start_tls = true
ssl_skip_verify = false
bind_dn = "cn=readonly,dc=example,dc=com"
bind_password = 'readonly_password'
search_filter = "(uid=%s)"
search_base_dns = ["ou=users,dc=example,dc=com"]

[servers.attributes]
name = "givenName"
surname = "sn"
username = "uid"
member_of = "memberOf"
email = "mail"

[[servers.group_mappings]]
group_dn = "cn=admins,ou=groups,dc=example,dc=com"
org_role = "Admin"

[[servers.group_mappings]]
group_dn = "cn=users,ou=groups,dc=example,dc=com"
org_role = "Editor"

Network Security

Implement proper network segmentation:

# Create monitoring network zone
sudo firewall-cmd --permanent --new-zone=monitoring
sudo firewall-cmd --permanent --zone=monitoring --add-source=10.0.0.0/24
sudo firewall-cmd --permanent --zone=monitoring --add-port=9090/tcp
sudo firewall-cmd --permanent --zone=monitoring --add-port=9100/tcp
sudo firewall-cmd --reload

Backup and Recovery

Automated Backup Script

Create a backup script for Prometheus and Grafana data:

#!/bin/bash
# /usr/local/bin/prometheus-backup.sh
set -euo pipefail

BACKUP_DIR="/backup/prometheus"
PROMETHEUS_DATA="/var/lib/prometheus"
GRAFANA_DATA="/var/lib/grafana"
RETENTION_DAYS=30
TIMESTAMP="$(date +%Y%m%d-%H%M%S)"

# Create backup directory
mkdir -p "${BACKUP_DIR}"

# Stop Prometheus for a consistent backup; restart it even if tar fails
systemctl stop prometheus
trap 'systemctl start prometheus' EXIT

# Create backup
tar -czf "${BACKUP_DIR}/prometheus-backup-${TIMESTAMP}.tar.gz" -C "${PROMETHEUS_DATA}" .

# Start Prometheus
systemctl start prometheus
trap - EXIT

# Back up the Grafana data directory (includes the SQLite database)
tar -czf "${BACKUP_DIR}/grafana-backup-${TIMESTAMP}.tar.gz" -C "${GRAFANA_DATA}" .

# Clean old backups
find "${BACKUP_DIR}" -name "*-backup-*.tar.gz" -mtime +"${RETENTION_DAYS}" -delete

Make the script executable (chmod +x), then schedule it in root's crontab, since it calls systemctl:

# Add to crontab
0 2 * * * /usr/local/bin/prometheus-backup.sh

Troubleshooting

Common Issues and Solutions

Prometheus Not Starting

# Check logs
journalctl -u prometheus -f

# Validate configuration
promtool check config /etc/prometheus/prometheus.yml

# Check permissions
ls -la /var/lib/prometheus/
ls -la /etc/prometheus/

High Memory Usage

# Reduce memory usage in prometheus.yml
global:
  scrape_interval: 30s  # Increase interval
  evaluation_interval: 30s

# Adjust retention
--storage.tsdb.retention.time=7d
--storage.tsdb.retention.size=10GB

Missing Metrics

# Check target status
curl http://localhost:9090/api/v1/targets

# Verify exporter is running
curl http://localhost:9100/metrics

# Check firewall
sudo firewall-cmd --list-all
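
When an exporter is reachable but a specific metric seems to be missing, it helps to check the raw exposition output for that metric name. A small sketch, where `has_metric` is a hypothetical helper that reads the exporter's text output on stdin:

```shell
# has_metric NAME  (hypothetical helper)
# Succeeds if the exposition-format text on stdin contains at least
# one sample line for NAME. "# HELP" and "# TYPE" comment lines and
# longer metric names that merely start with NAME do not count.
has_metric() {
    grep -Eq "^$1(\{| )"
}

# Usage: check whether Node Exporter currently exposes node_load5.
#   curl -s http://localhost:9100/metrics | has_metric node_load5 && echo "present"
```

If the metric is present at the exporter but absent in Prometheus, the problem is on the scrape side (target configuration, relabeling, or firewall) rather than in the exporter.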

Best Practices

  1. Monitoring Philosophy

    • Monitor symptoms, not causes
    • Alert on user-facing issues
    • Keep metrics focused and relevant
  2. Resource Planning

    • Plan for 2x expected growth
    • Monitor the monitoring system
    • Regular capacity reviews
  3. Alert Design

    • Avoid alert fatigue
    • Make alerts actionable
    • Include remediation steps
  4. Dashboard Design

    • Start with overview dashboards
    • Drill down to details
    • Use consistent color schemes
  5. Maintenance

    • Regular backups
    • Update components quarterly
    • Review and prune unused metrics

Conclusion

You now have a comprehensive monitoring stack running on AlmaLinux with Prometheus and Grafana. This setup provides deep visibility into your infrastructure, enabling proactive issue detection and data-driven decision making. Remember to continuously refine your monitoring as your infrastructure evolves, adding new metrics and alerts as needed while removing those that no longer provide value.

Prometheus’s query language and Grafana’s dashboards together give you a monitoring setup that can grow from a single server to larger, multi-site environments, for example through the federation setup described above.