📊 Installing Prometheus and Grafana on Alpine Linux: Complete Monitoring Guide
Let’s build a powerful monitoring and visualization stack with Prometheus and Grafana on Alpine Linux! 🚀 This comprehensive tutorial shows you how to set up complete infrastructure monitoring with metrics collection, alerting, and beautiful dashboards. Perfect for DevOps teams and system administrators! 😊
🤔 What are Prometheus and Grafana?
Prometheus is a powerful monitoring and alerting system that collects metrics from your infrastructure, while Grafana provides stunning visualizations and dashboards for your data!
This monitoring stack is like:
- 🔍 Smart surveillance systems that watch over your entire infrastructure
- 📈 Business intelligence dashboards that turn raw data into insights
- 🚨 Early warning systems that alert you before problems become critical
🎯 What You Need
Before we start, you need:
- ✅ Alpine Linux system with sufficient resources (4GB+ RAM recommended)
- ✅ Understanding of system monitoring concepts and metrics
- ✅ Basic knowledge of networking and service configuration
- ✅ Root access for system service installation
📋 Step 1: Install and Configure Prometheus
Install Prometheus Server
Let’s install Prometheus, the metrics collection and storage engine! 😊
What we’re doing: Installing Prometheus server for comprehensive metrics collection and monitoring.
# Update package list
apk update
# Install Prometheus server
apk add prometheus
# Install additional monitoring tools
apk add prometheus-node-exporter prometheus-alertmanager
# Check Prometheus version
prometheus --version
# Check installation paths
ls -la /etc/prometheus/
ls -la /var/lib/prometheus/
# Create Prometheus user and directories
adduser -D -s /bin/false prometheus 2>/dev/null || true
mkdir -p /var/lib/prometheus
mkdir -p /etc/prometheus/rules
mkdir -p /etc/prometheus/file_sd
chown -R prometheus:prometheus /var/lib/prometheus /etc/prometheus
# Start Prometheus service
rc-service prometheus start
# Enable Prometheus to start at boot
rc-update add prometheus default
# Test Prometheus web interface
echo "Prometheus should be available at: http://localhost:9090"
What this does: 📖 Installs Prometheus with all necessary components for monitoring.
Example output:
prometheus, version 2.45.0 (branch: HEAD, revision: 8b2f6b4)
build user: builduser@buildhost
build date: 20231124-14:56:23
go version: go1.21.5
What this means: Prometheus is installed and ready for configuration! ✅
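Before moving on, it's worth a quick sanity check against the Prometheus HTTP API. This is optional and assumes curl and jq are available (install them with apk if they are not):
# Install the query helpers used throughout this guide
apk add curl jq
# Ask Prometheus which targets are currently up
curl -s 'http://localhost:9090/api/v1/query?query=up' | jq '.data.result'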
Configure Prometheus for Production
Let’s create a comprehensive Prometheus configuration! 🎯
What we’re doing: Configuring Prometheus with targets, rules, and optimal settings for production monitoring.
# Backup original Prometheus configuration
cp /etc/prometheus/prometheus.yml /etc/prometheus/prometheus.yml.backup
# Create comprehensive Prometheus configuration
cat > /etc/prometheus/prometheus.yml << 'EOF'
# Prometheus Configuration for Complete Monitoring
global:
  scrape_interval: 15s       # How frequently to scrape targets
  evaluation_interval: 15s   # How frequently to evaluate rules
  external_labels:
    cluster: 'alpine-production'
    environment: 'production'

# Alerting configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093

# Rules configuration
rule_files:
  - "/etc/prometheus/rules/*.yml"

# Scrape configurations
scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
    scrape_interval: 5s
    metrics_path: /metrics

  # Node Exporter for system metrics
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']
    scrape_interval: 10s
    metrics_path: /metrics

  # Alpine Linux specific monitoring
  - job_name: 'alpine-system'
    static_configs:
      - targets: ['localhost:9100']
    scrape_interval: 15s
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'node_(cpu|memory|disk|network).*'
        target_label: __name__
        replacement: 'alpine_${1}'

  # Application monitoring
  - job_name: 'application-metrics'
    file_sd_configs:
      - files:
          - '/etc/prometheus/file_sd/applications.yml'
    scrape_interval: 30s

  # Docker container monitoring (if Docker is installed)
  - job_name: 'docker-containers'
    static_configs:
      - targets: ['localhost:9323']
    scrape_interval: 30s
    metrics_path: /metrics

  # Custom service monitoring
  - job_name: 'custom-services'
    file_sd_configs:
      - files:
          - '/etc/prometheus/file_sd/services.yml'
    scrape_interval: 60s

# NOTE: data retention and web options (listen address, lifecycle/admin APIs) are
# command-line flags in Prometheus 2.x, not prometheus.yml settings -- they are
# set via the service's startup flags right after this file.
EOF
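# Retention and web options must be passed as command-line flags; promtool rejects
# 'storage:' and 'web:' blocks in prometheus.yml. The snippet below is a sketch:
# the variable name read by the OpenRC init script may differ on your Alpine
# version, so check /etc/init.d/prometheus and /etc/conf.d/prometheus before
# applying it. --web.enable-lifecycle is required for the /-/reload calls used
# later in this guide.
cat >> /etc/conf.d/prometheus << 'EOF'
prometheus_args="--config.file=/etc/prometheus/prometheus.yml \
 --storage.tsdb.path=/var/lib/prometheus/data \
 --storage.tsdb.retention.time=30d \
 --storage.tsdb.retention.size=10GB \
 --web.listen-address=0.0.0.0:9090 \
 --web.enable-lifecycle \
 --web.enable-admin-api"
EOF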
# Create application discovery configuration
cat > /etc/prometheus/file_sd/applications.yml << 'EOF'
# Application Service Discovery
- targets:
    - 'localhost:8080'
    - 'localhost:8081'
  labels:
    service: 'web-application'
    environment: 'production'
    team: 'backend'
- targets:
    - 'localhost:3000'
  labels:
    service: 'frontend-application'
    environment: 'production'
    team: 'frontend'
EOF
# Create services discovery configuration
cat > /etc/prometheus/file_sd/services.yml << 'EOF'
# Service Discovery for Custom Services
# NOTE: Redis and Memcached don't expose Prometheus metrics natively; scraping
# the service ports directly will fail until a matching exporter is in place.
# Point these targets at your exporters instead (redis_exporter defaults to 9121,
# memcached_exporter to 9150).
- targets:
    - 'localhost:6379'
  labels:
    service: 'redis'
    environment: 'production'
    type: 'database'
- targets:
    - 'localhost:11211'
  labels:
    service: 'memcached'
    environment: 'production'
    type: 'cache'
EOF
# Set proper ownership
chown -R prometheus:prometheus /etc/prometheus/
# Validate Prometheus configuration
promtool check config /etc/prometheus/prometheus.yml
# Restart Prometheus with new configuration
rc-service prometheus restart
echo "Prometheus configured for production monitoring! 📊"
What this creates: Production-ready Prometheus configuration with service discovery! ✅
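Because these jobs use file-based service discovery, you can add or remove targets without restarting Prometheus: the file_sd files are re-read automatically (every 5 minutes by default, tunable with refresh_interval). As an illustration, here is how you might register a hypothetical Redis exporter (port 9121 is the usual redis_exporter default; adapt it to whatever you actually run) and then check what Prometheus discovered:
# Append another target to the custom services file
cat >> /etc/prometheus/file_sd/services.yml << 'EOF'
- targets:
    - 'localhost:9121'
  labels:
    service: 'redis-exporter'
    environment: 'production'
    type: 'exporter'
EOF
# After the next file_sd refresh, list the discovered jobs
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[].labels.job' | sort | uniq -c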
Create Alerting Rules
Let’s set up intelligent alerting rules! 🚨
What we’re doing: Creating comprehensive alerting rules for system health, performance, and availability monitoring.
# Create alerting rules for system monitoring
cat > /etc/prometheus/rules/system-alerts.yml << 'EOF'
# System Alerting Rules for Alpine Linux
groups:
  - name: system.rules
    rules:
      # High CPU usage alert
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
          service: system
        annotations:
          summary: "High CPU usage detected"
          description: "CPU usage is above 80% for more than 5 minutes on {{ $labels.instance }}"

      # High memory usage alert
      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 85
        for: 5m
        labels:
          severity: warning
          service: system
        annotations:
          summary: "High memory usage detected"
          description: "Memory usage is above 85% on {{ $labels.instance }}"

      # Low disk space alert
      - alert: LowDiskSpace
        expr: (node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes * 100 > 90
        for: 10m
        labels:
          severity: critical
          service: system
        annotations:
          summary: "Low disk space warning"
          description: "Disk usage is above 90% on {{ $labels.instance }} {{ $labels.mountpoint }}"

      # System load alert
      - alert: HighSystemLoad
        expr: node_load15 > 2
        for: 10m
        labels:
          severity: warning
          service: system
        annotations:
          summary: "High system load detected"
          description: "15-minute load average is {{ $value }} on {{ $labels.instance }}"

      # Instance down alert
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
          service: monitoring
        annotations:
          summary: "Instance is down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute"

  - name: application.rules
    rules:
      # Application response time alert
      - alert: HighResponseTime
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
          service: application
        annotations:
          summary: "High application response time"
          description: "95th percentile response time is {{ $value }}s on {{ $labels.instance }}"

      # Error rate alert
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) * 100 > 5
        for: 5m
        labels:
          severity: critical
          service: application
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }}% on {{ $labels.instance }}"

  - name: infrastructure.rules
    rules:
      # Redis connection alert
      - alert: RedisDown
        expr: up{job="redis"} == 0
        for: 1m
        labels:
          severity: critical
          service: redis
        annotations:
          summary: "Redis is down"
          description: "Redis service is not responding on {{ $labels.instance }}"

      # Network connectivity alert
      - alert: HighNetworkTraffic
        expr: rate(node_network_receive_bytes_total[5m]) > 100000000  # 100MB/s
        for: 10m
        labels:
          severity: warning
          service: network
        annotations:
          summary: "High network traffic detected"
          description: "Network receive traffic is {{ $value | humanize }}B/s on {{ $labels.instance }}"
EOF
# Create recording rules for performance optimization
cat > /etc/prometheus/rules/recording-rules.yml << 'EOF'
# Recording Rules for Performance Optimization
groups:
  - name: performance.rules
    interval: 30s
    rules:
      # CPU usage recording rule
      - record: instance:cpu_utilization:rate5m
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
        labels:
          metric_type: performance

      # Memory usage recording rule
      - record: instance:memory_utilization:ratio
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes
        labels:
          metric_type: performance

      # Disk I/O recording rule
      - record: instance:disk_io:rate5m
        expr: rate(node_disk_io_time_seconds_total[5m])
        labels:
          metric_type: performance

      # Network traffic recording rule
      - record: instance:network_traffic:rate5m
        expr: rate(node_network_receive_bytes_total[5m]) + rate(node_network_transmit_bytes_total[5m])
        labels:
          metric_type: performance

  - name: application.recording
    interval: 60s
    rules:
      # Request rate recording rule
      - record: application:request_rate:rate5m
        expr: rate(http_requests_total[5m])
        labels:
          metric_type: application

      # Error rate recording rule
      - record: application:error_rate:rate5m
        expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])
        labels:
          metric_type: application
EOF
# Validate alerting rules
promtool check rules /etc/prometheus/rules/*.yml
# Set proper ownership for rules
chown -R prometheus:prometheus /etc/prometheus/rules/
# Reload Prometheus configuration
curl -X POST http://localhost:9090/-/reload
echo "Prometheus alerting rules configured! 🚨"
What this creates: Comprehensive alerting system for proactive monitoring! 🌟
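You can also unit-test alert rules offline with promtool before they ever fire in production. The following is a minimal sketch that feeds a synthetic up == 0 series to the InstanceDown rule; the input series, timings, and expected labels are illustrative and should be adapted to your own rules:
# Write a promtool unit test file
cat > /tmp/alert-tests.yml << 'EOF'
rule_files:
  - /etc/prometheus/rules/system-alerts.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'up{job="node-exporter", instance="localhost:9100"}'
        values: '0 0 0'
    alert_rule_test:
      - eval_time: 2m
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              service: monitoring
              job: node-exporter
              instance: localhost:9100
EOF
# Run the unit tests
promtool test rules /tmp/alert-tests.yml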
🛠️ Step 2: Install Node Exporter
Configure Node Exporter for System Metrics
Let’s set up Node Exporter to collect detailed system metrics! 😊
What we’re doing: Installing and configuring Node Exporter for comprehensive system monitoring.
# Start Node Exporter service
rc-service prometheus-node-exporter start
# Enable Node Exporter to start at boot
rc-update add prometheus-node-exporter default
# Create Node Exporter configuration
cat > /etc/conf.d/prometheus-node-exporter << 'EOF'
# Node Exporter Configuration
# NOTE: check /etc/init.d/prometheus-node-exporter for the exact variable name the
# init script reads. Newer node_exporter releases renamed the filesystem flags to
# --collector.filesystem.mount-points-exclude / --collector.filesystem.fs-types-exclude;
# run `node_exporter --help` to see which ones your installed version accepts.
NODE_EXPORTER_OPTS="--web.listen-address=0.0.0.0:9100 \
  --path.procfs=/proc \
  --path.sysfs=/sys \
  --path.rootfs=/ \
  --collector.filesystem.ignored-mount-points='^/(dev|proc|sys|var/lib/docker/.+)($|/)' \
  --collector.filesystem.ignored-fs-types='^(autofs|binfmt_misc|cgroup|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|mqueue|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|sysfs|tracefs)$' \
  --collector.textfile.directory=/var/lib/node_exporter/textfile_collector \
  --collector.cpu \
  --collector.diskstats \
  --collector.filesystem \
  --collector.loadavg \
  --collector.meminfo \
  --collector.netdev \
  --collector.netstat \
  --collector.stat \
  --collector.time \
  --collector.uname \
  --collector.vmstat"
EOF
# Create textfile collector directory
mkdir -p /var/lib/node_exporter/textfile_collector
chown prometheus:prometheus /var/lib/node_exporter/textfile_collector
# Create custom metrics collection script
cat > /usr/local/bin/custom-metrics-collector.sh << 'EOF'
#!/bin/sh
# Custom Metrics Collector for Alpine Linux
TEXTFILE_DIR="/var/lib/node_exporter/textfile_collector"
TEMP_FILE=$(mktemp)
# Collect Alpine package information
echo "# HELP alpine_packages_total Total number of installed packages" >> $TEMP_FILE
echo "# TYPE alpine_packages_total gauge" >> $TEMP_FILE
PACKAGE_COUNT=$(apk info | wc -l)
echo "alpine_packages_total $PACKAGE_COUNT" >> $TEMP_FILE
# Collect service status
echo "# HELP alpine_service_status Service status (1=running, 0=stopped)" >> $TEMP_FILE
echo "# TYPE alpine_service_status gauge" >> $TEMP_FILE
SERVICES="sshd chronyd syslog prometheus"
for service in $SERVICES; do
  if rc-service $service status >/dev/null 2>&1; then
    echo "alpine_service_status{service=\"$service\"} 1" >> $TEMP_FILE
  else
    echo "alpine_service_status{service=\"$service\"} 0" >> $TEMP_FILE
  fi
done
# Collect system uptime in seconds
echo "# HELP alpine_uptime_seconds System uptime in seconds" >> $TEMP_FILE
echo "# TYPE alpine_uptime_seconds gauge" >> $TEMP_FILE
UPTIME=$(awk '{print $1}' /proc/uptime)
echo "alpine_uptime_seconds $UPTIME" >> $TEMP_FILE
# Collect temperature if available
if [ -r /sys/class/thermal/thermal_zone0/temp ]; then
  echo "# HELP alpine_temperature_celsius CPU temperature in Celsius" >> $TEMP_FILE
  echo "# TYPE alpine_temperature_celsius gauge" >> $TEMP_FILE
  TEMP=$(cat /sys/class/thermal/thermal_zone0/temp)
  # Use awk instead of bc -- bc is not part of a default Alpine install
  TEMP_C=$(awk -v t="$TEMP" 'BEGIN { printf "%.1f", t / 1000 }')
  echo "alpine_temperature_celsius $TEMP_C" >> $TEMP_FILE
fi
# Atomically move the file to the textfile directory
mv $TEMP_FILE $TEXTFILE_DIR/custom_metrics.prom
EOF
chmod +x /usr/local/bin/custom-metrics-collector.sh
# Create cron job for custom metrics (busybox crond must be running for it to fire)
echo "*/1 * * * * /usr/local/bin/custom-metrics-collector.sh" | crontab -u prometheus -
rc-update add crond default
rc-service crond start
# Restart Node Exporter with new configuration
rc-service prometheus-node-exporter restart
# Test Node Exporter endpoint
echo "Testing Node Exporter metrics..."
curl -s http://localhost:9100/metrics | head -20
echo "Node Exporter configured for system monitoring! 📈"
What this does: Sets up comprehensive system metrics collection with custom Alpine-specific metrics! ✅
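Once the cron job has run at least once, the custom Alpine metrics appear next to the standard node metrics and can also be queried through Prometheus (this assumes curl and jq are installed, as earlier in the guide):
# Look at the raw textfile-collector metrics
curl -s http://localhost:9100/metrics | grep '^alpine_'
# Query one of them through Prometheus
curl -s 'http://localhost:9090/api/v1/query?query=alpine_service_status' | jq '.data.result'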
🎨 Step 3: Install and Configure Grafana
Install Grafana Visualization Platform
Let’s install Grafana for beautiful data visualization! 🎮
What we’re doing: Installing Grafana for creating stunning monitoring dashboards and visualizations.
# Install Grafana
apk add grafana
# Check Grafana version
grafana-server --version
# Create Grafana configuration directories (spelled out because busybox ash has no brace expansion)
mkdir -p /etc/grafana/provisioning/dashboards /etc/grafana/provisioning/datasources /etc/grafana/provisioning/notifiers
mkdir -p /var/lib/grafana/dashboards /var/lib/grafana/plugins
mkdir -p /var/log/grafana
# Set proper ownership
chown -R grafana:grafana /var/lib/grafana /var/log/grafana /etc/grafana
# Create Grafana configuration
cat > /etc/grafana/grafana.ini << 'EOF'
# Grafana Configuration for Alpine Linux Monitoring
[default]
# Instance name
instance_name = alpine-monitoring
[paths]
# Data directory
data = /var/lib/grafana
# Logs directory
logs = /var/log/grafana
# Plugins directory
plugins = /var/lib/grafana/plugins
# Provisioning directory
provisioning = /etc/grafana/provisioning
[server]
# Server settings
http_addr = 0.0.0.0
http_port = 3000
domain = localhost
root_url = http://localhost:3000/
serve_from_sub_path = false
router_logging = false
enable_gzip = true
[database]
# SQLite keeps things simple for a single-node install; no host/user/password needed
type = sqlite3
path = /var/lib/grafana/grafana.db
[session]
# Session configuration
provider = file
provider_config = sessions
cookie_name = grafana_sess
cookie_secure = false
session_life_time = 86400
[security]
# Security settings
admin_user = admin
admin_password = alpine_monitoring_2025
secret_key = alpine_grafana_secret_key_12345
login_remember_days = 7
cookie_username = grafana_user
cookie_remember_name = grafana_remember
disable_gravatar = true
[users]
# User management
allow_sign_up = false
allow_org_create = false
auto_assign_org = true
auto_assign_org_role = Viewer
default_theme = dark
[auth.anonymous]
# Anonymous access
enabled = false
[log]
# Logging configuration
mode = file
level = info
format = text
[log.file]
# File logging
log_rotate = true
max_lines = 1000000
max_size_shift = 28
daily_rotate = true
max_days = 7
[alerting]
# Legacy alerting must stay off when unified alerting is enabled
# (recent Grafana releases refuse to start with both turned on)
enabled = false
[unified_alerting]
# Unified alerting
enabled = true
[metrics]
# Metrics settings
enabled = true
interval_seconds = 10
[grafana_net]
url = https://grafana.net
[external_image_storage]
provider = local
[plugins]
# Plugin settings
enable_alpha = false
app_tls_skip_verify_insecure = false
EOF
# Start Grafana service
rc-service grafana start
# Enable Grafana to start at boot
rc-update add grafana default
echo "Grafana installed and configured! 🎨"
echo "Access Grafana at: http://localhost:3000"
echo "Default login: admin / alpine_monitoring_2025"
What this does: Installs and configures Grafana with optimal settings for Alpine Linux monitoring! 🌟
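A quick way to confirm the service actually came up is the health endpoint plus an authenticated API call using the admin credentials from grafana.ini (give the first start a few seconds to initialize the SQLite database):
# Check Grafana's health endpoint
curl -s http://localhost:3000/api/health | jq .
# Confirm the admin login works against the API
curl -s -u admin:alpine_monitoring_2025 http://localhost:3000/api/org | jq .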
Configure Grafana Data Sources
Let’s set up Prometheus as a data source in Grafana! 🔧
What we’re doing: Configuring Grafana to connect to Prometheus and other data sources automatically.
# Create Prometheus data source configuration
cat > /etc/grafana/provisioning/datasources/prometheus.yml << 'EOF'
# Grafana Data Sources Configuration
apiVersion: 1

datasources:
  # Primary Prometheus data source
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
    editable: true
    jsonData:
      httpMethod: POST
      queryTimeout: 60s
      timeInterval: 30s
      # Only useful if a tracing backend (e.g. Jaeger) is running on port 16686
      exemplarTraceIdDestinations:
        - name: traceID
          url: http://localhost:16686/trace/${__value.raw}
    secureJsonData: {}

  # NOTE: Node Exporter is scraped by Prometheus and has no query API of its own,
  # so it is not added as a separate data source here.

  # TestData for examples and testing
  - name: TestData
    type: testdata
    access: proxy
    isDefault: false
    editable: true
EOF
# Create dashboard provisioning configuration
cat > /etc/grafana/provisioning/dashboards/default.yml << 'EOF'
# Dashboard Provisioning Configuration
apiVersion: 1

providers:
  # System monitoring dashboards
  - name: 'alpine-system'
    orgId: 1
    folder: 'System Monitoring'
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /var/lib/grafana/dashboards/system

  # Application monitoring dashboards
  - name: 'alpine-apps'
    orgId: 1
    folder: 'Applications'
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /var/lib/grafana/dashboards/applications

  # Infrastructure monitoring dashboards
  - name: 'alpine-infrastructure'
    orgId: 1
    folder: 'Infrastructure'
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /var/lib/grafana/dashboards/infrastructure
EOF
# Create dashboard directories (no brace expansion in busybox ash)
mkdir -p /var/lib/grafana/dashboards/system /var/lib/grafana/dashboards/applications /var/lib/grafana/dashboards/infrastructure
# Set proper ownership
chown -R grafana:grafana /etc/grafana/provisioning /var/lib/grafana/dashboards
# Restart Grafana to load new configuration
rc-service grafana restart
echo "Grafana data sources configured! 🔗"
What this creates: Automatic data source configuration for seamless monitoring! ✅
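To confirm provisioning picked everything up, you can list the data sources over the API and run a test query straight against Prometheus (the per-datasource proxy endpoints vary between Grafana versions, so querying Prometheus directly is the portable check):
# List provisioned data sources
curl -s -u admin:alpine_monitoring_2025 http://localhost:3000/api/datasources | jq '.[].name'
# Make sure the backing Prometheus answers queries
curl -s 'http://localhost:9090/api/v1/query?query=up' | jq '.status'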
Create System Monitoring Dashboard
Let’s create a comprehensive system monitoring dashboard! 📊
What we’re doing: Creating a beautiful and functional dashboard for monitoring Alpine Linux system metrics.
# Create comprehensive system monitoring dashboard
cat > /var/lib/grafana/dashboards/system/alpine-system-overview.json << 'EOF'
{
"annotations": {
"list": [
{
"builtIn": 1,
"datasource": "-- Grafana --",
"enable": true,
"hide": true,
"iconColor": "rgba(0, 211, 255, 1)",
"name": "Annotations & Alerts",
"type": "dashboard"
}
]
},
"editable": true,
"gnetId": null,
"graphTooltip": 0,
"id": null,
"links": [],
"panels": [
{
"datasource": "Prometheus",
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 10,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"vis": false
},
"lineInterpolation": "linear",
"lineWidth": 1,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "never",
"spanNulls": false,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
},
"unit": "percent"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 0
},
"id": 1,
"options": {
"legend": {
"calcs": [],
"displayMode": "list",
"placement": "bottom"
},
"tooltip": {
"mode": "single"
}
},
"targets": [
{
"expr": "100 - (avg by(instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
"interval": "",
"legendFormat": "CPU Usage",
"refId": "A"
}
],
"title": "CPU Usage",
"type": "timeseries"
},
{
"datasource": "Prometheus",
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 10,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"vis": false
},
"lineInterpolation": "linear",
"lineWidth": 1,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "never",
"spanNulls": false,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
},
"unit": "percent"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 0
},
"id": 2,
"options": {
"legend": {
"calcs": [],
"displayMode": "list",
"placement": "bottom"
},
"tooltip": {
"mode": "single"
}
},
"targets": [
{
"expr": "(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100",
"interval": "",
"legendFormat": "Memory Usage",
"refId": "A"
}
],
"title": "Memory Usage",
"type": "timeseries"
},
{
"datasource": "Prometheus",
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "yellow",
"value": 70
},
{
"color": "red",
"value": 90
}
]
},
"unit": "percent"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 6,
"x": 0,
"y": 8
},
"id": 3,
"options": {
"orientation": "auto",
"reduceOptions": {
"values": false,
"calcs": [
"lastNotNull"
],
"fields": ""
},
"showThresholdLabels": false,
"showThresholdMarkers": true,
"text": {}
},
"pluginVersion": "8.0.0",
"targets": [
{
"expr": "(node_filesystem_size_bytes{mountpoint=\"/\"} - node_filesystem_free_bytes{mountpoint=\"/\"}) / node_filesystem_size_bytes{mountpoint=\"/\"} * 100",
"interval": "",
"legendFormat": "Root Disk Usage",
"refId": "A"
}
],
"title": "Disk Usage",
"type": "gauge"
},
{
"datasource": "Prometheus",
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "yellow",
"value": 1
},
{
"color": "red",
"value": 2
}
]
},
"unit": "short"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 6,
"x": 6,
"y": 8
},
"id": 4,
"options": {
"orientation": "auto",
"reduceOptions": {
"values": false,
"calcs": [
"lastNotNull"
],
"fields": ""
},
"showThresholdLabels": false,
"showThresholdMarkers": true,
"text": {}
},
"pluginVersion": "8.0.0",
"targets": [
{
"expr": "node_load15",
"interval": "",
"legendFormat": "Load Average",
"refId": "A"
}
],
"title": "System Load",
"type": "gauge"
},
{
"datasource": "Prometheus",
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 10,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"vis": false
},
"lineInterpolation": "linear",
"lineWidth": 1,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "never",
"spanNulls": false,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
},
"unit": "binBps"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 8
},
"id": 5,
"options": {
"legend": {
"calcs": [],
"displayMode": "list",
"placement": "bottom"
},
"tooltip": {
"mode": "single"
}
},
"targets": [
{
"expr": "rate(node_network_receive_bytes_total[5m])",
"interval": "",
"legendFormat": "{{device}} - Receive",
"refId": "A"
},
{
"expr": "rate(node_network_transmit_bytes_total[5m])",
"interval": "",
"legendFormat": "{{device}} - Transmit",
"refId": "B"
}
],
"title": "Network Traffic",
"type": "timeseries"
}
],
"schemaVersion": 30,
"style": "dark",
"tags": ["alpine", "system", "monitoring"],
"templating": {
"list": []
},
"time": {
"from": "now-1h",
"to": "now"
},
"timepicker": {},
"timezone": "",
"title": "Alpine Linux System Overview",
"uid": "alpine-system-overview",
"version": 1
}
EOF
# Set proper ownership
chown -R grafana:grafana /var/lib/grafana/dashboards/
# Restart Grafana to load dashboards
rc-service grafana restart
echo "System monitoring dashboard created! 📊"
What this creates: Beautiful system monitoring dashboard with key Alpine Linux metrics! 🌟
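If the dashboard doesn't show up in the UI, the search API is a quick way to confirm whether provisioning loaded it (the title should match the one provisioned above):
# Search for the provisioned dashboard via the Grafana API
curl -s -u admin:alpine_monitoring_2025 'http://localhost:3000/api/search?query=Alpine' | jq '.[].title'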
📊 Quick Monitoring Commands Table
| Command | Purpose | Result |
|---|---|---|
| 🔧 promtool query instant http://localhost:9090 up | Check target status | ✅ Service availability |
| 🔍 curl localhost:9090/api/v1/targets | View Prometheus targets | ✅ Monitoring endpoints |
| 🚀 grafana-cli admin reset-admin-password <new-password> | Reset Grafana admin password | ✅ Access recovery |
| 📋 curl -s localhost:9100/metrics \| grep cpu | View Node Exporter CPU metrics | ✅ System metrics |
🎮 Practice Time!
Let’s practice what you learned! Try these monitoring scenarios:
Example 1: Application Performance Monitoring 🟢
What we’re doing: Setting up comprehensive application performance monitoring with custom metrics and alerting.
# Create application monitoring setup
mkdir -p /opt/app-monitoring
cd /opt/app-monitoring
# Create sample application with metrics endpoint
cat > app-metrics-server.py << 'EOF'
#!/usr/bin/env python3
"""
Sample Application with Prometheus Metrics
"""
import time
import random
from http.server import HTTPServer, BaseHTTPRequestHandler
from urllib.parse import urlparse


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        path = urlparse(self.path).path
        if path == '/metrics':
            self.send_response(200)
            self.send_header('Content-type', 'text/plain')
            self.end_headers()
            # Generate sample metrics
            cpu_usage = random.uniform(10, 90)
            memory_usage = random.uniform(30, 80)
            request_count = random.randint(100, 1000)
            response_time = random.uniform(0.1, 2.0)
            metrics = f"""# HELP app_cpu_usage_percent Application CPU usage
# TYPE app_cpu_usage_percent gauge
app_cpu_usage_percent {cpu_usage:.2f}
# HELP app_memory_usage_percent Application memory usage
# TYPE app_memory_usage_percent gauge
app_memory_usage_percent {memory_usage:.2f}
# HELP app_requests_total Total application requests
# TYPE app_requests_total counter
app_requests_total {request_count}
# HELP app_response_time_seconds Application response time
# TYPE app_response_time_seconds histogram
app_response_time_seconds_bucket{{le="0.1"}} {random.randint(10, 50)}
app_response_time_seconds_bucket{{le="0.5"}} {random.randint(50, 150)}
app_response_time_seconds_bucket{{le="1.0"}} {random.randint(150, 300)}
app_response_time_seconds_bucket{{le="2.0"}} {random.randint(300, 500)}
app_response_time_seconds_bucket{{le="+Inf"}} {random.randint(500, 600)}
app_response_time_seconds_sum {response_time * request_count:.2f}
app_response_time_seconds_count {request_count}
# HELP app_errors_total Total application errors
# TYPE app_errors_total counter
app_errors_total {random.randint(0, 50)}
# HELP app_uptime_seconds Application uptime
# TYPE app_uptime_seconds gauge
app_uptime_seconds {time.time()}
"""
            self.wfile.write(metrics.encode())
        elif path == '/health':
            self.send_response(200)
            self.send_header('Content-type', 'text/plain')
            self.end_headers()
            self.wfile.write(b'OK')
        else:
            self.send_response(404)
            self.end_headers()
            self.wfile.write(b'Not Found')


if __name__ == '__main__':
    server = HTTPServer(('localhost', 8080), MetricsHandler)
    print("Application metrics server running on http://localhost:8080/metrics")
    server.serve_forever()
EOF
# Install Python if not available
apk add python3
# Make the script executable
chmod +x app-metrics-server.py
# Start the application in background
python3 app-metrics-server.py &
APP_PID=$!
# The sample app listens on localhost:8080, which the 'application-metrics'
# file_sd job configured earlier (applications.yml) already scrapes, so nothing
# needs to be appended to prometheus.yml. If you want a faster scrape interval
# for it, adjust that job's scrape_interval and reload Prometheus:
curl -X POST http://localhost:9090/-/reload
# Create application dashboard
cat > /var/lib/grafana/dashboards/applications/application-performance.json << 'EOF'
{
  "title": "Application Performance Monitoring",
  "uid": "app-performance",
  "schemaVersion": 30,
  "panels": [
    {
      "title": "Application CPU Usage",
      "type": "stat",
      "gridPos": { "h": 8, "w": 6, "x": 0, "y": 0 },
      "targets": [
        { "expr": "app_cpu_usage_percent", "legendFormat": "CPU Usage %", "refId": "A" }
      ]
    },
    {
      "title": "Application Memory Usage",
      "type": "stat",
      "gridPos": { "h": 8, "w": 6, "x": 6, "y": 0 },
      "targets": [
        { "expr": "app_memory_usage_percent", "legendFormat": "Memory Usage %", "refId": "A" }
      ]
    },
    {
      "title": "Request Rate",
      "type": "timeseries",
      "gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 },
      "targets": [
        { "expr": "rate(app_requests_total[5m])", "legendFormat": "Requests/sec", "refId": "A" }
      ]
    },
    {
      "title": "Response Time",
      "type": "timeseries",
      "gridPos": { "h": 8, "w": 24, "x": 0, "y": 8 },
      "targets": [
        { "expr": "histogram_quantile(0.95, rate(app_response_time_seconds_bucket[5m]))", "legendFormat": "95th percentile", "refId": "A" }
      ]
    }
  ]
}
EOF
echo "Application performance monitoring configured! 🎯"
echo "Check metrics at: http://localhost:8080/metrics"
echo "Application PID: $APP_PID (kill with: kill $APP_PID)"
What this does: Shows you how to monitor application performance with custom metrics! 🎯
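Once Prometheus has scraped the sample app a couple of times, you can query its metrics directly. Using --data-urlencode keeps curl from mangling the brackets in the PromQL expression:
# Request rate and 95th percentile response time for the sample app
curl -s -G 'http://localhost:9090/api/v1/query' --data-urlencode 'query=rate(app_requests_total[5m])' | jq '.data.result'
curl -s -G 'http://localhost:9090/api/v1/query' --data-urlencode 'query=histogram_quantile(0.95, rate(app_response_time_seconds_bucket[5m]))' | jq '.data.result'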
Example 2: Infrastructure Alerting System 🟡
What we’re doing: Creating a comprehensive alerting system with multiple notification channels.
# Create advanced alerting configuration
mkdir -p /opt/alerting-system
cd /opt/alerting-system
# Install Alertmanager
apk add prometheus-alertmanager
# Create Alertmanager configuration
cat > /etc/prometheus/alertmanager.yml << 'EOF'
# Alertmanager Configuration for Infrastructure Monitoring
global:
  smtp_smarthost: 'localhost:587'
  smtp_from: '[email protected]'
  smtp_auth_username: 'alertmanager'
  smtp_auth_password: 'password'

templates:
  - '/etc/prometheus/templates/*.tmpl'

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'default'
  routes:
    # Critical alerts go to immediate notification
    - match:
        severity: critical
      receiver: 'critical-alerts'
      group_wait: 5s
      repeat_interval: 5m
    # Warning alerts go to standard notification
    - match:
        severity: warning
      receiver: 'warning-alerts'
      repeat_interval: 30m
    # System alerts
    - match:
        service: system
      receiver: 'system-alerts'

receivers:
  - name: 'default'
    webhook_configs:
      # The custom webhook receiver below listens on port 9095
      # (9093 is already used by Alertmanager itself)
      - url: 'http://localhost:9095/webhook'
        send_resolved: true

  - name: 'critical-alerts'
    email_configs:
      - to: '[email protected]'
        subject: '🚨 CRITICAL: {{ .GroupLabels.alertname }}'
        body: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          Severity: {{ .Labels.severity }}
          Instance: {{ .Labels.instance }}
          Time: {{ .StartsAt }}
          {{ end }}
    webhook_configs:
      - url: 'http://localhost:9095/webhook/critical'
        send_resolved: true

  - name: 'warning-alerts'
    email_configs:
      - to: '[email protected]'
        subject: '⚠️ WARNING: {{ .GroupLabels.alertname }}'
        body: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          Instance: {{ .Labels.instance }}
          {{ end }}

  - name: 'system-alerts'
    webhook_configs:
      - url: 'http://localhost:9095/webhook/system'
        send_resolved: true

inhibit_rules:
  # Inhibit warning alerts if critical alerts are firing
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
EOF
# Create alert notification webhook receiver
cat > alert-webhook-receiver.py << 'EOF'
#!/usr/bin/env python3
"""
Alert Webhook Receiver for Custom Notifications
"""
import json
import time
from http.server import HTTPServer, BaseHTTPRequestHandler
from urllib.parse import urlparse


class AlertWebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        path = urlparse(self.path).path
        content_length = int(self.headers['Content-Length'])
        post_data = self.rfile.read(content_length)
        try:
            alert_data = json.loads(post_data.decode('utf-8'))
            self.process_alert(alert_data, path)
            self.send_response(200)
            self.send_header('Content-type', 'application/json')
            self.end_headers()
            self.wfile.write(b'{"status": "ok"}')
        except Exception as e:
            print(f"Error processing alert: {e}")
            self.send_response(500)
            self.end_headers()

    def process_alert(self, alert_data, path):
        timestamp = time.strftime('%Y-%m-%d %H:%M:%S')
        print(f"\n{'='*50}")
        print(f"ALERT RECEIVED - {timestamp}")
        print(f"Webhook Path: {path}")
        print(f"{'='*50}")
        for alert in alert_data.get('alerts', []):
            status = alert.get('status', 'unknown')
            labels = alert.get('labels', {})
            annotations = alert.get('annotations', {})
            print(f"Status: {status}")
            print(f"Alert: {labels.get('alertname', 'Unknown')}")
            print(f"Severity: {labels.get('severity', 'Unknown')}")
            print(f"Instance: {labels.get('instance', 'Unknown')}")
            print(f"Summary: {annotations.get('summary', 'No summary')}")
            print(f"Description: {annotations.get('description', 'No description')}")
            if status == 'firing':
                print("🚨 ALERT IS FIRING!")
                self.log_to_file(alert, 'FIRING')
            elif status == 'resolved':
                print("✅ ALERT RESOLVED")
                self.log_to_file(alert, 'RESOLVED')
            print("-" * 30)

    def log_to_file(self, alert, status):
        timestamp = time.strftime('%Y-%m-%d %H:%M:%S')
        labels = alert.get('labels', {})
        annotations = alert.get('annotations', {})
        log_entry = {
            'timestamp': timestamp,
            'status': status,
            'alertname': labels.get('alertname'),
            'severity': labels.get('severity'),
            'instance': labels.get('instance'),
            'summary': annotations.get('summary'),
            'description': annotations.get('description')
        }
        with open('/var/log/alerts.log', 'a') as f:
            f.write(json.dumps(log_entry) + '\n')


if __name__ == '__main__':
    # Listen on 9095 so we don't collide with Alertmanager, which owns port 9093
    server = HTTPServer(('localhost', 9095), AlertWebhookHandler)
    print("Alert webhook receiver running on http://localhost:9095/webhook")
    print("Logs will be written to /var/log/alerts.log")
    server.serve_forever()
EOF
chmod +x alert-webhook-receiver.py
# Start Alertmanager
rc-service prometheus-alertmanager start
rc-update add prometheus-alertmanager default
# Start alert webhook receiver in background
python3 alert-webhook-receiver.py &
WEBHOOK_PID=$!
# Create alert testing script
cat > test-alerts.sh << 'EOF'
#!/bin/sh
echo "🧪 Testing Alert System"
# Fire a test alert via the Alertmanager v2 API (the v1 API was removed in newer releases)
echo "Sending test alert..."
curl -X POST http://localhost:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[
  {
    "labels": {
      "alertname": "TestAlert",
      "severity": "warning",
      "instance": "localhost:9090",
      "service": "test"
    },
    "annotations": {
      "summary": "Test alert for monitoring system",
      "description": "This is a test alert to verify the alerting system is working correctly"
    },
    "startsAt": "'$(date -u +%Y-%m-%dT%H:%M:%SZ)'"
  }
]'
echo "Alert sent! Check webhook receiver output and /var/log/alerts.log"
EOF
chmod +x test-alerts.sh
echo "Advanced alerting system configured! 🚨"
echo "Webhook receiver PID: $WEBHOOK_PID (kill with: kill $WEBHOOK_PID)"
echo "Test alerts with: ./test-alerts.sh"
What this does: Demonstrates comprehensive alerting with custom notification handling! 🚨
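The alertmanager package normally ships amtool as well, which is handy for validating the configuration and inspecting what is currently firing without opening the web UI (treat this as optional if your package doesn't include it):
# Validate the Alertmanager configuration file
amtool check-config /etc/prometheus/alertmanager.yml
# List alerts currently held by Alertmanager
amtool alert query --alertmanager.url=http://localhost:9093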
🚨 Fix Common Problems
Problem 1: Prometheus targets down ❌
What happened: Prometheus cannot scrape metrics from targets. How to fix it: Check network connectivity and service configuration.
# Check Prometheus targets status (jq is used to parse API responses; install it if missing)
apk add jq
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health, lastError: .lastError}'
# Check service status
rc-service prometheus status
rc-service prometheus-node-exporter status
# Check network connectivity
netstat -tulpn | grep -E "(9090|9100)"
# Restart services if needed
rc-service prometheus restart
rc-service prometheus-node-exporter restart
Problem 2: Grafana dashboard not loading data ❌
What happened: Grafana cannot connect to Prometheus or display metrics. How to fix it: Verify data source configuration and queries.
# Test Prometheus connection from Grafana host
curl -s http://localhost:9090/api/v1/query?query=up
# Check Grafana logs
tail -f /var/log/grafana/grafana.log
# Restart Grafana service
rc-service grafana restart
# Test data source connectivity in Grafana UI
echo "Visit http://localhost:3000/datasources and test connections"
Don’t worry! Monitoring systems require fine-tuning - check connectivity and configurations systematically! 💪
💡 Simple Tips
- Start with basic metrics 📅 - Begin with CPU, memory, disk, and network monitoring
- Set meaningful alert thresholds 🌱 - Avoid alert fatigue with appropriate limits
- Create actionable dashboards 🤝 - Focus on metrics that help with decision making
- Regular maintenance 💪 - Monitor data retention and clean up old metrics (see the example below)
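Retention limits handle most cleanup automatically, but you can also drop specific series on demand through the TSDB admin API. This only works when Prometheus runs with --web.enable-admin-api (set via the command-line flags earlier), and the series selector here is just an example:
# Delete one custom series and compact away the tombstones
curl -s -g -X POST 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]=alpine_service_status'
curl -s -X POST 'http://localhost:9090/api/v1/admin/tsdb/clean_tombstones'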
✅ Check Everything Works
Let’s verify your monitoring stack is working perfectly:
# Complete monitoring system verification
cat > /usr/local/bin/monitoring-stack-check.sh << 'EOF'
#!/bin/sh
echo "=== Monitoring Stack System Check ==="
echo "1. Prometheus Server:"
if curl -s http://localhost:9090/api/v1/query?query=up >/dev/null; then
echo "✅ Prometheus is running and responding"
prometheus_version=$(curl -s http://localhost:9090/api/v1/status/buildinfo | jq -r '.data.version')
echo "Version: $prometheus_version"
targets_up=$(curl -s http://localhost:9090/api/v1/query?query=up | jq '.data.result | length')
echo "Active targets: $targets_up"
else
echo "❌ Prometheus is not responding"
fi
echo -e "\n2. Node Exporter:"
if curl -s http://localhost:9100/metrics >/dev/null; then
echo "✅ Node Exporter is running"
metrics_count=$(curl -s http://localhost:9100/metrics | wc -l)
echo "Metrics available: $metrics_count"
else
echo "❌ Node Exporter is not responding"
fi
echo -e "\n3. Grafana:"
if curl -s http://localhost:3000/api/health >/dev/null; then
echo "✅ Grafana is running"
grafana_version=$(curl -s http://localhost:3000/api/health | jq -r '.version')
echo "Version: $grafana_version"
else
echo "❌ Grafana is not responding"
fi
echo -e "\n4. Alertmanager:"
if curl -s http://localhost:9093/api/v2/status >/dev/null; then
echo "✅ Alertmanager is running"
alertmanager_version=$(curl -s http://localhost:9093/api/v2/status | jq -r '.versionInfo.version')
echo "Version: $alertmanager_version"
else
echo "❌ Alertmanager is not responding"
fi
echo -e "\n5. Data Collection Test:"
echo "Testing metric collection..."
# Test CPU metric (-g stops curl from globbing the {} and [] in the query)
cpu_metric=$(curl -s -g "http://localhost:9090/api/v1/query?query=100-(avg(irate(node_cpu_seconds_total{mode=\"idle\"}[5m]))*100)" | jq -r '.data.result[0].value[1]')
if [ "$cpu_metric" != "null" ]; then
echo "✅ CPU metrics: ${cpu_metric}%"
else
echo "❌ CPU metrics not available"
fi
# Test memory metric
memory_metric=$(curl -s "http://localhost:9090/api/v1/query?query=(node_memory_MemTotal_bytes-node_memory_MemAvailable_bytes)/node_memory_MemTotal_bytes*100" | jq -r '.data.result[0].value[1]')
if [ "$memory_metric" != "null" ]; then
echo "✅ Memory metrics: ${memory_metric}%"
else
echo "❌ Memory metrics not available"
fi
echo -e "\n6. Alert Rules:"
rules_count=$(curl -s http://localhost:9090/api/v1/rules | jq '.data.groups | length')
echo "Loaded rule groups: $rules_count"
echo -e "\n7. Dashboard Status:"
if [ -f "/var/lib/grafana/dashboards/system/alpine-system-overview.json" ]; then
echo "✅ System dashboard available"
else
echo "❌ System dashboard missing"
fi
echo -e "\n8. Access URLs:"
echo "🌐 Prometheus: http://localhost:9090"
echo "🌐 Grafana: http://localhost:3000 (admin/alpine_monitoring_2025)"
echo "🌐 Alertmanager: http://localhost:9093"
echo "🌐 Node Exporter: http://localhost:9100/metrics"
echo -e "\nMonitoring stack operational! ✅"
EOF
chmod +x /usr/local/bin/monitoring-stack-check.sh
/usr/local/bin/monitoring-stack-check.sh
Good output shows:
=== Monitoring Stack System Check ===
1. Prometheus Server:
✅ Prometheus is running and responding
Version: 2.45.0
Active targets: 3
2. Node Exporter:
✅ Node Exporter is running
Metrics available: 1247
3. Grafana:
✅ Grafana is running
Version: 9.5.2
Monitoring stack operational! ✅
🏆 What You Learned
Great job! Now you can:
- ✅ Install and configure Prometheus for comprehensive metrics collection
- ✅ Set up Node Exporter for detailed system monitoring
- ✅ Configure Grafana for beautiful data visualization
- ✅ Create custom dashboards and monitoring workflows
- ✅ Implement intelligent alerting with Alertmanager
- ✅ Set up service discovery and target management
- ✅ Create application performance monitoring solutions
- ✅ Build comprehensive infrastructure monitoring stacks
- ✅ Troubleshoot common monitoring issues and optimize performance
🎯 What’s Next?
Now you can try:
- 📚 Setting up distributed monitoring with multiple Prometheus instances
- 🛠️ Implementing custom exporters for specific applications
- 🤝 Integrating with external alerting systems (Slack, PagerDuty)
- 🌟 Exploring advanced Grafana features like annotations and variables!
Remember: Effective monitoring is the foundation of reliable systems! You’re now building world-class observability on Alpine Linux! 🎉
Keep monitoring and you’ll master infrastructure observability! 💫