Issue #41: feat(infra): Deploy monitoring stack (Prometheus + Grafana + Loki)

State:

OPEN

Milestone:

Jalon 1: Sécurité & GDPR 🔒

Labels:

phase:vps,track:infrastructure priority:critical

Assignees:

Unassigned

Created:

2025-10-27

Updated:

2025-11-17

URL:

View on GitHub

Description

## Context

Current monitoring relies on **bash scripts + cron jobs** (`monitoring/scripts/*.sh`) which provide basic metrics but lack:
- Centralized metrics storage
- Real-time alerting
- Historical trend analysis
- Visual dashboards
- Log aggregation

Production deployment requires a proper observability stack for incident response and performance monitoring.

## Current Implementation (30% Complete)

**Existing monitoring scripts:**
- `monitoring/scripts/vps_metrics.sh` - RAM, CPU, disk, load average
- `monitoring/scripts/postgres_metrics.sh` - Slow queries, connections, cache hit ratio
- `monitoring/scripts/capacity_calculator.sh` - Database capacity estimation
- Cron jobs: health checks every 5 minutes

**Limitations:**
- No centralized storage (logs scattered)
- No alerting mechanism
- No dashboards
- Manual metric collection
- No log aggregation

## Objective

Deploy production-grade monitoring stack with:
1. **Prometheus** - Metrics collection & storage
2. **Grafana** - Dashboards & visualization
3. **Loki** - Log aggregation
4. **Alertmanager** - Alert routing & notification
5. **Node Exporter** - System metrics
6. **PostgreSQL Exporter** - Database metrics
7. **cAdvisor** - Container metrics

## Architecture

```
┌─────────────────┐
│   Grafana       │ ← Dashboard UI (port 3001)
│   (Dashboards)  │
└────────┬────────┘
         │
    ┌────┴────┬─────────┐
    │         │         │
┌───▼───┐ ┌──▼───┐ ┌───▼────┐
│Prom-  │ │ Loki │ │Alert-  │
│etheus │ │      │ │manager │
└───┬───┘ └──┬───┘ └────────┘
    │        │
┌───┴────────┴───────────┐
│  Exporters (scraping)   │
├─────────────────────────┤
│ - Node Exporter (VPS)   │
│ - PostgreSQL Exporter   │
│ - cAdvisor (containers) │
│ - Traefik metrics       │
│ - Application /metrics  │
└─────────────────────────┘
```

## Implementation Plan

### 1. Docker Compose Stack

**Create:** `monitoring/docker-compose.monitoring.yml`

```yaml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:v2.45.0
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
    ports:
      - "9090:9090"
    restart: unless-stopped

  grafana:
    image: grafana/grafana:10.0.3
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/dashboards:/etc/grafana/provisioning/dashboards
      - ./grafana/datasources:/etc/grafana/provisioning/datasources
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD}
      - GF_SERVER_ROOT_URL=https://${MONITORING_DOMAIN}
    ports:
      - "3001:3000"
    restart: unless-stopped

  loki:
    image: grafana/loki:2.9.0
    volumes:
      - loki_data:/loki
      - ./loki-config.yml:/etc/loki/local-config.yaml
    ports:
      - "3100:3100"
    restart: unless-stopped

  promtail:
    image: grafana/promtail:2.9.0
    volumes:
      - /var/log:/var/log:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - ./promtail-config.yml:/etc/promtail/config.yml
    command: -config.file=/etc/promtail/config.yml
    restart: unless-stopped

  alertmanager:
    image: prom/alertmanager:v0.26.0
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports:
      - "9093:9093"
    restart: unless-stopped

  node-exporter:
    image: prom/node-exporter:v1.6.1
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    ports:
      - "9100:9100"
    restart: unless-stopped

  postgres-exporter:
    image: prometheuscommunity/postgres-exporter:v0.13.2
    environment:
      DATA_SOURCE_NAME: "postgresql://koprogo:${POSTGRES_PASSWORD}@postgres:5432/koprogo_db?sslmode=disable"
    ports:
      - "9187:9187"
    restart: unless-stopped

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.47.0
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker:/var/lib/docker:ro
    ports:
      - "8082:8080"
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:
  loki_data:
```

### 2. Prometheus Configuration

**Create:** `monitoring/prometheus.yml`

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - '/etc/prometheus/alerts/*.yml'

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'postgres-exporter'
    static_configs:
      - targets: ['postgres-exporter:9187']

  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']

  - job_name: 'traefik'
    static_configs:
      - targets: ['traefik:8080']

  - job_name: 'koprogo-backend'
    static_configs:
      - targets: ['backend:8080']
    metrics_path: '/metrics'
```

### 3. Alert Rules

**Create:** `monitoring/alerts/koprogo.yml`

```yaml
groups:
  - name: koprogo_alerts
    rules:
      # High CPU
      - alert: HighCPUUsage
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}%"

      # High Memory
      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value }}%"

      # Disk Space
      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 20
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "{{ $value }}% remaining"

      # PostgreSQL
      - alert: PostgreSQLDown
        expr: pg_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "PostgreSQL is down"

      - alert: PostgreSQLSlowQueries
        expr: rate(pg_stat_statements_mean_time_seconds[5m]) > 0.005
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "PostgreSQL P99 latency > 5ms target"
          description: "Average query time: {{ $value }}s"

      # Container
      - alert: ContainerDown
        expr: up{job="cadvisor"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Container {{ $labels.name }} is down"

      # Backup
      - alert: BackupFailed
        expr: time() - koprogo_last_backup_timestamp_seconds > 86400
        for: 1h
        labels:
          severity: critical
        annotations:
          summary: "Backup has not run in 24h"
```

### 4. Grafana Dashboards

**Create pre-configured dashboards:**
- `monitoring/grafana/dashboards/koprogo-overview.json` - System overview
- `monitoring/grafana/dashboards/postgres.json` - PostgreSQL metrics
- `monitoring/grafana/dashboards/docker.json` - Container metrics
- `monitoring/grafana/dashboards/traefik.json` - HTTP traffic

**Import community dashboards:**
- Node Exporter Full (ID: 1860)
- PostgreSQL Database (ID: 9628)
- Docker and System Monitoring (ID: 179)
- Traefik 2 (ID: 11462)

### 5. Alertmanager Configuration

**Create:** `monitoring/alertmanager.yml`

```yaml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'severity']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'email'

receivers:
  - name: 'email'
    email_configs:
      - to: '${ALERT_EMAIL}'
        from: 'alertmanager@koprogo.com'
        smarthost: 'smtp.gmail.com:587'
        auth_username: '${SMTP_USERNAME}'
        auth_password: '${SMTP_PASSWORD}'
        headers:
          Subject: '[KoproGo] {{ .GroupLabels.alertname }}'
```

### 6. Backend Metrics Endpoint

**Add to backend:**

`backend/src/infrastructure/web/metrics.rs` (new):
```rust
use actix_web::{get, HttpResponse};
use prometheus::{Encoder, TextEncoder, Registry};

#[get("/metrics")]
async fn metrics() -> HttpResponse {
    let encoder = TextEncoder::new();
    let metric_families = prometheus::gather();
    let mut buffer = vec![];
    encoder.encode(&metric_families, &mut buffer).unwrap();

    HttpResponse::Ok()
        .content_type("text/plain; version=0.0.4")
        .body(buffer)
}
```

Add `prometheus` crate to `Cargo.toml`:
```toml
prometheus = "0.13"
```

### 7. Ansible Deployment

**Update:** `infrastructure/ansible/playbook.yml`

```yaml
- name: Create monitoring directory
  file:
    path: /opt/koprogo/monitoring
    state: directory
    owner: koprogo
    mode: '0755'

- name: Copy monitoring configs
  template:
    src: "{{ item }}"
    dest: /opt/koprogo/monitoring/
  with_fileglob:
    - "monitoring/*.yml"

- name: Start monitoring stack
  command: docker-compose -f /opt/koprogo/monitoring/docker-compose.monitoring.yml up -d
```

## Testing & Validation

- [ ] All exporters scraping successfully (Prometheus targets page)
- [ ] Grafana dashboards loading with data
- [ ] Alerts firing correctly (test by triggering conditions)
- [ ] Logs visible in Loki
- [ ] Email alerts received
- [ ] Performance impact acceptable (<5% CPU overhead)
- [ ] Retention policy working (30d metrics, 7d logs)

## Security

- [ ] Grafana admin password strong (min 20 chars)
- [ ] Prometheus/Grafana not exposed publicly (localhost or VPN only)
- [ ] Traefik reverse proxy with authentication if exposed
- [ ] SMTP credentials in Ansible vault

## Documentation

- [ ] Update `monitoring/README.md` with access URLs
- [ ] Document alert thresholds and tuning
- [ ] Create runbook for common alerts
- [ ] Update CLAUDE.md with monitoring architecture

## Acceptance Criteria

- [ ] Prometheus scraping all metrics (VPS, PostgreSQL, containers, Traefik, backend)
- [ ] Grafana dashboards operational (4 pre-configured dashboards)
- [ ] Loki aggregating logs from containers and VPS
- [ ] Alertmanager sending email notifications
- [ ] Critical alerts configured (CPU, memory, disk, PostgreSQL, backups)
- [ ] Documentation complete
- [ ] Monitoring stack integrated with Ansible deployment

## Resource Requirements

**Estimated overhead:**
- Prometheus: ~200MB RAM, 10GB disk (30d retention)
- Grafana: ~100MB RAM
- Loki: ~150MB RAM, 5GB disk (7d retention)
- Exporters: ~50MB RAM total
- **Total: ~500MB RAM, 15GB disk**

**VPS impact:** Acceptable on 2GB VPS (25% overhead)

## Effort Estimate

**Large** (3-5 days)
- Day 1: Docker Compose stack + Prometheus
- Day 2: Grafana dashboards
- Day 3: Loki + log aggregation
- Day 4: Alert rules + testing
- Day 5: Documentation + refinement

## Related

- Supports: Issue #40 (backup monitoring)
- Supports: Performance optimization (P99 latency tracking)
- Enables: Production incident response

## References

- Prometheus: https://prometheus.io/docs/
- Grafana: https://grafana.com/docs/
- Loki: https://grafana.com/docs/loki/
- Node Exporter: https://github.com/prometheus/node_exporter
- PostgreSQL Exporter: https://github.com/prometheus-community/postgres_exporter