hardware infra document

Work Log

System Architecture

Our infrastructure is built on a three-node Proxmox cluster. The cluster provides the main virtualization platform for project services and internal virtual machines. In addition to the Proxmox cluster, we deploy dedicated infrastructure services for shared storage, backup, logging, and monitoring.

Core Components

Component                    Hostname  IP              Role
Proxmox Node 1               pve1      172.16.127.16   Proxmox cluster node
Proxmox Node 2               pve2      172.16.127.17   Proxmox cluster node; hosts the PBS VM
Proxmox Node 3               pve3      172.16.127.18   Proxmox cluster node
Proxmox Backup Server        backup    172.16.127.112  Backup service for Proxmox VMs
Logging / Monitoring Server  logging   172.16.127.115  Grafana, Loki, and Prometheus

High-Level Design

The infrastructure is divided into four main layers:

  1. Virtualization Layer

    The virtualization layer is provided by a three-node Proxmox cluster consisting of pve1, pve2, and pve3. These nodes are joined into the same Proxmox cluster and are managed under the same Proxmox datacenter view.

  2. Shared Storage Layer

    Ceph is enabled on the Proxmox cluster. An RBD storage pool is created and used as shared storage for VM disks, so every Proxmox node can access the same VM disks.

  3. Backup Layer

    Proxmox Backup Server is deployed as a VM on pve2. The PBS VM is used to store and manage VM backups from the Proxmox cluster.

    Currently, the PBS VM is fixed on pve2 and is not configured with High Availability. If pve2 goes down, the backup service will not automatically migrate to another node.

  4. Logging and Monitoring Layer

    Grafana, Loki, and Prometheus are deployed on the logging server.

    • Loki is used to store logs.

    • Grafana is used to visualize logs and metrics.

    • Prometheus is used to collect metrics.

    • Grafana Alloy is installed on Proxmox nodes to collect systemd journal logs and forward them to Loki.

    • node-exporter is installed on Proxmox nodes to expose host metrics to Prometheus.

Architecture Diagram

                           ┌─────────────────────────────┐
                           │        Proxmox Cluster      │
                           │                             │
                           │  ┌──────┐ ┌──────┐ ┌──────┐ │
                           │  │ pve1 │ │ pve2 │ │ pve3 │ │
                           │  └──────┘ └──────┘ └──────┘ │
                           └──────────────┬──────────────┘
                                          │
                    ┌─────────────────────┼─────────────────────┐
                    │                     │                     │
                    v                     v                     v
          ┌─────────────────┐   ┌─────────────────┐   ┌─────────────────────┐
          │   Ceph RBD      │   │ Proxmox Backup  │   │ Logging / Monitoring│
          │ Shared Storage  │   │ Server VM       │   │ Server              │
          │                 │   │                 │   │                     │
          │ VM disk storage │   │ runs on pve2    │   │ Grafana             │
          └─────────────────┘   │ no HA enabled   │   │ Loki                │
                                └─────────────────┘   │ Prometheus          │
                                                      └─────────────────────┘

Known issue

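  • Some machines cannot reboot on their own and have to go through the BIOS; we suspect the previous maintainers did not clean things up completely. (So if the SC team reboots Proxmox, someone may need to go to the server room to bring the machine back up.)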

Network Design

This infrastructure mainly uses an internal private network for Proxmox nodes, physical machines, and virtual machines. Public IP addresses are only used where external access is required.

Internal Network

The internal network uses the following subnet:

Item                  Value
Internal subnet       172.16.0.0/16
Gateway               172.16.0.1
Project usable range  172.16.127.0 - 172.16.127.200

The project mainly uses the 172.16.127.0/24 range for infrastructure machines, Proxmox nodes, and virtual machines. IP addresses after 172.16.127.200 are reserved for the SC machines.
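
As a quick sanity check on the addressing plan, the following Python sketch uses the standard ipaddress module to test whether an address lies inside the internal subnet and inside the project usable range. The function name and the sample addresses are illustrative assumptions, not part of the existing tooling.

```python
import ipaddress

# Values taken from the Internal Network table.
INTERNAL_NET = ipaddress.ip_network("172.16.0.0/16")
PROJECT_FIRST = ipaddress.ip_address("172.16.127.0")
PROJECT_LAST = ipaddress.ip_address("172.16.127.200")


def in_project_range(addr: str) -> bool:
    """Return True if addr is in 172.16.0.0/16 and in the project usable range."""
    ip = ipaddress.ip_address(addr)
    return ip in INTERNAL_NET and PROJECT_FIRST <= ip <= PROJECT_LAST


# Hypothetical candidates: a Proxmox node, the PBS VM, and an address past .200.
for candidate in ("172.16.127.16", "172.16.127.112", "172.16.127.201"):
    print(candidate, in_project_range(candidate))
```

Running a check like this before assigning a new VM address helps avoid stepping outside the project range.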

IP Allocation

Range                          Usage
172.16.127.11 - 172.16.127.15  SC physical machines
172.16.127.16 - 172.16.127.18  Proxmox cluster nodes
172.16.127.101 and above       Project VMs and infrastructure services
172.16.127.112                 Proxmox Backup Server
172.16.127.115                 Logging / monitoring server

Important Infrastructure IPs

Name     Hostname  IP              Usage
SC1      sc1       172.16.127.11   SC physical machine
SC2      sc2       172.16.127.12   SC physical machine
SC3      sc3       172.16.127.13   SC physical machine
SC4      sc4       172.16.127.14   SC physical machine
SC5      sc5       172.16.127.15   SC physical machine
RM1      pve1      172.16.127.16   Proxmox node
RM2      pve2      172.16.127.17   Proxmox node
RM3      pve3      172.16.127.18   Proxmox node
PBS      backup    172.16.127.112  Backup server
Logging  logging   172.16.127.115  Grafana, Loki, Prometheus

Public Network

The public IP range is available through VLAN tagging.

Item             Value
Public IP range  140.112.187.48-55/27
Public gateway   140.112.187.62
VLAN tag         187

Currently, public IP addresses are only used by the firewall. Most internal services and VMs use the private 172.16.0.0/16 network.
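
To see how the public range and its gateway fit together, the short Python sketch below derives the /27 subnet from one of the public addresses and checks membership. It only uses the standard ipaddress module; the helper name is an assumption made for illustration.

```python
import ipaddress

# Any address from the public range plus the /27 prefix identifies the subnet.
public_net = ipaddress.ip_interface("140.112.187.48/27").network
gateway = ipaddress.ip_address("140.112.187.62")


def in_public_subnet(addr: str) -> bool:
    """Return True if addr belongs to the same /27 as the public range."""
    return ipaddress.ip_address(addr) in public_net


print(public_net)             # the enclosing /27 network
print(gateway in public_net)  # whether the gateway is on the same subnet
```

The gateway has to be in the same subnet as the assigned public addresses, which is what the second check confirms.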

how to configure pve’s IP

ip a
vi /etc/network/interfaces
systemctl restart networking
ip link set up ...

  • /etc/network/interfaces

iface nic1 inet manual

auto vmbr0
iface vmbr0 inet static
    address 172.16.127.12/16
    gateway 172.16.0.1
    bridge-ports nic1
    bridge-stp off
    bridge-fd 0

changing hostname

vi /etc/hosts ## sc1.nasa sc1
vi /etc/hostname ## sc1
reboot

VM Network Design

All VMs are connected through the Proxmox Linux bridge vmbr0.

VM network interface
        │
        v
Proxmox vmbr0
        │
        v
Physical network interface
        │
        v
Internal network: 172.16.0.0/16

how to configure VM static IP

A typical VM network configuration uses a static internal IP address, the internal gateway, and public DNS resolvers.

ubuntu

modify /etc/netplan/*.yaml

network:
  version: 2
  ethernets:
    ens18:
      dhcp4: false
      addresses:
        - 172.16.127.101/16
      routes:
        - to: default
          via: 172.16.0.1
      nameservers:
        addresses:
          - 8.8.8.8
          - 1.1.1.1

Then apply the configuration:

sudo netplan apply

arch linux

sudo nmcli connection modify "Wired connection 1" \
  ipv4.addresses 172.16.127.105/16 \
  ipv4.gateway 172.16.0.1 \
  ipv4.dns "8.8.8.8 1.1.1.1" \
  ipv4.method manual

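windows (and installation guide)

  1. Install using the Windows ISO and the driver ISO.

  2. When creating the VM, remember to install the network driver. If you skip it, you can install it later from inside Windows.

  3. To skip the "connect to a network" step: press Shift+F10 to open cmd, then enter OOBE\BYPASSNRO, which reboots the machine.

  4. Go to Settings -> Ethernet -> IP assignment, choose Manual, and turn on IPv4: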

    IP: 172.16.127.106
    subnet mask: 255.255.0.0
    gateway: 172.16.0.1
    DNS: 8.8.8.8
    

VMs network and service

IP              hostname         username   os
172.16.127.101  room1-nasa3      room1      ubuntu
172.16.127.102  mail1-nasa3      mail1      ubuntu
172.16.127.103  printu1-nasa3    printu1    ubuntu
172.16.127.104  database1-nasa3  database1  ubuntu
172.16.127.105  iden1-nasa3      iden1      arch
172.16.127.106  printw1-nasa3    printw1    windows
172.16.127.107  room2-nasa3      room2      ubuntu
172.16.127.108  room3-nasa3      room3      ubuntu
172.16.127.109  iden2-nasa3      iden2      arch
172.16.127.110  wifi1-nasa3      wifi1      ubuntu
172.16.127.112  backup           root       proxmox-backup-server
172.16.127.113  uiux1            uiux1      ubuntu
172.16.127.114  uiux2            uiux2      ubuntu
172.16.127.115  log              logging    ubuntu
172.16.127.116  mail2            mail2      ubuntu
172.16.127.117  mail3            mail3      ubuntu

Known issue

  • USB NIC instability: because the BIOS is an old version, some SC machines (sc1, sc2, sc4) use USB network adapters. One known issue is that the USB NIC on sc1 (172.16.127.11) may disconnect by itself shortly after being connected.

To mitigate this issue, USB autosuspend should be disabled through GRUB.

Edit /etc/default/grub and add usbcore.autosuspend=-1 to the kernel command line, e.g.

GRUB_CMDLINE_LINUX_DEFAULT="quiet usbcore.autosuspend=-1"

Then update GRUB:

sudo update-grub

  • Some machines have fake enx... interfaces, but we have not removed those fake interfaces.

Proxmox Cluster Setup

The virtualization platform is built as a three-node Proxmox cluster. The cluster was initially created on pve1, and pve2 and pve3 were later joined into the same cluster.

The cluster allows all Proxmox nodes to be managed under the same datacenter view and provides the base infrastructure for VM management, shared storage, backup, and monitoring.

Cluster Overview

Node  Hostname  IP             Role
RM1   pve1      172.16.127.16  Proxmox cluster node; initial cluster creation node
RM2   pve2      172.16.127.17  Proxmox cluster node
RM3   pve3      172.16.127.18  Proxmox cluster node

All three nodes are connected through the internal network 172.16.0.0/16.
The Proxmox cluster communication uses the network interface configured on the Proxmox nodes.

Cluster Creation

The cluster was created from pve1.

After the cluster was created, the other two Proxmox nodes were joined into the cluster:

pve1 creates the cluster
pve2 joins the cluster
pve3 joins the cluster

commands

pvecm create rm-nasa        # on pve1
pvecm add 172.16.127.16     # on pve2 and pve3

Cluster Purpose

The Proxmox cluster is used to:

  • manage all Proxmox nodes from a single datacenter view

  • host project virtual machines

  • provide the base environment for Ceph shared storage

  • integrate with Proxmox Backup Server

  • provide the foundation for HA and VM migration

HA and VM migration depend on shared storage and are documented in the shared storage section.

Note

  • USB network adapters should not be used for Proxmox cluster communication because they may be unstable. In this setup, USB NIC issues are limited to SC machines and are documented in the Network Design section.

  • You can use pvecm status to verify the cluster's health.
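
As a rough illustration of what to look for in the pvecm status output, the Python sketch below checks the "Quorate" line and computes the majority needed for a given node count. The sample text is a shortened, illustrative excerpt rather than captured output, the function names are hypothetical, and the vote calculation assumes the default of one vote per node.

```python
def is_quorate(pvecm_status_output: str) -> bool:
    """Return True if the status text contains a 'Quorate: Yes' line."""
    for line in pvecm_status_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Quorate":
            return value.strip().lower() == "yes"
    return False


def quorum_votes(node_count: int) -> int:
    """Majority of votes, assuming one vote per node."""
    return node_count // 2 + 1


# Illustrative excerpt only; real output contains more fields.
SAMPLE = """\
Quorum information
------------------
Nodes:            3
Quorate:          Yes
"""

print(is_quorate(SAMPLE), quorum_votes(3))
```

With three nodes the majority is two votes, so the cluster stays quorate when a single node is down.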

Known Issue

  • Cluster Reinstallation

During earlier setup, the machine using 172.16.127.17 was replaced. At that time, old cluster information could not be fully removed from pve1 and pve3, so all three Proxmox nodes were reinstalled and the cluster was rebuilt from scratch.

This issue is important for future maintenance:

  • avoid changing cluster node identity casually

  • keep hostname and IP address consistent

  • remove a node cleanly before replacing it

  • verify cluster state before rejoining a replaced node

  • if old cluster metadata cannot be removed cleanly, reinstallation may be required

Shared storage and HA

Ceph is used as the shared storage backend for the Proxmox cluster.
All three Proxmox nodes participate in the Ceph cluster, and each Proxmox node provides one Ceph OSD.

A Ceph RBD pool named vm-data is used as shared VM disk storage.

working history

setup flow

1. Enable Proxmox no-subscription repository
2. Install Ceph on all Proxmox nodes
3. Create Ceph monitors
4. Prepare one OSD disk on each Proxmox node
5. Create OSDs
6. Create the Ceph pool vm-data
7. Set pool size=3 and min_size=2
8. Add vm-data to Proxmox as RBD storage
9. Move VM disks to vm-data when HA/migration is required
10. Enable HA for VMs stored on vm-data
11. Test VM migration and HA failover

install ceph

Before installing Ceph, the Proxmox no-subscription repositories should be enabled.

Example repository configuration:

mv /etc/apt/sources.list.d/pve-enterprise.sources /etc/apt/sources.list.d/pve-enterprise.sources.bak
mv /etc/apt/sources.list.d/ceph.sources /etc/apt/sources.list.d/ceph.sources.bak
cat >/etc/apt/sources.list.d/proxmox.sources <<'EOF'
Types: deb
URIs: http://download.proxmox.com/debian/pve
Suites: trixie
Components: pve-no-subscription

Types: deb
URIs: http://download.proxmox.com/debian/ceph-squid
Suites: trixie
Components: no-subscription
EOF
apt update
apt upgrade

When installing Ceph from the Proxmox web UI, choose the no-subscription repository. Then create the monitors in the web UI, and after that create the OSDs.

install OSD

Before creating an OSD, the selected disk should be cleaned.

wipefs -a /dev/nvme1n1
sgdisk --zap-all /dev/nvme1n1
partprobe /dev/nvme1n1

In the web UI, go to ceph->OSD->create OSD to create the OSD. Then go to datacenter->pool->create pool to create a pool, and check that it shows up under datacenter->storage.

Purpose

The shared storage layer is used to:

  • provide shared VM disk storage across pve1, pve2, and pve3

  • allow VMs to be migrated between Proxmox nodes

  • support Proxmox HA for selected VMs

  • avoid tying VM disks to only one physical host

Current Ceph Configuration

Item           Value
Ceph nodes     pve1, pve2, pve3
OSD layout     one OSD per Proxmox node
RBD pool name  vm-data
Pool size      3
Pool min_size  2
Usage          shared VM disk storage

The pool uses size=3, which means Ceph keeps three replicated copies of data.
The min_size=2 setting means the pool can continue operating as long as at least two replicas are available.
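
The arithmetic behind size=3 and min_size=2 can be sketched in a few lines of Python. This is a simplified model that ignores Ceph overhead and full-ratio limits; the function name and the per-OSD capacity of 1000 GiB are hypothetical values chosen only for illustration.

```python
def replicated_pool_summary(raw_per_osd_gib: float, osd_count: int,
                            size: int, min_size: int) -> dict:
    """Rough capacity and fault-tolerance figures for a replicated pool."""
    raw_total = raw_per_osd_gib * osd_count
    return {
        "raw_gib": raw_total,
        # Every object is stored `size` times, so usable space is raw / size.
        "usable_gib": raw_total / size,
        # I/O continues while at least `min_size` replicas are available.
        "replica_losses_tolerated": size - min_size,
    }


# Hypothetical disks: three OSDs of 1000 GiB each, pool settings from this document.
print(replicated_pool_summary(1000, 3, size=3, min_size=2))
```

With one OSD per node, losing one node removes one replica, which the pool tolerates; losing a second would drop it below min_size.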

VM Storage Policy

Most VM disks are stored on the Ceph RBD pool vm-data.

Exceptions:

VM / Service           Storage Policy     Reason
Proxmox Backup Server  Not stored on RBD  PBS is fixed on pve2 and uses a dedicated disk/datastore
Firewall               Not stored on RBD  Firewall storage is managed separately
Other VMs              Stored on RBD      Supports migration and HA

HA Policy

HA is enabled for VMs whose disks are stored on shared RBD storage.

In this setup:

  • VMs stored on vm-data can be managed by Proxmox HA.

  • PBS is not HA-enabled because it is fixed on pve2.

  • Firewall is not included in this RBD-based HA setup.

  • HA behavior depends on the VM disk being stored on shared storage.

Before enabling HA for a VM, verify that its disks are stored on vm-data instead of local storage.
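
One way to perform this check is to inspect the output of qm config <vmid> and look at which storage each disk line refers to. The Python sketch below parses such text and lists disks that are not on vm-data. The function name and the sample configuration (including the local-lvm entry) are hypothetical, and the parser is a simplification that only handles the common "key: storage:volume,options" layout.

```python
import re

# Disk-like keys such as scsi0, virtio1, sata0, ide2, efidisk0, tpmstate0.
DISK_KEY = re.compile(r"^(?:scsi|virtio|sata|ide|efidisk|tpmstate)\d+$")


def disks_not_on_storage(qm_config_text: str, storage: str = "vm-data") -> list[str]:
    """Return the disk keys whose volume is not on the given storage."""
    offenders = []
    for line in qm_config_text.splitlines():
        key, sep, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if not sep or not DISK_KEY.match(key):
            continue
        if "media=cdrom" in value:
            continue  # CD-ROM drives are not VM disks
        volume = value.split(",", 1)[0]
        if volume.partition(":")[0] != storage:
            offenders.append(key)
    return offenders


# Hypothetical configuration text, not taken from a real VM.
SAMPLE = """\
boot: order=scsi0
scsihw: virtio-scsi-single
scsi0: vm-data:vm-101-disk-0,size=32G
scsi1: local-lvm:vm-101-disk-1,size=8G
ide2: none,media=cdrom
"""

print(disks_not_on_storage(SAMPLE))
```

An empty result suggests every disk is on the shared pool; any listed key should be moved to vm-data before HA is enabled for that VM.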

Ceph Components

Component  Description
Ceph MON   Maintains Ceph cluster state
Ceph OSD   Stores actual data on physical disks
Ceph Pool  Logical storage pool created on top of OSDs
RBD        Block storage interface used by Proxmox for VM disks

Failure and HA Test

We tested the HA behavior under failure conditions.

Tested scenarios:

  1. One Proxmox node lost network connectivity.

  2. One Proxmox node was unexpectedly shut down.

  3. One disk failed (to be done).

Result:

  • Proxmox HA was able to detect the failed node.

  • HA-managed VMs were restarted on another available Proxmox node.

  • VMs using the shared RBD storage were able to continue operating after failover.

  • The test confirmed that the Ceph RBD shared storage and Proxmox HA setup work correctly for the tested failure cases.

This confirms that the current shared storage design supports HA failover for VMs stored on vm-data.

backup

Proxmox Backup Server (PBS) is used as the centralized backup service for virtual machines in the Proxmox cluster.

In this infrastructure, PBS is deployed as a VM on pve2. It stores VM backups from the Proxmox cluster and provides backup management through the PBS web interface and Proxmox Datacenter integration.

Deployment Notes

We created a PBS VM on pve2 and selected /dev/nvme2n1 as its storage. First clean the disk, then attach it to the VM:

ls -l /dev/disk/by-id/ | grep nvme   # look up the disk ID
qm set 112 -scsi1 /dev/disk/by-id/nvme-ID

Next, log in to the VM and run lsblk; the disk should show up as /dev/sdb. Create a filesystem on it, mount it, and add it to fstab. Then create the datastore:

proxmox-backup-manager datastore create backupstore /mnt/datastore/backupstore
proxmox-backup-manager datastore list   # check

Then, under datacenter->storage, add it as a Proxmox Backup Server storage, and add the backup rule under backup.

PBS Overview

Item              Value
Service           Proxmox Backup Server
Hostname          backup
IP                172.16.127.112
VM ID             112
VM location       pve2
HA                Not enabled
Datastore name    backupstore
Datastore path    /mnt/datastore/backupstore
Backup schedule   Daily at 21:00
Retention policy  Keep backups from the last 7 days
Backup scope      All VMs

Design Decision

PBS is deployed as a VM on pve2.

Because PBS itself is the backup target, it is not stored on the Ceph RBD shared storage and is not managed by Proxmox HA. Instead, PBS uses a dedicated disk attached to the PBS VM as its datastore.

This design keeps backup storage separate from the Ceph shared VM storage layer.

However, PBS is a single point of failure in the current design. If pve2 is unavailable, PBS will also be unavailable and scheduled backup jobs cannot run until pve2 and the PBS VM are restored.

Storage Design

PBS uses a dedicated disk attached to the PBS VM. The disk is mounted inside the PBS system and used as the backup datastore.

pve2
 └── PBS VM: backup / 172.16.127.112
      └── dedicated disk
           └── /mnt/datastore/backupstore
                └── PBS datastore: backupstore

backup policy

Item           Value
Backup target  PBS datastore backupstore
Backup server  172.16.127.112
Schedule       Daily at 21:00
Scope          All VMs
Retention      Keep backups from the last 7 days
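
The 7-day retention can be pictured as a sliding time window over the daily 21:00 snapshots. The Python sketch below only illustrates that window; it is not how PBS prune is implemented, and the function name and sample timestamps are made up for the example.

```python
from datetime import datetime, timedelta


def snapshots_to_keep(snapshots: list[datetime], now: datetime,
                      days: int = 7) -> list[datetime]:
    """Return the snapshots taken within the last `days` days."""
    cutoff = now - timedelta(days=days)
    return [s for s in snapshots if s > cutoff]


# Hypothetical history: one snapshot per day at 21:00 for ten days.
history = [datetime(2026, 5, day, 21, 0) for day in range(1, 11)]
kept = snapshots_to_keep(history, now=datetime(2026, 5, 10, 22, 0))
print(len(kept), kept[0])
```

With one backup per day, a 7-day window keeps roughly seven snapshots per VM, which is a useful figure when estimating datastore usage.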

flow

All VMs on Proxmox Cluster
        │
        v
Daily backup job at 21:00
        │
        v
PBS VM on pve2
        │
        v
PBS datastore: backupstore
        │
        v
/mnt/datastore/backupstore

Restore Status

Restore has not been tested yet.

This is an important remaining task. A backup system should not be considered fully verified until at least one restore test has been completed successfully.

Recommended restore test:

  • Select a non-critical VM or create a temporary test VM.

  • Run a manual backup to PBS.

  • Restore the VM from PBS.

  • Boot the restored VM.

  • Verify network connectivity and service availability.

  • Document the restore result.

log collection

The logging system is built with Grafana Loki, Grafana, and Grafana Alloy.

Loki is used as the log storage backend, Grafana is used for querying and visualizing logs, and Alloy is deployed on Proxmox nodes as the log collection agent.

Deployment Notes

create an ubuntu VM

sudo mkdir -p /etc/apt/keyrings

wget -q -O - https://apt.grafana.com/gpg.key \
  | gpg --dearmor \
  | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null

echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" \
  | sudo tee /etc/apt/sources.list.d/grafana.list
  
sudo apt update
sudo apt install loki

sudo apt install grafana
sudo systemctl enable grafana-server
sudo systemctl start grafana-server

Open http://172.16.127.115:3000, log in as admin, and change the password. Then go to connection->add new connection and add Loki with the address 127.0.0.1:3100.

install alloy on pve

sudo mkdir -p /etc/apt/keyrings
sudo wget -O /etc/apt/keyrings/grafana.asc https://apt.grafana.com/gpg-full.key
sudo chmod 644 /etc/apt/keyrings/grafana.asc
echo "deb [signed-by=/etc/apt/keyrings/grafana.asc] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt-get update
sudo apt-get install alloy

  • /etc/alloy/config.alloy

logging {
  level  = "info"
  format = "logfmt"
}

loki.source.journal "proxmox_journal" {
  max_age       = "12h"
  relabel_rules = discovery.relabel.journal.rules
  forward_to    = [loki.write.default.receiver]

  labels = {
    job     = "systemd-journal",
    cluster = "proxmox-cluster",
  }
}

discovery.relabel "journal" {
  targets = []

  rule {
    source_labels = ["__journal__hostname"]
    target_label  = "host"
  }

  rule {
    source_labels = ["__journal__systemd_unit"]
    target_label  = "unit"
  }

  rule {
    source_labels = ["__journal__transport"]
    target_label  = "transport"
  }

  rule {
    source_labels = ["__journal__priority"]
    target_label  = "priority"
  }
}

loki.write "default" {
  endpoint {
    url = "http://172.16.127.115:3100/loki/api/v1/push"
  }
}

Test with:

logger "test from $(hostname) at $(date)"

On the logging server, we changed /etc/loki/config.yml to:

auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096
  log_level: debug
  grpc_server_max_concurrent_streams: 1000

common:
  instance_addr: 127.0.0.1
  path_prefix: /var/lib/loki
  storage:
    filesystem:
      chunks_directory: /var/lib/loki/chunks
      rules_directory: /var/lib/loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

query_range:
  results_cache:
    cache:
      embedded_cache:
        enabled: true
        max_size_mb: 100

limits_config:
  metric_aggregation_enabled: true
  enable_multi_variant_queries: true
  retention_period: 14d

compactor:
  working_directory: /var/lib/loki/compactor
  retention_enabled: true
  delete_request_store: filesystem

schema_config:
  configs:
    - from: 2020-10-24
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

pattern_ingester:
  enabled: true
  metric_aggregation:
    loki_address: localhost:3100

ruler:
  alertmanager_url: http://localhost:9093

frontend:
  encoding: protobuf


# By default, Loki will send anonymous, but uniquely-identifiable usage and configuration
# analytics to Grafana Labs. These statistics are sent to https://stats.grafana.org/
#
# Statistics help us better understand how Loki is used, and they show us performance
# levels for most users. This helps us prioritize features and documentation.
# For more information on what's sent, look at
# https://github.com/grafana/loki/blob/main/pkg/analytics/stats.go
# Refer to the buildReport method to see what goes into a report.
#
# If you would like to disable reporting, uncomment the following lines:
#analytics:
#  reporting_enabled: false

enable https

sudo mkdir -p /etc/nginx/ssl
sudo openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
  -keyout /etc/nginx/ssl/nasa3log.csie.org.key \
  -out /etc/nginx/ssl/nasa3log.csie.org.crt \
  -subj "/CN=nasa3log.csie.org"

  • /etc/nginx/sites-available/grafana

server {
    listen 80;
    server_name nasa3log.csie.org;

    return 301 https://$host$request_uri;
}

server {
    listen 443 ssl http2;
    server_name nasa3log.csie.org;

    ssl_certificate     /etc/nginx/ssl/nasa3log.csie.org.crt;
    ssl_certificate_key /etc/nginx/ssl/nasa3log.csie.org.key;

    location / {
        proxy_pass http://127.0.0.1:3000;

        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

        proxy_set_header X-Forwarded-Proto https;
        proxy_set_header X-Forwarded-Host $host;

        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}

Overview

Component      Host              IP              Role
Grafana        logging           172.16.127.115  Log visualization and dashboard UI
Loki           logging           172.16.127.115  Log storage backend
Grafana Alloy  pve1, pve2, pve3  Proxmox nodes   Collects systemd journal logs and forwards them to Loki

Purpose

The logging system is used to:

  • collect system logs from Proxmox nodes

  • centralize logs in Loki

  • query logs from Grafana

  • help debug Proxmox, Ceph, HA, and node-level issues

  • provide a foundation for future application log integration

Current Scope

Currently, Alloy is only installed on the three Proxmox nodes:

  • pve1

  • pve2

  • pve3

The current logging scope is limited to Proxmox node systemd journal logs.

Source                             Status
Proxmox node systemd journal logs  Enabled
Ceph-related logs through journal  Available if written to systemd journal
VM application logs                Not yet integrated
External application logs          Not yet integrated

Architecture

Proxmox Nodes
 pve1 / pve2 / pve3
        │
        │ systemd journal logs
        v
Grafana Alloy
        │
        │ push logs
        v
Loki
172.16.127.115:3100
        │
        │ query
        v
Grafana
172.16.127.115:3000

logging server

Item            Value
Hostname        logging
IP              172.16.127.115
OS              Ubuntu
Grafana port    3000
Loki port       3100
Loki retention  14 days

grafana

Item                       Status
Loki data source           Configured
Log query through Explore  Available
Log dashboard              Not finalized yet
Application log dashboard  Not implemented yet

Grafana Alloy

Grafana Alloy is installed on each Proxmox node.

Alloy reads logs from the systemd journal and forwards them to Loki. In this setup, Alloy collects logs from Proxmox nodes and attaches useful labels such as hostname, systemd unit, transport, and priority.

Alloy is not currently installed on project VMs or application servers.

label

Label      Source                   Meaning
job        static label             Log source type, currently systemd-journal
cluster    static label             Cluster name, currently proxmox-cluster
host       __journal__hostname      Hostname of the Proxmox node
unit       __journal__systemd_unit  systemd unit name
transport  __journal__transport     Journal transport type
priority   __journal__priority      Syslog priority level

Example Queries

In Grafana, open explore->loki.

  • Query all Proxmox journal logs: {job="systemd-journal"}

  • Search logs containing a keyword: {job="systemd-journal"} |= "error"

  • guide query
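
The same LogQL queries can also be sent to Loki's HTTP API outside of Grafana. The Python sketch below builds a request for the query_range endpoint and extracts the log lines from the response. It assumes Loki is reachable at 172.16.127.115:3100 without authentication, as configured in this document; the function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

LOKI_URL = "http://172.16.127.115:3100"


def build_query_url(logql: str, limit: int = 20) -> str:
    """Build a query_range URL for the given LogQL expression."""
    params = urllib.parse.urlencode({"query": logql, "limit": limit})
    return f"{LOKI_URL}/loki/api/v1/query_range?{params}"


def fetch_log_lines(logql: str, limit: int = 20) -> list[str]:
    """Run the query against Loki and return the matching log lines."""
    with urllib.request.urlopen(build_query_url(logql, limit), timeout=10) as resp:
        payload = json.load(resp)
    lines = []
    for stream in payload["data"]["result"]:
        for _timestamp, line in stream["values"]:
            lines.append(line)
    return lines


# Example (needs network access to the logging server):
# fetch_log_lines('{job="systemd-journal"} |= "error"')
print(build_query_url('{job="systemd-journal"} |= "error"'))
```

This can be handy for quick scripted checks when opening Grafana is not convenient.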

usage

Logs should preferably be written in JSON format, for example {"time":"2026-05-09 12:00:00","level":"INFO","logger":"app","message":"hello endpoint called"}. The level field may be INFO, WARNING, or ERROR.
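
For a Python application, one possible way to produce log lines in this shape is a small formatter for the standard logging module. This is a minimal sketch: the class name, the helper, and the log file path passed to it are assumptions and should be adapted to the application.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Format each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "time": self.formatTime(record, "%Y-%m-%d %H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(entry)


def make_logger(name: str, path: str) -> logging.Logger:
    """Create a logger that writes JSON lines to the file at `path`."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(path)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


# Example (the path is hypothetical):
# make_logger("app", "/var/log/myapp/app.log").info("hello endpoint called")
```

The file written this way is what the __path__ entry in the Alloy configuration would point to.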

sudo mkdir -p /etc/apt/keyrings

wget -q -O - https://apt.grafana.com/gpg.key \
  | gpg --dearmor \
  | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null

echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" \
  | sudo tee /etc/apt/sources.list.d/grafana.list
  
sudo apt update
sudo apt install alloy

Then configure Alloy in /etc/alloy/config.alloy:

logging {
  level  = "info"
  format = "logfmt"
}

loki.source.file "django_app" {
  targets = [
    {
      __path__ = "<path to log file>",
      job      = "<job name>",
      source   = "<source>",
      app      = "<application>",
      env      = "<env>",
      host     = "<host>",
    },
  ]

  forward_to = [loki.write.default.receiver]
}

loki.write "default" {
  endpoint {
    url = "http://172.16.127.115:3100/loki/api/v1/push"
  }
}

then

sudo systemctl enable alloy
sudo systemctl restart alloy
journalctl -u alloy -n 100 --no-pager #check that no error

troubleshooting

Errors may occur when Alloy does not have permission to read your log file. Make sure the log file has the right permissions.

Monitoring System: Prometheus + node-exporter + Grafana

The monitoring system is built with Prometheus, node-exporter, and Grafana.

Prometheus is used to collect metrics, node-exporter is installed on Proxmox nodes to expose host-level metrics, and Grafana is used to visualize the metrics through dashboards.

Deployment Notes

on ubuntu

sudo apt install -y prometheus

and change /etc/prometheus/prometheus.yml to

# Sample config for Prometheus.

global:
  scrape_interval:     15s
  evaluation_interval: 15s

  external_labels:
      monitor: 'example'

alerting:
  alertmanagers:
  - static_configs:
    - targets: ['localhost:9093']

rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

scrape_configs:
  - job_name: 'prometheus'
    scrape_interval: 5s
    scrape_timeout: 5s
    static_configs:
      - targets: ['localhost:9090']

  - job_name: node
    static_configs:
      - targets: ['localhost:9100']

  - job_name: 'proxmox-nodes'
    static_configs:
      - targets:
          - '172.16.127.16:9100'
          - '172.16.127.17:9100'
          - '172.16.127.18:9100'
  - job_name: 'ceph'
    static_configs:
      - targets:
          - '172.16.127.16:9283'
          - '172.16.127.17:9283'
          - '172.16.127.18:9283'

on pve

sudo apt install -y prometheus-node-exporter
ceph mgr module enable prometheus

Overview

Component      Host              IP              Role
Prometheus     logging           172.16.127.115  Metrics collection and storage
Grafana        logging           172.16.127.115  Metrics dashboard and visualization
node-exporter  pve1, pve2, pve3  Proxmox nodes   Exposes host metrics

Purpose

The monitoring system is used to:

  • monitor Proxmox node status

  • collect CPU, memory, disk, and network metrics

  • check whether Proxmox nodes are reachable

  • provide dashboards for infrastructure status

  • support future alerting and troubleshooting
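
Reachability of the Proxmox nodes can be read from Prometheus' built-in up metric, which is 1 for a target that was scraped successfully and 0 otherwise. The Python sketch below interprets the JSON returned by the instant query endpoint (/api/v1/query?query=up). The function name and the sample response are hypothetical; the sample only mimics the shape of a real response.

```python
def down_targets(query_response: dict) -> list[str]:
    """Return the instances whose `up` value is not 1."""
    down = []
    for sample in query_response["data"]["result"]:
        _timestamp, value = sample["value"]
        if value != "1":
            down.append(sample["metric"].get("instance", "<unknown>"))
    return down


# Hypothetical response with one node down.
SAMPLE_RESPONSE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"instance": "172.16.127.16:9100", "job": "proxmox-nodes"},
             "value": [1778300000, "1"]},
            {"metric": {"instance": "172.16.127.17:9100", "job": "proxmox-nodes"},
             "value": [1778300000, "0"]},
            {"metric": {"instance": "172.16.127.18:9100", "job": "proxmox-nodes"},
             "value": [1778300000, "1"]},
        ],
    },
}

print(down_targets(SAMPLE_RESPONSE))
```

The same check could later become an alert rule once alerting is configured.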

Current Scope

Currently, node-exporter is installed on the Proxmox nodes.

Source                Status
Proxmox node metrics  Enabled
VM metrics            Not yet integrated
Application metrics   Not yet integrated
Ceph metrics          Not fully integrated yet
Alerting              Not yet configured

Architecture

Proxmox Nodes
 pve1 / pve2 / pve3
        │
        │ expose metrics on :9100
        v
node-exporter
        │
        │ scraped by Prometheus
        v
Prometheus
172.16.127.115:9090
        │
        │ queried by Grafana
        v
Grafana
172.16.127.115:3000

monitoring server

Item                Value
Hostname            logging
IP                  172.16.127.115
Prometheus port     9090
Grafana port        3000
node-exporter port  9100

check it in Grafana

dashboard->system info

Remote Backup

We currently do not have remote storage, so these settings will be applied as soon as we get one (planned: Google Drive). The configuration below was tested on my machine.

architecture

  • Use the existing Proxmox Backup Server and rclone to sync the encrypted backup files to the remote once a day.

  • rclone handles encryption and decryption, and retries automatically if an upload fails.

  • Schedule: prune 1:00, gc(garbage collection) 2:00, rclone 3:00 every day

configuration

google api

To be added.
Obtain:
client ID
client password
The application needs to be published to keep the token from expiring.

pbs

curl https://rclone.org/install.sh | bash
rclone config
# create google drive remote
# at the (verify) prompt answer N, then verify on another machine that has a browser

# create an encrypted remote based on 1
# back up the rclone config file!!!

timer

crontab -e

0 3 * * * /usr/bin/rclone sync /mnt/datastore/backupstore gdrive:pbs_backup --config ~/.config/rclone/rclone.conf --fast-list --checkers 64 --transfers 16 --delete-after --retries 3 --retries-sleep 10s >> /var/log/rclone-cron.log 2>&1

  • Currently we use --delete-after. Be aware that the remote may run out of space, because old files are deleted only after the new ones have been uploaded.

  • --checkers 64 --transfers 16 does not matter much based on local testing

Known issue

  • Be aware of the token expiry problem.