Hardware Infrastructure Document

System Architecture

Our infrastructure is built on a three-node Proxmox cluster. The cluster provides the main virtualization platform for project services and internal virtual machines. In addition to the Proxmox cluster, we deploy dedicated infrastructure services for shared storage, backup, logging, and monitoring.

Core Components

Component	Hostname	IP	Role
Proxmox Node 1	pve1	172.16.127.16	Proxmox cluster node
Proxmox Node 2	pve2	172.16.127.17	Proxmox cluster node; hosts the PBS VM
Proxmox Node 3	pve3	172.16.127.18	Proxmox cluster node
Proxmox Backup Server	backup	172.16.127.112	Backup service for Proxmox VMs
Logging / Monitoring Server	logging	172.16.127.115	Grafana, Loki, and Prometheus

High-Level Design

The infrastructure is divided into four main layers:

Virtualization Layer

The virtualization layer is provided by a three-node Proxmox cluster consisting of pve1, pve2, and pve3. These nodes are joined into the same Proxmox cluster and are managed under the same Proxmox datacenter view.

Shared Storage Layer

Ceph is enabled on the Proxmox cluster. An RBD storage pool is created and used as shared storage for VM disks, so VM storage is shared across the Proxmox nodes.

Backup Layer

Proxmox Backup Server is deployed as a VM on pve2. The PBS VM is used to store and manage VM backups from the Proxmox cluster.

Currently, the PBS VM is fixed on pve2 and is not configured with High Availability. If pve2 goes down, the backup service will not automatically migrate to another node.

Logging and Monitoring Layer

Grafana, Loki, and Prometheus are deployed on the logging server.
- Loki is used to store logs.
- Grafana is used to visualize logs and metrics.
- Prometheus is used to collect metrics.
- Grafana Alloy is installed on Proxmox nodes to collect systemd journal logs and forward them to Loki.
- node-exporter is installed on Proxmox nodes to expose host metrics to Prometheus.

Architecture Diagram

                           ┌─────────────────────────────┐
                           │        Proxmox Cluster      │
                           │                             │
                           │  ┌──────┐ ┌──────┐ ┌──────┐ │
                           │  │ pve1 │ │ pve2 │ │ pve3 │ │
                           │  └──────┘ └──────┘ └──────┘ │
                           └──────────────┬──────────────┘
                                          │
                    ┌─────────────────────┼─────────────────────┐
                    │                     │                     │
                    v                     v                     v
          ┌─────────────────┐   ┌─────────────────┐   ┌─────────────────────┐
          │   Ceph RBD      │   │ Proxmox Backup  │   │ Logging / Monitoring│
          │ Shared Storage  │   │ Server VM       │   │ Server              │
          │                 │   │                 │   │                     │
          │ VM disk storage │   │ runs on pve2    │   │ Grafana             │
          └─────────────────┘   │ no HA enabled   │   │ Loki                │
                                └─────────────────┘   │ Prometheus          │
                                                      └─────────────────────┘

Known issue

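Some machines cannot reboot unattended and have to be brought up through the BIOS. We suspect that leftovers from the previous administrators were not fully cleaned up. (As a result, if the SC team reboots a Proxmox node, someone may need to go to the server room to power it on.)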

Network Design

This infrastructure mainly uses an internal private network for Proxmox nodes, physical machines, and virtual machines. Public IP addresses are only used where external access is required.

Internal Network

The internal network uses the following subnet:

Item	Value
Internal subnet	`172.16.0.0/16`
Gateway	`172.16.0.1`
Project usable range	`172.16.127.0` - `172.16.127.200`

The project mainly uses the 172.16.127.0/24 range for infrastructure machines, Proxmox nodes, and virtual machines. IP addresses after 172.16.127.200 are reserved for the SC machines.
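Before assigning a new static address, it can help to check that it falls inside the project range listed in the table above. The following is a minimal shell sketch under that assumption; the function name in_project_range is hypothetical and not part of any existing tooling.

```shell
# Succeeds if the address is inside 172.16.127.0 - 172.16.127.200
# (the project usable range from the table above).
in_project_range() {
  case "$1" in
    172.16.127.*) last=${1#172.16.127.} ;;
    *) return 1 ;;
  esac
  # Reject an empty or non-numeric last octet.
  case "$last" in
    ''|*[!0-9]*) return 1 ;;
  esac
  [ "$last" -le 200 ]
}

# Example: only report an address as usable when it is in range.
in_project_range 172.16.127.120 && echo "172.16.127.120 is in the project range"
```

This only checks the range. It does not check whether the address is already taken, so the IP tables in this document still need to be consulted.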

IP Allocation

Range	Usage
`172.16.127.11` - `172.16.127.15`	SC physical machines
`172.16.127.16` - `172.16.127.18`	Proxmox cluster nodes
`172.16.127.101` and above	Project VMs and infrastructure services
`172.16.127.112`	Proxmox Backup Server
`172.16.127.115`	Logging / monitoring server

Important Infrastructure IPs

Name	Hostname	IP	Usage
SC1	`sc1`	`172.16.127.11`	SC physical machine
SC2	`sc2`	`172.16.127.12`	SC physical machine
SC3	`sc3`	`172.16.127.13`	SC physical machine
SC4	`sc4`	`172.16.127.14`	SC physical machine
SC5	`sc5`	`172.16.127.15`	SC physical machine
RM1	`pve1`	`172.16.127.16`	Proxmox node
RM2	`pve2`	`172.16.127.17`	Proxmox node
RM3	`pve3`	`172.16.127.18`	Proxmox node
PBS	`backup`	`172.16.127.112`	Backup server
Logging	`logging`	`172.16.127.115`	Grafana, Loki, Prometheus

Public Network

The public IP range is available through VLAN tagging.

Item	Value
Public IP range	`140.112.187.48-55/27`
Public gateway	`140.112.187.62`
VLAN tag	`187`

Currently, public IP addresses are only used by the firewall. Most internal services and VMs use the private 172.16.0.0/16 network.

how to configure pve’s IP

ip a
vi /etc/network/interfaces
systemctl restart networking
ip link set up ...

/etc/network/interfaces

iface nic1 inet manual

auto vmbr0
iface vmbr0 inet static
    address 172.16.127.12/16
    gateway 172.16.0.1
    bridge-ports nic1
    bridge-stp off
    bridge-fd 0

changing hostname

vi /etc/hosts ## sc1.nasa sc1
vi /etc/hostname ## sc1
reboot

VM Network Design

All VMs are connected through the Proxmox Linux bridge vmbr0.

VM network interface
        │
        v
Proxmox vmbr0
        │
        v
Physical network interface
        │
        v
Internal network: 172.16.0.0/16

how to configure a VM static IP

A typical VM network configuration uses a static internal IP address, the internal gateway, and public DNS resolvers.

ubuntu

modify /etc/netplan/*.yaml

network:
  version: 2
  ethernets:
    ens18:
      dhcp4: false
      addresses:
        - 172.16.127.101/16
      routes:
        - to: default
          via: 172.16.0.1
      nameservers:
        addresses:
          - 8.8.8.8
          - 1.1.1.1

sudo netplan apply

arch linux

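sudo nmcli connection modify "Wired connection 1" \
  ipv4.addresses 172.16.127.105/16 \
  ipv4.gateway 172.16.0.1 \
  ipv4.dns "8.8.8.8 1.1.1.1" \
  ipv4.method manual
# re-activate the connection so the new settings take effect
sudo nmcli connection up "Wired connection 1"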

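windows (and installation guide)

Install using the Windows ISO and the driver ISO.
When creating the VM, remember to install the network driver. If you skipped it, you can install it after booting into Windows.
To skip the connect-to-network step: press Shift+F10 to open cmd, then run OOBE\BYPASSNRO; the machine will restart.

Then go to Settings, Ethernet, IP assignment, select Manual, and turn on IPv4.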

IP: 172.16.127.106
subnet mask: 255.255.0.0
gateway: 172.16.0.1
DNS: 8.8.8.8

VMs network and service

IP	hostname	username	os
172.16.127.101	room1-nasa3	room1	ubuntu
172.16.127.102	mail1-nasa3	mail1	ubuntu
172.16.127.103	printu1-nasa3	printu1	ubuntu
172.16.127.104	database1-nasa3	database1	ubuntu
172.16.127.105	iden1-nasa3	iden1	arch
172.16.127.106	printw1-nasa3	printw1	windows
172.16.127.107	room2-nasa3	room2	ubuntu
172.16.127.108	room3-nasa3	room3	ubuntu
172.16.127.109	iden2-nasa3	iden2	arch
172.16.127.110	wifi1-nasa3	wifi1	ubuntu
172.16.127.112	backup	root	proxmox-backup-server
172.16.127.113	uiux1	uiux1	ubuntu
172.16.127.114	uiux2	uiux2	ubuntu
172.16.127.115	log	logging	ubuntu
172.16.127.116	mail2	mail2	ubuntu
172.16.127.117	mail3	mail3	ubuntu

Known issue

USB NIC instability: because the BIOS is an old version, some SC machines (sc1, sc2, sc4) use USB network adapters. One known issue is that the USB NIC on sc1 (172.16.127.11) may disconnect on its own shortly after being connected.

To mitigate this issue, USB autosuspend should be disabled through GRUB.

Edit /etc/default/grub and add usbcore.autosuspend=-1 to the kernel command line, e.g.

GRUB_CMDLINE_LINUX_DEFAULT="quiet usbcore.autosuspend=-1"

Then update GRUB (the change takes effect after the next reboot):

sudo update-grub
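
After the reboot, it is worth confirming that the kernel actually booted with the parameter. The following is a small sketch; the helper name autosuspend_disabled is hypothetical, while /proc/cmdline and /sys/module/usbcore/parameters/autosuspend are standard Linux paths.

```shell
# Succeeds if the given kernel command line disables USB autosuspend.
autosuspend_disabled() {
  case " $1 " in
    *" usbcore.autosuspend=-1 "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage on the affected machine (after rebooting):
#   autosuspend_disabled "$(cat /proc/cmdline)" && echo "USB autosuspend is disabled"
#   cat /sys/module/usbcore/parameters/autosuspend   # current value seen by the kernel
```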

Some machines have stale (fake) enx... interfaces; we have not removed these fake interfaces.

Proxmox Cluster Setup

The virtualization platform is built as a three-node Proxmox cluster. The cluster was initially created on pve1, and pve2 and pve3 were later joined into the same cluster.

The cluster allows all Proxmox nodes to be managed under the same datacenter view and provides the base infrastructure for VM management, shared storage, backup, and monitoring.

Cluster Overview

Node	Hostname	IP	Role
RM1	`pve1`	`172.16.127.16`	Proxmox cluster node; initial cluster creation node
RM2	`pve2`	`172.16.127.17`	Proxmox cluster node
RM3	`pve3`	`172.16.127.18`	Proxmox cluster node

All three nodes are connected through the internal network 172.16.0.0/16.
The Proxmox cluster communication uses the network interface configured on the Proxmox nodes.

Cluster Creation

The cluster was created from pve1.

After the cluster was created, the other two Proxmox nodes were joined into the cluster:

pve1 creates the cluster
pve2 joins the cluster
pve3 joins the cluster

commands

pvecm create rm-nasa        # run on pve1
pvecm join 172.16.127.16    # run on pve2 and on pve3

Cluster Purpose

The Proxmox cluster is used to:

manage all Proxmox nodes from a single datacenter view
host project virtual machines
provide the base environment for Ceph shared storage
integrate with Proxmox Backup Server
provide the foundation for HA and VM migration

HA and VM migration depend on shared storage and are documented in the shared storage section.

Note

USB network adapters should not be used for Proxmox cluster communication because they may be unstable. In this setup, USB NIC issues are limited to SC machines and are documented in the Network Design section.

You can use pvecm status to verify the cluster's health.
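
As a rough illustration, the quorum line of the pvecm status output can be checked in a script. This sketch assumes the output contains a line of the form "Quorate: Yes"; the helper name is_quorate is hypothetical.

```shell
# Reads `pvecm status` output on stdin; succeeds if the cluster reports quorum.
is_quorate() {
  grep -Eq '^Quorate:[[:space:]]+Yes'
}

# Usage on any Proxmox node:
#   pvecm status | is_quorate && echo "cluster is quorate"
```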

Known Issue

Cluster Reinstallation

During earlier setup, the machine using 172.16.127.17 was replaced. At that time, old cluster information could not be fully removed from pve1 and pve3, so all three Proxmox nodes were reinstalled and the cluster was rebuilt from scratch.

This issue is important for future maintenance:

avoid changing cluster node identity casually
keep hostname and IP address consistent
remove a node cleanly before replacing it
verify cluster state before rejoining a replaced node
if old cluster metadata cannot be removed cleanly, reinstallation may be required

Shared storage and HA

Ceph is used as the shared storage backend for the Proxmox cluster.
All three Proxmox nodes participate in the Ceph cluster, and each Proxmox node provides one Ceph OSD.

A Ceph RBD pool named vm-data is used as shared VM disk storage.

working history

setup flow

Enable Proxmox no-subscription repository
Install Ceph on all Proxmox nodes
Create Ceph monitors
Prepare one OSD disk on each Proxmox node
Create OSDs
Create the Ceph pool vm-data
Set pool size=3 and min_size=2
Add vm-data to Proxmox as RBD storage
Move VM disks to vm-data when HA/migration is required
Enable HA for VMs stored on vm-data
Test VM migration and HA failover

install ceph

Before installing Ceph, the Proxmox no-subscription repositories should be enabled.

Example repository configuration:

mv /etc/apt/sources.list.d/pve-enterprise.sources /etc/apt/sources.list.d/pve-enterprise.sources.bak
mv /etc/apt/sources.list.d/ceph.sources /etc/apt/sources.list.d/ceph.sources.bak
cat >/etc/apt/sources.list.d/proxmox.sources <<'EOF'
Types: deb
URIs: http://download.proxmox.com/debian/pve
Suites: trixie
Components: pve-no-subscription

Types: deb
URIs: http://download.proxmox.com/debian/ceph-squid
Suites: trixie
Components: no-subscription
EOF
apt update
apt upgrade

When installing Ceph from the Proxmox web UI, choose the no-subscription repository. Then create the monitors in the web UI, and then create the OSDs.

install OSD

Before creating an OSD, the selected disk should be cleaned.

wipefs -a /dev/nvme1n1
sgdisk --zap-all /dev/nvme1n1
partprobe /dev/nvme1n1
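
Because the commands above are destructive, a quick pre-check that nothing on the disk is mounted may be useful. This is a sketch only: the helper name nothing_mounted is hypothetical, and it does not detect a disk that is in use by LVM or by an existing OSD without a mount point.

```shell
# Reads `lsblk -n -o MOUNTPOINT <disk>` output on stdin.
# Succeeds only if every line is empty, i.e. no partition is mounted.
nothing_mounted() {
  ! grep -q '[^[:space:]]'
}

# Usage, before running wipefs:
#   lsblk -n -o MOUNTPOINT /dev/nvme1n1 | nothing_mounted && echo "no mounted filesystems"
```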

In the web UI, go to Ceph -> OSD -> Create OSD. Then go to Datacenter -> Pool -> Create Pool to create a pool, and check that it appears under Datacenter -> Storage.

Purpose

The shared storage layer is used to:

provide shared VM disk storage across pve1, pve2, and pve3
allow VMs to be migrated between Proxmox nodes
support Proxmox HA for selected VMs
avoid tying VM disks to only one physical host

Current Ceph Configuration

Item	Value
Ceph nodes	`pve1`, `pve2`, `pve3`
OSD layout	one OSD per Proxmox node
RBD pool name	`vm-data`
Pool size	`3`
Pool min_size	`2`
Usage	shared VM disk storage

The pool uses size=3, which means Ceph keeps three replicated copies of data.
The min_size=2 setting means the pool can continue operating as long as at least two replicas are available.
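
The arithmetic behind these two settings can be sketched as follows. With one OSD per node, the result is also the number of nodes that can be down while the pool keeps serving I/O. The helper name replicas_can_lose is hypothetical; the ceph commands in the comments are the standard way to read the pool settings.

```shell
# How many replicas may be unavailable while the pool still accepts I/O.
# $1 = size, $2 = min_size
replicas_can_lose() {
  echo $(( $1 - $2 ))
}

replicas_can_lose 3 2   # prints 1

# The live values can be read on any Ceph node with:
#   ceph osd pool get vm-data size
#   ceph osd pool get vm-data min_size
```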

VM Storage Policy

Most VM disks are stored on the Ceph RBD pool vm-data.

Exceptions:

VM / Service	Storage Policy	Reason
Proxmox Backup Server	Not stored on RBD	PBS is fixed on `pve2` and uses a dedicated disk/datastore
Firewall	Not stored on RBD	Firewall storage is managed separately
Other VMs	Stored on RBD	Supports migration and HA

HA Policy

HA is enabled for VMs whose disks are stored on shared RBD storage.

In this setup:

VMs stored on vm-data can be managed by Proxmox HA.
PBS is not HA-enabled because it is fixed on pve2.
Firewall is not included in this RBD-based HA setup.
HA behavior depends on the VM disk being stored on shared storage.

Before enabling HA for a VM, verify that its disks are stored on vm-data instead of local storage.
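
One way to perform this check from the command line is to inspect the output of qm config. The following is a minimal sketch, assuming disk lines of the form "scsi0: vm-data:vm-101-disk-0,size=32G"; the helper name all_disks_on and the VM ID 101 are only examples.

```shell
# Reads `qm config <vmid>` output on stdin and succeeds only if every disk
# line refers to the given storage. CD-ROM drives (media=cdrom) are ignored.
all_disks_on() {
  awk -v s="$1" '
    /^(scsi|virtio|sata|ide|efidisk|tpmstate)[0-9]+:/ && $0 !~ /media=cdrom/ {
      split($2, part, ":")        # $2 looks like "vm-data:vm-101-disk-0,size=32G"
      if (part[1] != s) bad = 1
    }
    END { exit bad + 0 }
  '
}

# Usage on a Proxmox node (VM ID 101 is an example):
#   qm config 101 | all_disks_on vm-data && ha-manager add vm:101
```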

Ceph Components

Component	Description
Ceph MON	Maintains Ceph cluster state
Ceph OSD	Stores actual data on physical disks
Ceph Pool	Logical storage pool created on top of OSDs
RBD	Block storage interface used by Proxmox for VM disks

Failure and HA Test

We tested the HA behavior under failure conditions.

Tested scenarios:

One Proxmox node lost network connectivity.
One Proxmox node was unexpectedly shut down.
One disk went down (to be done).

Result:

Proxmox HA was able to detect the failed node.
HA-managed VMs were restarted on another available Proxmox node.
VMs using the shared RBD storage were able to continue operating after failover.
The test confirmed that the Ceph RBD shared storage and Proxmox HA setup work correctly for the tested failure cases.

This confirms that the current shared storage design supports HA failover for VMs stored on vm-data.

backup

Proxmox Backup Server (PBS) is used as the centralized backup service for virtual machines in the Proxmox cluster.

In this infrastructure, PBS is deployed as a VM on pve2. It stores VM backups from the Proxmox cluster and provides backup management through the PBS web interface and Proxmox Datacenter integration.

Deployment Notes

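We created the PBS VM on pve2 and selected /dev/nvme2n1 as its storage. First clean the disk, then attach it to the VM:

ls -l /dev/disk/by-id/ | grep nvme   # find the disk ID
qm set 112 -scsi1 /dev/disk/by-id/nvme-ID

Next, log in to the VM and run lsblk; the disk should appear as /dev/sdb. Create a filesystem on it, mount it, and add the mount to /etc/fstab. Then create the datastore:

proxmox-backup-manager datastore create backupstore /mnt/datastore/backupstore
proxmox-backup-manager datastore list   # check that the datastore exists

Then, under Datacenter -> Storage, add it as a Proxmox Backup Server storage, and add the backup rule under Backup.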

PBS Overview

Item	Value
Service	Proxmox Backup Server
Hostname	`backup`
IP	`172.16.127.112`
VM ID	`112`
VM location	`pve2`
HA	Not enabled
Datastore name	`backupstore`
Datastore path	`/mnt/datastore/backupstore`
Backup schedule	Daily at 21:00
Retention policy	Keep backups from the last 7 days
Backup scope	All VMs

Design Decision

PBS is deployed as a VM on pve2.

Because PBS itself is the backup target, it is not stored on the Ceph RBD shared storage and is not managed by Proxmox HA. Instead, PBS uses a dedicated disk attached to the PBS VM as its datastore.

This design keeps backup storage separate from the Ceph shared VM storage layer.

However, PBS is a single point of failure in the current design. If pve2 is unavailable, PBS will also be unavailable and scheduled backup jobs cannot run until pve2 and the PBS VM are restored.

Storage Design

PBS uses a dedicated disk attached to the PBS VM. The disk is mounted inside the PBS system and used as the backup datastore.

pve2
 └── PBS VM: backup / 172.16.127.112
      └── dedicated disk
           └── /mnt/datastore/backupstore
                └── PBS datastore: backupstore

backup policy

Item	Value
Backup target	PBS datastore `backupstore`
Backup server	`172.16.127.112`
Schedule	Daily at 21:00
Scope	All VMs
Retention	Keep backups from the last 7 days

flow

All VMs on Proxmox Cluster
        │
        v
Daily backup job at 21:00
        │
        v
PBS VM on pve2
        │
        v
PBS datastore: backupstore
        │
        v
/mnt/datastore/backupstore

Restore Status

Restore has not been tested yet.

This is an important remaining task. A backup system should not be considered fully verified until at least one restore test has been completed successfully.

Recommended restore test:

Select a non-critical VM or create a temporary test VM.
Run a manual backup to PBS.
Restore the VM from PBS.
Boot the restored VM.
Verify network connectivity and service availability.
Document the restore result.

log collection

The logging system is built with Grafana Loki, Grafana, and Grafana Alloy.

Loki is used as the log storage backend, Grafana is used for querying and visualizing logs, and Alloy is deployed on Proxmox nodes as the log collection agent.

Deployment Notes

create an ubuntu VM

sudo mkdir -p /etc/apt/keyrings

wget -q -O - https://apt.grafana.com/gpg.key \
  | gpg --dearmor \
  | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null

echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" \
  | sudo tee /etc/apt/sources.list.d/grafana.list
  
sudo apt update
sudo apt install loki

sudo apt install grafana
sudo systemctl enable grafana-server
sudo systemctl start grafana-server

Connect to http://172.16.127.115:3000, log in as admin, and change the password. Then go to Connections -> Add new connection and add Loki with the address 127.0.0.1:3100.

install alloy on pve

sudo mkdir -p /etc/apt/keyrings
sudo wget -O /etc/apt/keyrings/grafana.asc https://apt.grafana.com/gpg-full.key
sudo chmod 644 /etc/apt/keyrings/grafana.asc
echo "deb [signed-by=/etc/apt/keyrings/grafana.asc] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt-get update
sudo apt-get install alloy

/etc/alloy/config.alloy

logging {
  level  = "info"
  format = "logfmt"
}

loki.source.journal "proxmox_journal" {
  max_age       = "12h"
  relabel_rules = discovery.relabel.journal.rules
  forward_to    = [loki.write.default.receiver]

  labels = {
    job     = "systemd-journal",
    cluster = "proxmox-cluster",
  }
}

discovery.relabel "journal" {
  targets = []

  rule {
    source_labels = ["__journal__hostname"]
    target_label  = "host"
  }

  rule {
    source_labels = ["__journal__systemd_unit"]
    target_label  = "unit"
  }

  rule {
    source_labels = ["__journal__transport"]
    target_label  = "transport"
  }

  rule {
    source_labels = ["__journal__priority"]
    target_label  = "priority"
  }
}

loki.write "default" {
  endpoint {
    url = "http://172.16.127.115:3100/loki/api/v1/push"
  }
}

Test with: logger "test from $(hostname) at $(date)". We also changed /etc/loki/config.yml to:

auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096
  log_level: debug
  grpc_server_max_concurrent_streams: 1000

common:
  instance_addr: 127.0.0.1
  path_prefix: /var/lib/loki
  storage:
    filesystem:
      chunks_directory: /var/lib/loki/chunks
      rules_directory: /var/lib/loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

query_range:
  results_cache:
    cache:
      embedded_cache:
        enabled: true
        max_size_mb: 100

limits_config:
  metric_aggregation_enabled: true
  enable_multi_variant_queries: true
  retention_period: 14d

compactor:
  working_directory: /var/lib/loki/compactor
  retention_enabled: true
  delete_request_store: filesystem

schema_config:
  configs:
    - from: 2020-10-24
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

pattern_ingester:
  enabled: true
  metric_aggregation:
    loki_address: localhost:3100

ruler:
  alertmanager_url: http://localhost:9093

frontend:
  encoding: protobuf


# By default, Loki will send anonymous, but uniquely-identifiable usage and configuration
# analytics to Grafana Labs. These statistics are sent to https://stats.grafana.org/
#
# Statistics help us better understand how Loki is used, and they show us performance
# levels for most users. This helps us prioritize features and documentation.
# For more information on what's sent, look at
# https://github.com/grafana/loki/blob/main/pkg/analytics/stats.go
# Refer to the buildReport method to see what goes into a report.
#
# If you would like to disable reporting, uncomment the following lines:
#analytics:
#  reporting_enabled: false

#enable https
sudo mkdir -p /etc/nginx/ssl
sudo openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
  -keyout /etc/nginx/ssl/nasa3log.csie.org.key \
  -out /etc/nginx/ssl/nasa3log.csie.org.crt \
  -subj "/CN=nasa3log.csie.org"

#/etc/nginx/sites-available/grafana
server {
    listen 80;
    server_name nasa3log.csie.org;

    return 301 https://$host$request_uri;
}

server {
    listen 443 ssl http2;
    server_name nasa3log.csie.org;

    ssl_certificate     /etc/nginx/ssl/nasa3log.csie.org.crt;
    ssl_certificate_key /etc/nginx/ssl/nasa3log.csie.org.key;

    location / {
        proxy_pass http://127.0.0.1:3000;

        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

        proxy_set_header X-Forwarded-Proto https;
        proxy_set_header X-Forwarded-Host $host;

        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}

Overview

Component	Host	IP	Role
Grafana	`logging`	`172.16.127.115`	Log visualization and dashboard UI
Loki	`logging`	`172.16.127.115`	Log storage backend
Grafana Alloy	`pve1`, `pve2`, `pve3`	Proxmox nodes	Collects systemd journal logs and forwards them to Loki

Purpose

The logging system is used to:

collect system logs from Proxmox nodes
centralize logs in Loki
query logs from Grafana
help debug Proxmox, Ceph, HA, and node-level issues
provide a foundation for future application log integration

Current Scope

Currently, Alloy is only installed on the three Proxmox nodes:

pve1
pve2
pve3

The current logging scope is limited to Proxmox node systemd journal logs.

Source	Status
Proxmox node systemd journal logs	Enabled
Ceph-related logs through journal	Available if written to systemd journal
VM application logs	Not yet integrated
External application logs	Not yet integrated

Architecture

Proxmox Nodes
 pve1 / pve2 / pve3
        │
        │ systemd journal logs
        v
Grafana Alloy
        │
        │ push logs
        v
Loki
172.16.127.115:3100
        │
        │ query
        v
Grafana
172.16.127.115:3000

logging server

Item	Value
Hostname	`logging`
IP	`172.16.127.115`
OS	Ubuntu
Grafana port	`3000`
Loki port	`3100`
Loki retention	`14 days`

grafana

Item	Status
Loki data source	Configured
Log query through Explore	Available
Log dashboard	Not finalized yet
Application log dashboard	Not implemented yet

Grafana Alloy

Grafana Alloy is installed on each Proxmox node.

Alloy reads logs from the systemd journal and forwards them to Loki. In this setup, Alloy collects logs from Proxmox nodes and attaches useful labels such as hostname, systemd unit, transport, and priority.

Alloy is not currently installed on project VMs or application servers.

label

Label	Source	Meaning
`job`	static label	Log source type, currently `systemd-journal`
`cluster`	static label	Cluster name, currently `proxmox-cluster`
`host`	`__journal__hostname`	Hostname of the Proxmox node
`unit`	`__journal__systemd_unit`	systemd unit name
`transport`	`__journal__transport`	Journal transport type
`priority`	`__journal__priority`	Syslog priority level

Example Queries

explore->loki

Query all Proxmox journal logs: {job="systemd-journal"}
Search logs containing a keyword: {job="systemd-journal"} |= "error"
guide query
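
The same queries can be built and sent from a shell, which is handy for quick checks without opening Grafana. This is a sketch under the label scheme described in this document (job and host); the helper name journal_query is hypothetical, and the curl call uses Loki's query_range HTTP endpoint.

```shell
# Builds a LogQL selector for one Proxmox node, optionally filtered by a keyword.
journal_query() {
  host=$1
  keyword=${2:-}
  query="{job=\"systemd-journal\", host=\"$host\"}"
  if [ -n "$keyword" ]; then
    query="$query |= \"$keyword\""
  fi
  printf '%s\n' "$query"
}

# Usage from a machine that can reach the logging server:
#   curl -G -s "http://172.16.127.115:3100/loki/api/v1/query_range" \
#     --data-urlencode "query=$(journal_query pve1 error)" \
#     --data-urlencode "limit=20"
```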

usage

Logs should preferably be written in JSON format, for example {"time":"2026-05-09 12:00:00","level":"INFO","logger":"app","message":"hello endpoint called"}. The level field may be INFO, WARNING, or ERROR.

sudo mkdir -p /etc/apt/keyrings

wget -q -O - https://apt.grafana.com/gpg.key \
  | gpg --dearmor \
  | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null

echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" \
  | sudo tee /etc/apt/sources.list.d/grafana.list
  
sudo apt update
sudo apt install alloy

Then configure Alloy in /etc/alloy/config.alloy:

logging {
  level  = "info"
  format = "logfmt"
}

loki.source.file "django_app" {
  targets = [
    {
      __path__ = "<path to log file>",
      job      = "<job name>",
      source   = "<source>",
      app      = "<application>",
      env      = "<env>",
      host     = "<host>",
    },
  ]

  forward_to = [loki.write.default.receiver]
}

loki.write "default" {
  endpoint {
    url = "http://172.16.127.115:3100/loki/api/v1/push"
  }
}

then

sudo systemctl enable alloy
sudo systemctl restart alloy
journalctl -u alloy -n 100 --no-pager   # check that there are no errors

troubleshooting

Errors may occur when Alloy does not have permission to read your log file. Make sure the log file has the right permissions.

Monitoring System: Prometheus + node-exporter + Grafana

The monitoring system is built with Prometheus, node-exporter, and Grafana.

Prometheus is used to collect metrics, node-exporter is installed on Proxmox nodes to expose host-level metrics, and Grafana is used to visualize the metrics through dashboards.

Deployment Notes

on ubuntu

sudo apt install -y prometheus

and change /etc/prometheus/prometheus.yml to

# Sample config for Prometheus.

global:
  scrape_interval:     15s
  evaluation_interval: 15s

  external_labels:
      monitor: 'example'

alerting:
  alertmanagers:
  - static_configs:
    - targets: ['localhost:9093']

rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

scrape_configs:
  - job_name: 'prometheus'
    scrape_interval: 5s
    scrape_timeout: 5s
    static_configs:
      - targets: ['localhost:9090']

  - job_name: node
    static_configs:
      - targets: ['localhost:9100']

  - job_name: 'proxmox-nodes'
    static_configs:
      - targets:
          - '172.16.127.16:9100'
          - '172.16.127.17:9100'
          - '172.16.127.18:9100'
  - job_name: 'ceph'
    static_configs:
      - targets:
          - '172.16.127.16:9283'
          - '172.16.127.17:9283'
          - '172.16.127.18:9283'

on pve

sudo apt install -y prometheus-node-exporter
ceph mgr module enable prometheus
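
Once node-exporter is installed, a quick reachability check from the logging server can confirm that Prometheus will be able to scrape each node. The following is a small sketch using the node IPs and port from this document; the function name check_exporters is hypothetical.

```shell
# Prints one line per Proxmox node saying whether node-exporter answers on :9100.
check_exporters() {
  for ip in 172.16.127.16 172.16.127.17 172.16.127.18; do
    if curl -fsS --max-time 3 "http://$ip:9100/metrics" >/dev/null 2>&1; then
      echo "$ip up"
    else
      echo "$ip DOWN"
    fi
  done
}

# Usage on the logging server:
#   check_exporters
```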

Overview

Component	Host	IP	Role
Prometheus	`logging`	`172.16.127.115`	Metrics collection and storage
Grafana	`logging`	`172.16.127.115`	Metrics dashboard and visualization
node-exporter	`pve1`, `pve2`, `pve3`	Proxmox nodes	Exposes host metrics

Purpose

The monitoring system is used to:

monitor Proxmox node status
collect CPU, memory, disk, and network metrics
check whether Proxmox nodes are reachable
provide dashboards for infrastructure status
support future alerting and troubleshooting

Current Scope

Currently, node-exporter is installed on the Proxmox nodes.

Source	Status
Proxmox node metrics	Enabled
VM metrics	Not yet integrated
Application metrics	Not yet integrated
Ceph metrics	Not fully integrated yet
Alerting	Not yet configured

Architecture

Proxmox Nodes
 pve1 / pve2 / pve3
        │
        │ expose metrics on :9100
        v
node-exporter
        │
        │ scraped by Prometheus
        v
Prometheus
172.16.127.115:9090
        │
        │ queried by Grafana
        v
Grafana
172.16.127.115:3000

monitoring server

Item	Value
Hostname	`logging`
IP	`172.16.127.115`
Prometheus port	`9090`
Grafana port	`3000`
node-exporter port	`9100`

check it in Grafana

Dashboards -> system info

Remote Backup

We currently do not have remote storage, so these settings will be implemented as soon as we get one (Google Drive is planned). The configuration below was tested on my machine.

architecture

Use the existing Proxmox Backup Server and rclone to sync the encrypted backup files to the remote once every day.
rclone handles encryption and decryption, and retries automatically if an upload fails.
Schedule: prune at 1:00, GC (garbage collection) at 2:00, rclone at 3:00, every day

configuration

google api

To be added.
Obtain:
client ID
client secret
The application needs to be published to prevent the token from expiring.

pbs

curl https://rclone.org/install.sh | bash
rclone config
# create google drive remote
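# at the verification prompt answer N, then complete the verification on another machine that has a browser

# create a crypt (encrypted) remote on top of the Google Drive remote
# back up the rclone config file!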

timer

crontab -e

0 3 * * * /usr/bin/rclone sync /mnt/datastore/backupstore gdrive:pbs_backup --config ~/.config/rclone/rclone.conf --fast-list --checkers 64 --transfers 16 --delete-after --retries 3 --retries-sleep 10s >> /var/log/rclone-cron.log 2>&1

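Currently --delete-after is used. Be aware that the remote may exceed its storage limit, because old files are deleted only after the new files have been uploaded.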
--checkers 64 --transfers 16 does not matter much based on local testing
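
Before enabling the cron job against a real remote, it may be safer to run the same sync once with rclone's --dry-run flag, which reports what would be transferred or deleted without changing anything. This is a sketch only: the function name pbs_remote_sync and the SRC and DEST variables are hypothetical, and the defaults assume the datastore path from the PBS overview and the remote name from the cron line above.

```shell
# Runs the same kind of sync as the cron job.
# With DRY_RUN=1, rclone only reports what it would do.
pbs_remote_sync() {
  src=${SRC:-/mnt/datastore/backupstore}   # assumed datastore path
  dest=${DEST:-gdrive:pbs_backup}          # assumed remote name
  if [ "${DRY_RUN:-0}" = "1" ]; then
    set -- --dry-run
  else
    set --
  fi
  rclone sync "$src" "$dest" \
    --delete-after --retries 3 --retries-sleep 10s "$@"
}

# Usage on the PBS VM:
#   DRY_RUN=1 pbs_remote_sync    # preview only
#   pbs_remote_sync              # real sync
```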

Known issue

Be aware of the token expiry problem.