# hardware infra document

[Work log](https://hackmd.io/SxOoymmkSNiYl8lURLt2tA)

## System Architecture

Our infrastructure is built on a three-node Proxmox cluster. The cluster provides the main virtualization platform for project services and internal virtual machines.

In addition to the Proxmox cluster, we deploy dedicated infrastructure services for shared storage, backup, logging, and monitoring.

### Core Components

| Component | Hostname | IP | Role |
|---|---|---:|---|
| Proxmox Node 1 | pve1 | 172.16.127.16 | Proxmox cluster node |
| Proxmox Node 2 | pve2 | 172.16.127.17 | Proxmox cluster node; hosts the PBS VM |
| Proxmox Node 3 | pve3 | 172.16.127.18 | Proxmox cluster node |
| Proxmox Backup Server | backup | 172.16.127.112 | Backup service for Proxmox VMs |
| Logging / Monitoring Server | logging | 172.16.127.115 | Grafana, Loki, and Prometheus |

### High-Level Design

The infrastructure is divided into four main layers:

1. **Virtualization Layer**

   The virtualization layer is provided by a three-node Proxmox cluster consisting of `pve1`, `pve2`, and `pve3`. These nodes are joined into the same Proxmox cluster and are managed under the same Proxmox datacenter view.

2. **Shared Storage Layer**

   Ceph is enabled on the Proxmox cluster. An RBD storage pool is created and used as shared storage for VM disks. This allows VM storage to be shared across the Proxmox nodes.

3. **Backup Layer**

   Proxmox Backup Server is deployed as a VM on `pve2`. The PBS VM is used to store and manage VM backups from the Proxmox cluster.

   Currently, the PBS VM is fixed on `pve2` and is not configured with High Availability. If `pve2` goes down, the backup service will not automatically migrate to another node.

4. **Logging and Monitoring Layer**

   Grafana, Loki, and Prometheus are deployed on the logging server.

   - Loki is used to store logs.
   - Grafana is used to visualize logs and metrics.
   - Prometheus is used to collect metrics.
   - Grafana Alloy is installed on Proxmox nodes to collect systemd journal logs and forward them to Loki.
   - node-exporter is installed on Proxmox nodes to expose host metrics to Prometheus.

### Architecture Diagram

```text
              ┌─────────────────────────────┐
              │       Proxmox Cluster       │
              │                             │
              │ ┌──────┐ ┌──────┐ ┌──────┐  │
              │ │ pve1 │ │ pve2 │ │ pve3 │  │
              │ └──────┘ └──────┘ └──────┘  │
              └──────────────┬──────────────┘
                             │
       ┌─────────────────────┼─────────────────────┐
       │                     │                     │
       v                     v                     v
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────────┐
│ Ceph RBD        │ │ Proxmox Backup  │ │ Logging / Monitoring│
│ Shared Storage  │ │ Server VM       │ │ Server              │
│                 │ │                 │ │                     │
│ VM disk storage │ │ runs on pve2    │ │ Grafana             │
└─────────────────┘ │ no HA enabled   │ │ Loki                │
                    └─────────────────┘ │ Prometheus          │
                                        └─────────────────────┘
```

### Known issue

- Some machines cannot reboot unattended and have to be taken through the BIOS by hand. We suspect that leftovers from the previous users were not cleaned up completely. (So if the SC team reboots a Proxmox node, someone may have to go to the server room to bring it back up.)

## Network Design

This infrastructure mainly uses an internal private network for Proxmox nodes, physical machines, and virtual machines. Public IP addresses are only used where external access is required.

### Internal Network

The internal network uses the following subnet:

| Item | Value |
|---|---|
| Internal subnet | `172.16.0.0/16` |
| Gateway | `172.16.0.1` |
| Project usable range | `172.16.127.0` - `172.16.127.200` |

The project mainly uses the `172.16.127.0/24` range for infrastructure machines, Proxmox nodes, and virtual machines. IP addresses after `172.16.127.200` are reserved for the SC machines.
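Before assigning a new address, it can help to check it against the usable range in the table above. The following is a minimal sketch, not part of the original setup: the function name `in_project_range` is hypothetical, and it only encodes the range `172.16.127.0` - `172.16.127.200` stated above.

```shell
#!/bin/sh
# Hypothetical helper: prints "yes" if the address lies inside the
# project usable range 172.16.127.0 - 172.16.127.200, otherwise "no".
in_project_range() {
    ip=$1
    case $ip in
        172.16.127.*) last=${ip#172.16.127.} ;;   # keep only the last octet
        *) echo "no"; return 0 ;;
    esac
    # Reject empty, non-numeric, or overly long last octets.
    case $last in
        ''|*[!0-9]*|????*) echo "no"; return 0 ;;
    esac
    if [ "$last" -le 200 ]; then
        echo "yes"
    else
        echo "no"
    fi
}
```

For example, `in_project_range 172.16.127.101` prints `yes`, while `in_project_range 172.16.127.201` prints `no`.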
### IP Allocation

| Range | Usage |
|---|---|
| `172.16.127.11` - `172.16.127.15` | SC physical machines |
| `172.16.127.16` - `172.16.127.18` | Proxmox cluster nodes |
| `172.16.127.101` and above | Project VMs and infrastructure services |
| `172.16.127.112` | Proxmox Backup Server |
| `172.16.127.115` | Logging / monitoring server |

### Important Infrastructure IPs

| Name | Hostname | IP | Usage |
|---|---|---:|---|
| SC1 | `sc1` | `172.16.127.11` | SC physical machine |
| SC2 | `sc2` | `172.16.127.12` | SC physical machine |
| SC3 | `sc3` | `172.16.127.13` | SC physical machine |
| SC4 | `sc4` | `172.16.127.14` | SC physical machine |
| SC5 | `sc5` | `172.16.127.15` | SC physical machine |
| RM1 | `pve1` | `172.16.127.16` | Proxmox node |
| RM2 | `pve2` | `172.16.127.17` | Proxmox node |
| RM3 | `pve3` | `172.16.127.18` | Proxmox node |
| PBS | `backup` | `172.16.127.112` | Backup server |
| Logging | `logging` | `172.16.127.115` | Grafana, Loki, Prometheus |

### Public Network

The public IP range is available through VLAN tagging.

| Item | Value |
|---|---|
| Public IP range | `140.112.187.48-55/27` |
| Public gateway | `140.112.187.62` |
| VLAN tag | `187` |

Currently, public IP addresses are only used by the firewall. Most internal services and VMs use the private `172.16.0.0/16` network.

### how to configure pve's IP

```bash
ip a                          # list interfaces and current addresses
vi /etc/network/interfaces    # edit the bridge configuration shown below
systemctl restart networking
ip link set up ...
```

- `/etc/network/interfaces`

```
iface nic1 inet manual

auto vmbr0
iface vmbr0 inet static
    address 172.16.127.12/16
    gateway 172.16.0.1
    bridge-ports nic1
    bridge-stp off
    bridge-fd 0
```

#### changing hostname

```bash
vi /etc/hosts     ## sc1.nasa sc1
vi /etc/hostname  ## sc1
reboot
```

### VM Network Design

All VMs are connected through the Proxmox Linux bridge `vmbr0`.

```text
VM network interface
        │
        v
Proxmox vmbr0
        │
        v
Physical network interface
        │
        v
Internal network: 172.16.0.0/16
```

#### how to configure VM static IP

A typical VM network configuration uses a static internal IP address, the internal gateway, and public DNS resolvers.

#### ubuntu

Modify `/etc/netplan/*.yaml`:

```
network:
  version: 2
  ethernets:
    ens18:
      dhcp4: false
      addresses:
        - 172.16.127.101/16
      routes:
        - to: default
          via: 172.16.0.1
      nameservers:
        addresses:
          - 8.8.8.8
          - 1.1.1.1
```

```bash
sudo netplan apply
```

#### arch linux

```bash
sudo nmcli connection modify "Wired connection 1" \
  ipv4.addresses 172.16.127.105/24 \
  ipv4.gateway 172.16.127.1 \
  ipv4.dns "8.8.8.8 1.1.1.1" \
  ipv4.method manual
```

#### windows (and installation guide)

1. Download the [windows iso](https://www.microsoft.com/zh-tw/software-download/windows11) and the [driver iso](https://fedorapeople.org/groups/virt/virtio-win/direct-downloads/?utm_source=chatgpt.com).
2. When creating the VM, remember to install the network driver. If you skipped it, you can still install it after Windows is up.
3. To skip the "connect to a network" step, press Shift+F10 to open cmd, enter `OOBE\BYPASSNRO`, and let the machine restart.
4. Go to Settings -> Ethernet -> IP assignment, set it to Manual, and turn on IPv4:

   ```
   IP: 172.16.127.106
   subnet mask: 255.255.0.0
   gateway: 172.16.0.1
   DNS: 8.8.8.8
   ```

### VMs network and service

| IP | hostname | username | os |
| -- | -- | -- | -- |
| 172.16.127.101 | room1-nasa3 | room1 | ubuntu |
| 172.16.127.102 | mail1-nasa3 | mail1 | ubuntu |
| 172.16.127.103 | printu1-nasa3 | printu1 | ubuntu |
| 172.16.127.104 | database1-nasa3 | database1 | ubuntu |
| 172.16.127.105 | iden1-nasa3 | iden1 | arch |
| 172.16.127.106 | printw1-nasa3 | printw1 | windows |
| 172.16.127.107 | room2-nasa3 | room2 | ubuntu |
| 172.16.127.108 | room3-nasa3 | room3 | ubuntu |
| 172.16.127.109 | iden2-nasa3 | iden2 | arch |
| 172.16.127.110 | wifi1-nasa3 | wifi1 | ubuntu |
| 172.16.127.112 | backup | root | proxmox-backup-server |
| 172.16.127.113 | uiux1 | uiux1 | ubuntu |
| 172.16.127.114 | uiux2 | uiux2 | ubuntu |
| 172.16.127.115 | log | logging | ubuntu |
| 172.16.127.116 | mail2 | mail2 | ubuntu |
| 172.16.127.117 | mail3 | mail3 | ubuntu |

### Known issue

- USB NIC instability

  Because the BIOS is an old version, some SC machines (sc1, sc2, sc4) use USB network adapters. One known issue is that the USB NIC on sc1 (172.16.127.11) may disconnect by itself shortly after being connected.

  To mitigate this issue, disable USB autosuspend through GRUB. Edit `/etc/default/grub` and add `usbcore.autosuspend=-1`, e.g.

  ```
  GRUB_CMDLINE_LINUX_DEFAULT="quiet usbcore.autosuspend=-1"
  ```

  and then update grub:

  ```
  sudo update-grub
  ```

- Some machines have stale (fake) `enx...` interfaces; we have not removed them.

## Proxmox Cluster Setup

The virtualization platform is built as a three-node Proxmox cluster. The cluster was initially created on `pve1`, and `pve2` and `pve3` were later joined into the same cluster.

The cluster allows all Proxmox nodes to be managed under the same datacenter view and provides the base infrastructure for VM management, shared storage, backup, and monitoring.
### Cluster Overview

| Node | Hostname | IP | Role |
|---|---|---:|---|
| RM1 | `pve1` | `172.16.127.16` | Proxmox cluster node; initial cluster creation node |
| RM2 | `pve2` | `172.16.127.17` | Proxmox cluster node |
| RM3 | `pve3` | `172.16.127.18` | Proxmox cluster node |

All three nodes are connected through the internal network `172.16.0.0/16`. The Proxmox cluster communication uses the network interface configured on the Proxmox nodes.

### Cluster Creation

The cluster was created from `pve1`. After the cluster was created, the other two Proxmox nodes were joined into the cluster:

```text
pve1 creates the cluster
pve2 joins the cluster
pve3 joins the cluster
```

Commands:

```bash
# on pve1
pvecm create rm-nasa
# on pve2 and pve3
pvecm join 172.16.127.16
```

### Cluster Purpose

The Proxmox cluster is used to:

- manage all Proxmox nodes from a single datacenter view
- host project virtual machines
- provide the base environment for Ceph shared storage
- integrate with Proxmox Backup Server
- provide the foundation for HA and VM migration

HA and VM migration depend on shared storage and are documented in the shared storage section.

### Note

- USB network adapters should not be used for Proxmox cluster communication because they may be unstable. In this setup, USB NIC issues are limited to SC machines and are documented in the Network Design section.
- You can use `pvecm status` to verify the cluster's health.

### Known Issue - Cluster Reinstallation

During earlier setup, the machine using 172.16.127.17 was replaced. At that time, old cluster information could not be fully removed from pve1 and pve3, so all three Proxmox nodes were reinstalled and the cluster was rebuilt from scratch.
This issue is important for future maintenance:

- avoid changing cluster node identity casually
- keep hostname and IP address consistent
- remove a node cleanly before replacing it
- verify cluster state before rejoining a replaced node
- if old cluster metadata cannot be removed cleanly, reinstallation may be required

## Shared storage and HA

Ceph is used as the shared storage backend for the Proxmox cluster. All three Proxmox nodes participate in the Ceph cluster, and each Proxmox node provides one Ceph OSD. A Ceph RBD pool named `vm-data` is used as shared VM disk storage.

### working history

#### setup flow

```
1. Enable Proxmox no-subscription repository
2. Install Ceph on all Proxmox nodes
3. Create Ceph monitors
4. Prepare one OSD disk on each Proxmox node
5. Create OSDs
6. Create the Ceph pool vm-data
7. Set pool size=3 and min_size=2
8. Add vm-data to Proxmox as RBD storage
9. Move VM disks to vm-data when HA/migration is required
10. Enable HA for VMs stored on vm-data
11. Test VM migration and HA failover
```

#### install ceph

Before installing Ceph, the Proxmox no-subscription repositories should be enabled. Example repository configuration:

```bash
# disable the enterprise repositories
mv /etc/apt/sources.list.d/pve-enterprise.sources /etc/apt/sources.list.d/pve-enterprise.sources.bak
mv /etc/apt/sources.list.d/ceph.sources /etc/apt/sources.list.d/ceph.sources.bak

# add the no-subscription repositories (deb822 stanzas are separated by a blank line)
cat >/etc/apt/sources.list.d/proxmox.sources <<'EOF'
Types: deb
URIs: http://download.proxmox.com/debian/pve
Suites: trixie
Components: pve-no-subscription

Types: deb
URIs: http://download.proxmox.com/debian/ceph-squid
Suites: trixie
Components: no-subscription
EOF

apt update
# Proxmox recommends full-upgrade (dist-upgrade) rather than a plain upgrade
apt full-upgrade
```

When installing Ceph from the Proxmox web UI, choose the `no-subscription` repository. Then create the monitors in the web UI, and after that create the OSDs.

#### install OSD

Before creating an OSD, the selected disk should be cleaned.
```
wipefs -a /dev/nvme1n1
sgdisk --zap-all /dev/nvme1n1
partprobe /dev/nvme1n1
```

Then create the OSD in the web UI: Ceph -> OSD -> Create OSD.

Next, create a pool under Datacenter -> Pool -> Create pool, and check that it shows up under Datacenter -> Storage.

### Purpose

The shared storage layer is used to:

- provide shared VM disk storage across `pve1`, `pve2`, and `pve3`
- allow VMs to be migrated between Proxmox nodes
- support Proxmox HA for selected VMs
- avoid tying VM disks to only one physical host

### Current Ceph Configuration

| Item | Value |
|---|---|
| Ceph nodes | `pve1`, `pve2`, `pve3` |
| OSD layout | one OSD per Proxmox node |
| RBD pool name | `vm-data` |
| Pool size | `3` |
| Pool min_size | `2` |
| Usage | shared VM disk storage |

The pool uses `size=3`, which means Ceph keeps three replicated copies of data. The `min_size=2` setting means the pool can continue operating as long as at least two replicas are available.

### VM Storage Policy

Most VM disks are stored on the Ceph RBD pool `vm-data`.

Exceptions:

| VM / Service | Storage Policy | Reason |
|---|---|---|
| Proxmox Backup Server | Not stored on RBD | PBS is fixed on `pve2` and uses a dedicated disk/datastore |
| Firewall | Not stored on RBD | Firewall storage is managed separately |
| Other VMs | Stored on RBD | Supports migration and HA |

### HA Policy

HA is enabled for VMs whose disks are stored on shared RBD storage.

In this setup:

- VMs stored on `vm-data` can be managed by Proxmox HA.
- PBS is not HA-enabled because it is fixed on `pve2`.
- Firewall is not included in this RBD-based HA setup.
- HA behavior depends on the VM disk being stored on shared storage.

Before enabling HA for a VM, verify that its disks are stored on `vm-data` instead of local storage.
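The check above can be sketched as a small filter over the output of `qm config <vmid>`. This is only an illustration under stated assumptions: the function name `disks_not_on_pool` is hypothetical, and it assumes disk lines look like `scsi0: vm-data:vm-101-disk-0,size=32G` (a bus key followed by `storage:volume`).

```shell
#!/bin/sh
# Hypothetical helper: reads `qm config <vmid>` output on stdin and prints
# every disk line whose storage is not the given pool.
disks_not_on_pool() {
    pool=$1
    awk -v pool="$pool" '
        /^(scsi|virtio|sata|ide|efidisk|tpmstate)[0-9]+:/ {
            if ($0 ~ /media=cdrom/) next   # ignore CD-ROM drives
            split($2, parts, ":")          # text before the first ":" is the storage name
            if (parts[1] != pool) print $1, $2
        }'
}

# Usage on a Proxmox node (VM ID 101 is only an example):
#   qm config 101 | disks_not_on_pool vm-data
# No output means every disk of that VM is on vm-data.
```

Passed-through disks such as `/dev/disk/by-id/...` are also reported, which matches the policy that only VMs fully on `vm-data` should be HA-managed.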
### Ceph Components

| Component | Description |
|---|---|
| Ceph MON | Maintains Ceph cluster state |
| Ceph OSD | Stores actual data on physical disks |
| Ceph Pool | Logical storage pool created on top of OSDs |
| RBD | Block storage interface used by Proxmox for VM disks |

### Failure and HA Test

We tested the HA behavior under failure conditions.

Tested scenarios:

1. One Proxmox node lost network connectivity.
2. One Proxmox node was unexpectedly shut down.
3. One disk failed (to be done).

Result:

- Proxmox HA was able to detect the failed node.
- HA-managed VMs were restarted on another available Proxmox node.
- VMs using the shared RBD storage were able to continue operating after failover.
- The test confirmed that the Ceph RBD shared storage and Proxmox HA setup work correctly for the tested failure cases.

This confirms that the current shared storage design supports HA failover for VMs stored on `vm-data`.

## backup

Proxmox Backup Server (PBS) is used as the centralized backup service for virtual machines in the Proxmox cluster.

In this infrastructure, PBS is deployed as a VM on `pve2`. It stores VM backups from the Proxmox cluster and provides backup management through the PBS web interface and Proxmox Datacenter integration.
### Deployment Notes

We created a PBS VM on `pve2` and selected `/dev/nvme2n1` as its storage. First clean the disk, then attach it to the VM:

```bash
ls -l /dev/disk/by-id/ | grep nvme   # check the disk ID
qm set 112 -scsi1 /dev/disk/by-id/nvme-ID
```

Next, log in to the VM and run `lsblk`; the disk should show up as `/dev/sdb`. Create a filesystem on it, mount it, and add it to `/etc/fstab`. Then:

```bash
proxmox-backup-manager datastore create backupstore /mnt/datastore/backupstore
proxmox-backup-manager datastore list   # check
```

Finally, add it under Datacenter -> Storage as a Proxmox Backup Server storage, and add the backup rule under Datacenter -> Backup.

### PBS Overview

| Item | Value |
|---|---|
| Service | Proxmox Backup Server |
| Hostname | `backup` |
| IP | `172.16.127.112` |
| VM ID | `112` |
| VM location | `pve2` |
| HA | Not enabled |
| Datastore name | `backupstore` |
| Datastore path | `/mnt/datastore/backupstore` |
| Backup schedule | Daily at 21:00 |
| Retention policy | Keep backups from the last 7 days |
| Backup scope | All VMs |

### Design Decision

PBS is deployed as a VM on `pve2`. Because PBS itself is the backup target, it is not stored on the Ceph RBD shared storage and is not managed by Proxmox HA.

Instead, PBS uses a dedicated disk attached to the PBS VM as its datastore. This design keeps backup storage separate from the Ceph shared VM storage layer.

However, PBS is a single point of failure in the current design. If `pve2` is unavailable, PBS will also be unavailable and scheduled backup jobs cannot run until `pve2` and the PBS VM are restored.

### Storage Design

PBS uses a dedicated disk attached to the PBS VM. The disk is mounted inside the PBS system and used as the backup datastore.
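Since the datastore depends on the dedicated disk actually being mounted, a quick check inside the PBS VM can catch a missing mount (for example after a reboot with a broken `fstab` entry). The sketch below is an assumption-laden illustration, not part of the original setup: the function name `is_mount_target` is hypothetical, and it reads a mount table in the `/proc/mounts` format from stdin.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit 0) if the given path appears as a
# mount point in a /proc/mounts-style table read from stdin.
is_mount_target() {
    target=$1
    awk -v t="$target" '
        $2 == t { found = 1 }          # field 2 of /proc/mounts is the mount point
        END { exit (found ? 0 : 1) }'
}

# Usage inside the PBS VM:
#   is_mount_target /mnt/datastore/backupstore < /proc/mounts \
#       || echo "datastore disk is not mounted"
```

If the path is not a mount point, writes to `/mnt/datastore/backupstore` would land on the VM's root disk instead of the dedicated disk.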
```text
pve2
└── PBS VM: backup / 172.16.127.112
    └── dedicated disk
        └── /mnt/datastore/backupstore
            └── PBS datastore: backupstore
```

#### backup policy

| Item          | Value                             |
| ------------- | --------------------------------- |
| Backup target | PBS datastore `backupstore`       |
| Backup server | `172.16.127.112`                  |
| Schedule      | Daily at 21:00                    |
| Scope         | All VMs                           |
| Retention     | Keep backups from the last 7 days |

### flow

```
All VMs on Proxmox Cluster
        │
        v
Daily backup job at 21:00
        │
        v
PBS VM on pve2
        │
        v
PBS datastore: backupstore
        │
        v
/mnt/datastore/backupstore
```

### Restore Status

Restore has not been tested yet.

This is an important remaining task. A backup system should not be considered fully verified until at least one restore test has been completed successfully.

Recommended restore test:

- Select a non-critical VM or create a temporary test VM.
- Run a manual backup to PBS.
- Restore the VM from PBS.
- Boot the restored VM.
- Verify network connectivity and service availability.
- Document the restore result.

## log collection

The logging system is built with Grafana Loki, Grafana, and Grafana Alloy.

Loki is used as the log storage backend, Grafana is used for querying and visualizing logs, and Alloy is deployed on Proxmox nodes as the log collection agent.
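To check Loki independently of Alloy, a single test line can be sent to the push endpoint used later in this section. The following is a minimal sketch under stated assumptions: the function name `loki_push_payload` and the label `job="manual-test"` are hypothetical, and the message is assumed to contain no double quotes or backslashes (no JSON escaping is done).

```shell
#!/bin/sh
# Hypothetical helper: builds the JSON body for Loki's /loki/api/v1/push
# endpoint. Arguments: timestamp in nanoseconds, host label, log message.
loki_push_payload() {
    ts_ns=$1
    host=$2
    message=$3
    printf '{"streams":[{"stream":{"job":"manual-test","host":"%s"},"values":[["%s","%s"]]}]}\n' \
        "$host" "$ts_ns" "$message"
}

# Usage (seconds since epoch padded to nanoseconds):
#   curl -s -H 'Content-Type: application/json' -X POST \
#     --data "$(loki_push_payload "$(date +%s)000000000" "$(hostname)" "manual test")" \
#     http://172.16.127.115:3100/loki/api/v1/push
```

The line should then be visible in Grafana Explore with the selector `{job="manual-test"}`.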
### Deployment Notes

Create an Ubuntu VM, then install Loki and Grafana:

```bash
sudo mkdir -p /etc/apt/keyrings
wget -q -O - https://apt.grafana.com/gpg.key \
  | gpg --dearmor \
  | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" \
  | sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt update
sudo apt install loki
sudo apt install grafana
sudo systemctl enable grafana-server
sudo systemctl start grafana-server
```

Open `http://172.16.127.115:3000`, log in as `admin`, and change the password. Then go to Connections -> Add new connection and add Loki with the address `127.0.0.1:3100`.

#### install alloy on pve

```bash
sudo mkdir -p /etc/apt/keyrings
sudo wget -O /etc/apt/keyrings/grafana.asc https://apt.grafana.com/gpg-full.key
sudo chmod 644 /etc/apt/keyrings/grafana.asc
echo "deb [signed-by=/etc/apt/keyrings/grafana.asc] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt-get update
sudo apt-get install alloy
```

- `/etc/alloy/config.alloy`

```
logging {
  level  = "info"
  format = "logfmt"
}

loki.source.journal "proxmox_journal" {
  max_age       = "12h"
  relabel_rules = discovery.relabel.journal.rules
  forward_to    = [loki.write.default.receiver]
  labels = {
    job     = "systemd-journal",
    cluster = "proxmox-cluster",
  }
}

discovery.relabel "journal" {
  targets = []

  rule {
    source_labels = ["__journal__hostname"]
    target_label  = "host"
  }

  rule {
    source_labels = ["__journal__systemd_unit"]
    target_label  = "unit"
  }

  rule {
    source_labels = ["__journal__transport"]
    target_label  = "transport"
  }

  rule {
    source_labels = ["__journal__priority"]
    target_label  = "priority"
  }
}

loki.write "default" {
  endpoint {
    url = "http://172.16.127.115:3100/loki/api/v1/push"
  }
}
```

Test it with `logger "test from $(hostname) at $(date)"`.

We also changed `/etc/loki/config.yml` to:

```
auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096
  log_level: debug
  grpc_server_max_concurrent_streams: 1000

common:
  instance_addr: 127.0.0.1
  path_prefix: /var/lib/loki
  storage:
    filesystem:
      chunks_directory: /var/lib/loki/chunks
      rules_directory: /var/lib/loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

query_range:
  results_cache:
    cache:
      embedded_cache:
        enabled: true
        max_size_mb: 100

limits_config:
  metric_aggregation_enabled: true
  enable_multi_variant_queries: true
  retention_period: 14d

compactor:
  working_directory: /var/lib/loki/compactor
  retention_enabled: true
  delete_request_store: filesystem

schema_config:
  configs:
    - from: 2020-10-24
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

pattern_ingester:
  enabled: true
  metric_aggregation:
    loki_address: localhost:3100

ruler:
  alertmanager_url: http://localhost:9093

frontend:
  encoding: protobuf

# By default, Loki will send anonymous, but uniquely-identifiable usage and configuration
# analytics to Grafana Labs. These statistics are sent to https://stats.grafana.org/
#
# Statistics help us better understand how Loki is used, and they show us performance
# levels for most users. This helps us prioritize features and documentation.
# For more information on what's sent, look at
# https://github.com/grafana/loki/blob/main/pkg/analytics/stats.go
# Refer to the buildReport method to see what goes into a report.
#
# If you would like to disable reporting, uncomment the following lines:
#analytics:
#  reporting_enabled: false
```

```bash
#enable https
sudo mkdir -p /etc/nginx/ssl
sudo openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
  -keyout /etc/nginx/ssl/nasa3log.csie.org.key \
  -out /etc/nginx/ssl/nasa3log.csie.org.crt \
  -subj "/CN=nasa3log.csie.org"
```

```
#/etc/nginx/sites-available/grafana
server {
    listen 80;
    server_name nasa3log.csie.org;
    return 301 https://$host$request_uri;
}

server {
    listen 443 ssl http2;
    server_name nasa3log.csie.org;

    ssl_certificate     /etc/nginx/ssl/nasa3log.csie.org.crt;
    ssl_certificate_key /etc/nginx/ssl/nasa3log.csie.org.key;

    location / {
        proxy_pass http://127.0.0.1:3000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto https;
        proxy_set_header X-Forwarded-Host $host;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}
```

### Overview

| Component | Host | IP | Role |
|---|---|---:|---|
| Grafana | `logging` | `172.16.127.115` | Log visualization and dashboard UI |
| Loki | `logging` | `172.16.127.115` | Log storage backend |
| Grafana Alloy | `pve1`, `pve2`, `pve3` | Proxmox nodes | Collects systemd journal logs and forwards them to Loki |

### Purpose

The logging system is used to:

- collect system logs from Proxmox nodes
- centralize logs in Loki
- query logs from Grafana
- help debug Proxmox, Ceph, HA, and node-level issues
- provide a foundation for future application log integration

### Current Scope

Currently, Alloy is only installed on the three Proxmox nodes:

- `pve1`
- `pve2`
- `pve3`

The current logging scope is limited to Proxmox node systemd journal logs.
| Source | Status |
|---|---|
| Proxmox node systemd journal logs | Enabled |
| Ceph-related logs through journal | Available if written to systemd journal |
| VM application logs | Not yet integrated |
| External application logs | Not yet integrated |

### Architecture

```text
Proxmox Nodes
pve1 / pve2 / pve3
        │
        │ systemd journal logs
        v
Grafana Alloy
        │
        │ push logs
        v
Loki
172.16.127.115:3100
        │
        │ query
        v
Grafana
172.16.127.115:3000
```

### logging server

| Item           | Value            |
| -------------- | ---------------- |
| Hostname       | `logging`        |
| IP             | `172.16.127.115` |
| OS             | Ubuntu           |
| Grafana port   | `3000`           |
| Loki port      | `3100`           |
| Loki retention | `14 days`        |

### grafana

| Item                      | Status              |
| ------------------------- | ------------------- |
| Loki data source          | Configured          |
| Log query through Explore | Available           |
| Log dashboard             | Not finalized yet   |
| Application log dashboard | Not implemented yet |

### Grafana Alloy

Grafana Alloy is installed on each Proxmox node. Alloy reads logs from the systemd journal and forwards them to Loki.

In this setup, Alloy collects logs from Proxmox nodes and attaches useful labels such as hostname, systemd unit, transport, and priority.

Alloy is not currently installed on project VMs or application servers.
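The labels attached by Alloy can be combined into a LogQL stream selector to narrow a query to one node or one service. The sketch below is illustrative only: the function name `logql_selector` is hypothetical, and it assumes the `host` label carries the node hostname (for example `pve1`) and the `unit` label carries the systemd unit name.

```shell
#!/bin/sh
# Hypothetical helper: builds a LogQL selector for the Proxmox journal
# stream. Both arguments are optional; empty values are left out.
logql_selector() {
    host=$1
    unit=$2
    sel='job="systemd-journal"'
    if [ -n "$host" ]; then
        sel="$sel, host=\"$host\""
    fi
    if [ -n "$unit" ]; then
        sel="$sel, unit=\"$unit\""
    fi
    printf '{%s}\n' "$sel"
}

# Usage against Loki's HTTP API:
#   curl -s -G http://172.16.127.115:3100/loki/api/v1/query_range \
#     --data-urlencode "query=$(logql_selector pve1 pveproxy.service)"
```

The same selector string can be pasted into Grafana Explore, optionally followed by a line filter such as `|= "error"`.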
### label

| Label       | Source                    | Meaning                                      |
| ----------- | ------------------------- | -------------------------------------------- |
| `job`       | static label              | Log source type, currently `systemd-journal` |
| `cluster`   | static label              | Cluster name, currently `proxmox-cluster`    |
| `host`      | `__journal__hostname`     | Hostname of the Proxmox node                 |
| `unit`      | `__journal__systemd_unit` | systemd unit name                            |
| `transport` | `__journal__transport`    | Journal transport type                       |
| `priority`  | `__journal__priority`     | Syslog priority level                        |

### Example Queries

In Grafana, go to Explore -> Loki.

- Query all Proxmox journal logs: `{job="systemd-journal"}`
- Search logs containing a keyword: `{job="systemd-journal"} |= "error"`
- Guide: [query](https://grafana.com/docs/grafana/latest/visualizations/panels-visualizations/query-transform-data/expression-queries/)

### usage

Application logs should preferably be written in JSON format, for example `{"time":"2026-05-09 12:00:00","level":"INFO","logger":"app","message":"hello endpoint called"}`. `level` may be `INFO`, `WARNING`, or `ERROR`.

Install Alloy on the application host:

```bash
sudo mkdir -p /etc/apt/keyrings
wget -q -O - https://apt.grafana.com/gpg.key \
  | gpg --dearmor \
  | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" \
  | sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt update   # refresh the package index after adding the repository
sudo apt install alloy
```

Then configure Alloy in `/etc/alloy/config.alloy`:

```
logging {
  level  = "info"
  format = "logfmt"
}

loki.source.file "django_app" {
  targets = [
    {
      __path__ = "",
      job      = "",
      source   = "",
      app      = "",
      env      = "",
      host     = "",
    },
  ]
  forward_to = [loki.write.default.receiver]
}

loki.write "default" {
  endpoint {
    url = "http://172.16.127.115:3100/loki/api/v1/push"
  }
}
```

Then enable and restart the service:

```bash
sudo systemctl enable alloy
sudo systemctl restart alloy
journalctl -u alloy -n 100 --no-pager   # check that there are no errors
```

#### troubleshooting

Errors may occur when Alloy does not have permission to read your log file; make sure the log file has the right
permissions.

## Monitoring System: Prometheus + node-exporter + Grafana

The monitoring system is built with Prometheus, node-exporter, and Grafana.

Prometheus is used to collect metrics, node-exporter is installed on Proxmox nodes to expose host-level metrics, and Grafana is used to visualize the metrics through dashboards.

### Deployment Notes

On Ubuntu (the logging server):

```
sudo apt install -y prometheus
```

and change `/etc/prometheus/prometheus.yml` to

```
# Sample config for Prometheus.

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    monitor: 'example'

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

scrape_configs:
  - job_name: 'prometheus'
    scrape_interval: 5s
    scrape_timeout: 5s
    static_configs:
      - targets: ['localhost:9090']

  - job_name: node
    static_configs:
      - targets: ['localhost:9100']

  - job_name: 'proxmox-nodes'
    static_configs:
      - targets:
          - '172.16.127.16:9100'
          - '172.16.127.17:9100'
          - '172.16.127.18:9100'

  - job_name: 'ceph'
    static_configs:
      - targets:
          - '172.16.127.16:9283'
          - '172.16.127.17:9283'
          - '172.16.127.18:9283'
```

On each pve node:

```bash
sudo apt install -y prometheus-node-exporter
ceph mgr module enable prometheus
```

### Overview

| Component | Host | IP | Role |
|---|---|---:|---|
| Prometheus | `logging` | `172.16.127.115` | Metrics collection and storage |
| Grafana | `logging` | `172.16.127.115` | Metrics dashboard and visualization |
| node-exporter | `pve1`, `pve2`, `pve3` | Proxmox nodes | Exposes host metrics |

### Purpose

The monitoring system is used to:

- monitor Proxmox node status
- collect CPU, memory, disk, and network metrics
- check whether Proxmox nodes are reachable
- provide dashboards for infrastructure status
- support future alerting and troubleshooting

### Current Scope

Currently, node-exporter is installed on the Proxmox nodes.
| Source | Status |
|---|---|
| Proxmox node metrics | Enabled |
| VM metrics | Not yet integrated |
| Application metrics | Not yet integrated |
| Ceph metrics | Not fully integrated yet |
| Alerting | Not yet configured |

### Architecture

```text
Proxmox Nodes
pve1 / pve2 / pve3
        │
        │ expose metrics on :9100
        v
node-exporter
        │
        │ scraped by Prometheus
        v
Prometheus
172.16.127.115:9090
        │
        │ queried by Grafana
        v
Grafana
172.16.127.115:3000
```

### monitoring server

| Item               | Value            |
| ------------------ | ---------------- |
| Hostname           | `logging`        |
| IP                 | `172.16.127.115` |
| Prometheus port    | `9090`           |
| Grafana port       | `3000`           |
| node-exporter port | `9100`           |

### check it in grafana

Dashboards -> system info

## Remote Backup

> We currently do not have remote storage, so these settings will be implemented as soon as we get one (planned: Google Drive). The configuration below was tested on my machine.

### architecture

- Use the existing Proxmox Backup Server and `rclone` to sync the encrypted backup files to the remote once a day.
- rclone handles encryption and decryption, and retries automatically if an upload fails.
- Schedule: `prune` 1:00, `gc` (garbage collection) 2:00, `rclone` 3:00 every day

### configuration

#### google api

```
To be added
Obtain:
  client ID
  client secret
The app must be published to avoid token expiry
```

#### pbs

```
curl https://rclone.org/install.sh | bash
rclone config
# create the Google Drive remote; answer N at the (verify) prompt, then verify on another machine that has a browser
# create an encrypted remote based on the first remote
# back up the config file!!!
```

#### timer

crontab -e

```
0 3 * * * /usr/bin/rclone sync /mnt/datastore/backupstore gdrive:pbs_backup --config ~/.config/rclone/rclone.conf --fast-list --checkers 64 --transfers 16 --delete-after --retries 3 --retries-sleep 10s >> /var/log/rclone-cron.log 2>&1
```

- Currently uses `--delete-after`. Be aware that the remote may temporarily exceed its storage quota, because old files are deleted only after the new ones have been uploaded.
- `--checkers 64 --transfers 16` does not matter much based on local testing.

### Known issue

- Be aware of the token expiry problem.
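Because a silently failing sync (for example after the token expires) is easy to miss, the cron job could go through a small wrapper that records the outcome of each run. This is a sketch under stated assumptions, not the tested configuration: the function name `run_and_log` and the label `rclone-sync` are hypothetical.

```shell
#!/bin/sh
# Hypothetical wrapper: runs the given command and prints a timestamped
# OK / FAILED line, passing the command's exit status through on failure.
run_and_log() {
    label=$1
    shift
    if "$@"; then
        echo "$(date '+%F %T') $label OK"
    else
        status=$?
        echo "$(date '+%F %T') $label FAILED (exit $status)"
        return "$status"
    fi
}

# Usage from cron (same rclone arguments as in the crontab entry above):
#   run_and_log rclone-sync /usr/bin/rclone sync ... >> /var/log/rclone-cron.log 2>&1
```

A `FAILED` line in `/var/log/rclone-cron.log` then gives a simple string to grep for or alert on.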