# Hardware Infrastructure Document
[Work log](https://hackmd.io/SxOoymmkSNiYl8lURLt2tA)
## System Architecture

Our infrastructure is built on a three-node Proxmox cluster. The cluster provides the main virtualization platform for project services and internal virtual machines. In addition to the Proxmox cluster, we deploy dedicated infrastructure services for shared storage, backup, logging, and monitoring.

### Core Components

| Component | Hostname | IP | Role |
|---|---|---:|---|
| Proxmox Node 1 | pve1 | 172.16.127.16 | Proxmox cluster node |
| Proxmox Node 2 | pve2 | 172.16.127.17 | Proxmox cluster node; hosts the PBS VM |
| Proxmox Node 3 | pve3 | 172.16.127.18 | Proxmox cluster node |
| Proxmox Backup Server | backup | 172.16.127.112 | Backup service for Proxmox VMs |
| Logging / Monitoring Server | logging | 172.16.127.115 | Grafana, Loki, and Prometheus |

### High-Level Design

The infrastructure is divided into four main layers:

1. **Virtualization Layer**

   The virtualization layer is provided by a three-node Proxmox cluster consisting of `pve1`, `pve2`, and `pve3`. These nodes are joined into the same Proxmox cluster and are managed under the same Proxmox datacenter view.

2. **Shared Storage Layer**

   Ceph is enabled on the Proxmox cluster. An RBD storage pool is created and used as shared storage for VM disks, so VM disks are reachable from every Proxmox node.

3. **Backup Layer**

   Proxmox Backup Server is deployed as a VM on `pve2`. The PBS VM is used to store and manage VM backups from the Proxmox cluster.

   Currently, the PBS VM is fixed on `pve2` and is not configured with High Availability. If `pve2` goes down, the backup service will not automatically migrate to another node.

4. **Logging and Monitoring Layer**

   Grafana, Loki, and Prometheus are deployed on the logging server.

   - Loki is used to store logs.
   - Grafana is used to visualize logs and metrics.
   - Prometheus is used to collect metrics.
   - Grafana Alloy is installed on Proxmox nodes to collect systemd journal logs and forward them to Loki.
   - node-exporter is installed on Proxmox nodes to expose host metrics to Prometheus.

### Architecture Diagram

```text
                           ┌─────────────────────────────┐
                           │        Proxmox Cluster      │
                           │                             │
                           │  ┌──────┐ ┌──────┐ ┌──────┐ │
                           │  │ pve1 │ │ pve2 │ │ pve3 │ │
                           │  └──────┘ └──────┘ └──────┘ │
                           └──────────────┬──────────────┘
                                          │
                    ┌─────────────────────┼─────────────────────┐
                    │                     │                     │
                    v                     v                     v
          ┌─────────────────┐   ┌─────────────────┐   ┌─────────────────────┐
          │   Ceph RBD      │   │ Proxmox Backup  │   │ Logging / Monitoring│
          │ Shared Storage  │   │ Server VM       │   │ Server              │
          │                 │   │                 │   │                     │
          │ VM disk storage │   │ runs on pve2    │   │ Grafana             │
          └─────────────────┘   │ no HA enabled   │   │ Loki                │
                                └─────────────────┘   │ Prometheus          │
                                                      └─────────────────────┘
```

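### Known issue
- Some machines cannot reboot unattended and have to be taken through the BIOS by hand, presumably because earlier administrators did not clean up completely. (So if the SC team reboots Proxmox, someone may have to go to the server room to power the machine on.)
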
## Network Design

This infrastructure mainly uses an internal private network for Proxmox nodes, physical machines, and virtual machines. Public IP addresses are only used where external access is required.

### Internal Network

The internal network uses the following subnet:

| Item | Value |
|---|---|
| Internal subnet | `172.16.0.0/16` |
| Gateway | `172.16.0.1` |
| Project usable range | `172.16.127.0` - `172.16.127.200` |

The project mainly uses the `172.16.127.0/24` range for infrastructure machines, Proxmox nodes, and virtual machines. IP addresses after `172.16.127.200` are reserved for the SC machines.
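
As a rough sketch of the range rule above, the following helper checks whether an address falls inside the project's usable range `172.16.127.0` - `172.16.127.200`. The function name `in_project_range` is hypothetical and is not part of any existing tooling on these machines.

```bash
# Hypothetical helper: succeeds if the IPv4 address lies in the project's
# usable range 172.16.127.0 - 172.16.127.200.
in_project_range() {
  printf '%s\n' "$1" | awk -F. '
    # An IPv4 address must have exactly four dot-separated fields.
    NF != 4 { exit 1 }
    # Every field must be a plain decimal number.
    { for (i = 1; i <= 4; i++) if ($i !~ /^[0-9]+$/) exit 1 }
    # Exit status 0 (success) only when the address is inside the range.
    { exit !($1 == 172 && $2 == 16 && $3 == 127 && $4 <= 200) }
  '
}
```

For example, `in_project_range 172.16.127.101 && echo "ok to assign"` can be run before giving an address to a new VM.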

### IP Allocation

| Range | Usage |
|---|---|
| `172.16.127.11` - `172.16.127.15` | SC physical machines |
| `172.16.127.16` - `172.16.127.18` | Proxmox cluster nodes |
| `172.16.127.101` and above | Project VMs and infrastructure services |
| `172.16.127.112` | Proxmox Backup Server |
| `172.16.127.115` | Logging / monitoring server |

### Important Infrastructure IPs

| Name | Hostname | IP | Usage |
|---|---|---:|---|
| SC1 | `sc1` | `172.16.127.11` | SC physical machine |
| SC2 | `sc2` | `172.16.127.12` | SC physical machine |
| SC3 | `sc3` | `172.16.127.13` | SC physical machine |
| SC4 | `sc4` | `172.16.127.14` | SC physical machine |
| SC5 | `sc5` | `172.16.127.15` | SC physical machine |
| RM1 | `pve1` | `172.16.127.16` | Proxmox node |
| RM2 | `pve2` | `172.16.127.17` | Proxmox node |
| RM3 | `pve3` | `172.16.127.18` | Proxmox node |
| PBS | `backup` | `172.16.127.112` | Backup server |
| Logging | `logging` | `172.16.127.115` | Grafana, Loki, Prometheus |

### Public Network

The public IP range is available through VLAN tagging.

| Item | Value |
|---|---|
| Public IP range | `140.112.187.48-55/27` |
| Public gateway | `140.112.187.62` |
| VLAN tag | `187` |

Currently, public IP addresses are only used by the firewall. Most internal services and VMs use the private `172.16.0.0/16` network.

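### How to configure a PVE node's IP
```bash
ip a                           # list interfaces and current addresses
vi /etc/network/interfaces     # edit the bridge configuration shown below
systemctl restart networking   # apply the new configuration
ip link set up ...             # bring the interface up if it is still down
```
- `/etc/network/interfaces`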
```
iface nic1 inet manual

auto vmbr0
iface vmbr0 inet static
    address 172.16.127.12/16
    gateway 172.16.0.1
    bridge-ports nic1
    bridge-stp off
    bridge-fd 0
```

#### Changing the hostname
```bash
vi /etc/hosts      # set the name mapping, e.g. "sc1.nasa sc1"
vi /etc/hostname   # set the short hostname, e.g. "sc1"
reboot
```

### VM Network Design

All VMs are connected through the Proxmox Linux bridge `vmbr0`:
```text
VM network interface
        │
        v
Proxmox vmbr0
        │
        v
Physical network interface
        │
        v
Internal network: 172.16.0.0/16
```

#### How to configure a VM static IP
A typical VM network configuration uses a static internal IP address, the internal gateway, and public DNS resolvers.
#### Ubuntu
Edit the netplan file under `/etc/netplan/*.yaml`:
```
network:
  version: 2
  ethernets:
    ens18:
      dhcp4: false
      addresses:
        - 172.16.127.101/16
      routes:
        - to: default
          via: 172.16.0.1
      nameservers:
        addresses:
          - 8.8.8.8
          - 1.1.1.1
```
```bash
sudo netplan apply
```
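#### Arch Linux
```bash
sudo nmcli connection modify "Wired connection 1" \
  ipv4.addresses 172.16.127.105/16 \
  ipv4.gateway 172.16.0.1 \
  ipv4.dns "8.8.8.8 1.1.1.1" \
  ipv4.method manual
# re-activate the connection so the new settings take effect
sudo nmcli connection up "Wired connection 1"
```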
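#### Windows (and installation guide)
1. Download the [Windows ISO](https://www.microsoft.com/zh-tw/software-download/windows11) and the [driver ISO](https://fedorapeople.org/groups/virt/virtio-win/direct-downloads/?utm_source=chatgpt.com).
2. When creating the VM, remember to install the network driver. If it was not installed, it can be installed after booting into Windows.
3. To skip the network connection step, press Shift+F10 to open cmd, run `OOBE\BYPASSNRO`, and the machine restarts.
4. Go to Settings, Ethernet, IP assignment, choose Manual, and turn on IPv4: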
    ```
    IP: 172.16.127.106
    subnet mask: 255.255.0.0
    gateway: 172.16.0.1
    DNS: 8.8.8.8
    ```
    
### VM network and services
| IP | hostname | username | os |
| -- | -- | -- | -- |
| 172.16.127.101 | room1-nasa3 | room1 | ubuntu |
| 172.16.127.102 | mail1-nasa3 | mail1 | ubuntu |
| 172.16.127.103 | printu1-nasa3 | printu1 | ubuntu |
| 172.16.127.104 | database1-nasa3 | database1 | ubuntu |
| 172.16.127.105 | iden1-nasa3 | iden1 | arch |
| 172.16.127.106 | printw1-nasa3 | printw1 | windows |
| 172.16.127.107 | room2-nasa3 | room2 | ubuntu |
| 172.16.127.108 | room3-nasa3 | room3 | ubuntu |
| 172.16.127.109 | iden2-nasa3 | iden2 | arch |
| 172.16.127.110 | wifi1-nasa3 | wifi1 | ubuntu |
| 172.16.127.112 | backup | root | proxmox-backup-server |
| 172.16.127.113 | uiux1 | uiux1 | ubuntu |
| 172.16.127.114 | uiux2 | uiux2 | ubuntu |
| 172.16.127.115 | log | logging | ubuntu |
| 172.16.127.116 | mail2 | mail2 | ubuntu |
| 172.16.127.117 | mail3 | mail3 | ubuntu |
### Known issue
- USB NIC instability

Because the BIOS is an old version, some SC machines (sc1, sc2, sc4) use USB network adapters. One known issue is that the USB NIC on sc1 (172.16.127.11) may disconnect on its own shortly after being connected.

To mitigate this issue, USB autosuspend should be disabled through GRUB.

Edit `/etc/default/grub` and add `usbcore.autosuspend=-1` to the kernel command line, e.g.:
```
GRUB_CMDLINE_LINUX_DEFAULT="quiet usbcore.autosuspend=-1"
```
Then update GRUB and reboot so that the kernel parameter takes effect:
```
sudo update-grub
```
- Some machines have fake (stale) `enx...` interfaces, but we have not removed them.
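
To confirm that the USB autosuspend setting above is active after a reboot, the kernel exposes the current value under sysfs. The sketch below assumes the standard path `/sys/module/usbcore/parameters/autosuspend`; the function name `usb_autosuspend_disabled` is hypothetical.

```bash
# Hypothetical helper: succeeds if USB autosuspend is disabled (-1).
# The optional argument overrides the sysfs path, which is handy for testing.
usb_autosuspend_disabled() {
  param_file=${1:-/sys/module/usbcore/parameters/autosuspend}
  # Return 2 if the parameter file cannot be read at all.
  [ -r "$param_file" ] || return 2
  [ "$(cat "$param_file")" = "-1" ]
}
```

On an SC machine this could be used as `usb_autosuspend_disabled || echo "autosuspend is still enabled"`.
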
## Proxmox Cluster Setup

The virtualization platform is built as a three-node Proxmox cluster. The cluster was initially created on `pve1`, and `pve2` and `pve3` were later joined into the same cluster.

The cluster allows all Proxmox nodes to be managed under the same datacenter view and provides the base infrastructure for VM management, shared storage, backup, and monitoring.

### Cluster Overview

| Node | Hostname | IP | Role |
|---|---|---:|---|
| RM1 | `pve1` | `172.16.127.16` | Proxmox cluster node; initial cluster creation node |
| RM2 | `pve2` | `172.16.127.17` | Proxmox cluster node |
| RM3 | `pve3` | `172.16.127.18` | Proxmox cluster node |

All three nodes are connected through the internal network `172.16.0.0/16`.  
The Proxmox cluster communication uses the network interface configured on the Proxmox nodes.

### Cluster Creation

The cluster was created from `pve1`.

After the cluster was created, the other two Proxmox nodes were joined into the cluster:

```text
pve1 creates the cluster
pve2 joins the cluster
pve3 joins the cluster
```
Commands:
```bash
# on pve1: create the cluster
pvecm create rm-nasa
# on pve2 and pve3: join the cluster through pve1
pvecm join 172.16.127.16
```
### Cluster Purpose

The Proxmox cluster is used to:

- manage all Proxmox nodes from a single datacenter view
- host project virtual machines
- provide the base environment for Ceph shared storage
- integrate with Proxmox Backup Server
- provide the foundation for HA and VM migration

HA and VM migration depend on shared storage and are documented in the shared storage section.
### Note
- USB network adapters should not be used for Proxmox cluster communication because they may be unstable. In this setup, USB NIC issues are limited to SC machines and are documented in the Network Design section.
- You can use `pvecm status` to verify the cluster's health.
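
The `pvecm status` check can be scripted. The sketch below assumes the command prints a line starting with `Quorate:` followed by `Yes` or `No`; the function name `is_quorate` is hypothetical.

```bash
# Hypothetical helper: reads `pvecm status` output on stdin and succeeds only
# if the "Quorate:" line reports Yes.
is_quorate() {
  awk '
    $1 == "Quorate:" { found = 1; ok = ($2 == "Yes") }
    END { exit !(found && ok) }
  '
}
```

Usage on any cluster node: `pvecm status | is_quorate && echo "cluster is quorate"`.
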
### Known Issue
- Cluster Reinstallation

During earlier setup, the machine using 172.16.127.17 was replaced.
At that time, old cluster information could not be fully removed from pve1 and pve3, so all three Proxmox nodes were reinstalled and the cluster was rebuilt from scratch.

This issue is important for future maintenance:

- Avoid changing cluster node identity casually.
- Keep hostname and IP address consistent.
- Remove a node cleanly before replacing it.
- Verify cluster state before rejoining a replaced node.
- If old cluster metadata cannot be removed cleanly, reinstallation may be required.

## Shared Storage and HA
Ceph is used as the shared storage backend for the Proxmox cluster.  
All three Proxmox nodes participate in the Ceph cluster, and each Proxmox node provides one Ceph OSD.

A Ceph RBD pool named `vm-data` is used as shared VM disk storage.

### working history
#### setup flow
```
1. Enable Proxmox no-subscription repository
2. Install Ceph on all Proxmox nodes
3. Create Ceph monitors
4. Prepare one OSD disk on each Proxmox node
5. Create OSDs
6. Create the Ceph pool vm-data
7. Set pool size=3 and min_size=2
8. Add vm-data to Proxmox as RBD storage
9. Move VM disks to vm-data when HA/migration is required
10. Enable HA for VMs stored on vm-data
11. Test VM migration and HA failover
```
#### install ceph
Before installing Ceph, the Proxmox no-subscription repositories should be enabled.

Example repository configuration:
```bash
mv /etc/apt/sources.list.d/pve-enterprise.sources /etc/apt/sources.list.d/pve-enterprise.sources.bak
mv /etc/apt/sources.list.d/ceph.sources /etc/apt/sources.list.d/ceph.sources.bak
cat >/etc/apt/sources.list.d/proxmox.sources <<'EOF'
Types: deb
URIs: http://download.proxmox.com/debian/pve
Suites: trixie
Components: pve-no-subscription

Types: deb
URIs: http://download.proxmox.com/debian/ceph-squid
Suites: trixie
Components: no-subscription
EOF
apt update
apt full-upgrade  # on Proxmox, use full-upgrade (dist-upgrade) rather than plain "apt upgrade"
```
When installing Ceph from the Proxmox web UI, choose the `no-subscription` repository.
Then create the monitors in the web GUI, and after that create the OSDs.
#### Install OSD
Before creating an OSD, the selected disk should be cleaned.
```
wipefs -a /dev/nvme1n1
sgdisk --zap-all /dev/nvme1n1
partprobe /dev/nvme1n1
```
Then create the OSD in the web GUI under Ceph -> OSD -> Create OSD.
Create the pool under Datacenter -> Pool -> Create Pool, and check that it shows up under Datacenter -> Storage.

### Purpose

The shared storage layer is used to:

- provide shared VM disk storage across `pve1`, `pve2`, and `pve3`
- allow VMs to be migrated between Proxmox nodes
- support Proxmox HA for selected VMs
- avoid tying VM disks to only one physical host

### Current Ceph Configuration

| Item | Value |
|---|---|
| Ceph nodes | `pve1`, `pve2`, `pve3` |
| OSD layout | one OSD per Proxmox node |
| RBD pool name | `vm-data` |
| Pool size | `3` |
| Pool min_size | `2` |
| Usage | shared VM disk storage |

The pool uses `size=3`, which means Ceph keeps three replicated copies of data.  
The `min_size=2` setting means the pool can continue operating as long as at least two replicas are available.
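
As a small worked example of what `size=3` and `min_size=2` mean in numbers, the sketch below computes the approximate usable capacity and the number of replicas that can be lost while the pool keeps serving I/O. The function names and the 1000 GiB per OSD figure are hypothetical, and Ceph overhead and near-full thresholds are ignored.

```bash
# Usable capacity under replication is roughly raw capacity divided by size.
usable_gib() {   # usage: usable_gib RAW_GIB SIZE
  echo $(( $1 / $2 ))
}

# Number of replicas that can be lost while the pool still accepts I/O.
tolerated_replica_loss() {   # usage: tolerated_replica_loss SIZE MIN_SIZE
  echo $(( $1 - $2 ))
}
```

With three hypothetical 1000 GiB OSDs, `usable_gib 3000 3` prints `1000` and `tolerated_replica_loss 3 2` prints `1`, which matches the one-node-down scenario this cluster is designed for.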
### VM Storage Policy

Most VM disks are stored on the Ceph RBD pool `vm-data`.

Exceptions:

| VM / Service | Storage Policy | Reason |
|---|---|---|
| Proxmox Backup Server | Not stored on RBD | PBS is fixed on `pve2` and uses a dedicated disk/datastore |
| Firewall | Not stored on RBD | Firewall storage is managed separately |
| Other VMs | Stored on RBD | Supports migration and HA |

### HA Policy

HA is enabled for VMs whose disks are stored on shared RBD storage.

In this setup:

- VMs stored on `vm-data` can be managed by Proxmox HA.
- PBS is not HA-enabled because it is fixed on `pve2`.
- Firewall is not included in this RBD-based HA setup.
- HA behavior depends on the VM disk being stored on shared storage.

Before enabling HA for a VM, verify that its disks are stored on `vm-data` instead of local storage.
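
The check above can be sketched as a filter over `qm config <vmid>` output. This assumes disk lines look like `scsi0: vm-data:vm-101-disk-0,size=32G`; the function name `disks_on_storage` and the VM ID `101` used below are hypothetical.

```bash
# Hypothetical helper: reads `qm config <vmid>` output on stdin and succeeds
# only if every disk line points at the given storage (default: vm-data).
# CD-ROM drives (media=cdrom) are skipped.
disks_on_storage() {
  awk -v storage="${1:-vm-data}" '
    /^(scsi|virtio|sata|ide|efidisk|tpmstate)[0-9]+:/ {
      if ($0 ~ /media=cdrom/) next
      disks++
      # The volume must start with "<storage>:".
      if (index($2, storage ":") != 1) bad++
    }
    # Fail if no disk was found or if any disk lives on another storage.
    END { exit !(disks > 0 && bad == 0) }
  '
}
```

Usage on a Proxmox node: `qm config 101 | disks_on_storage && ha-manager add vm:101`.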
### Ceph Components

| Component | Description |
|---|---|
| Ceph MON | Maintains Ceph cluster state |
| Ceph OSD | Stores actual data on physical disks |
| Ceph Pool | Logical storage pool created on top of OSDs |
| RBD | Block storage interface used by Proxmox for VM disks |

### Failure and HA Test

We tested the HA behavior under failure conditions.

Tested scenarios:

1. One Proxmox node lost network connectivity.
2. One Proxmox node was unexpectedly shut down.
3. One disk failed (not tested yet).

Result:

- Proxmox HA was able to detect the failed node.
- HA-managed VMs were restarted on another available Proxmox node.
- VMs using the shared RBD storage were able to continue operating after failover.
- The test confirmed that the Ceph RBD shared storage and Proxmox HA setup work correctly for the tested failure cases.

This confirms that the current shared storage design supports HA failover for VMs stored on `vm-data`.
## Backup

Proxmox Backup Server (PBS) is used as the centralized backup service for virtual machines in the Proxmox cluster.

In this infrastructure, PBS is deployed as a VM on `pve2`. It stores VM backups from the Proxmox cluster and provides backup management through the PBS web interface and Proxmox Datacenter integration.

### Deployment Notes
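We created the PBS VM on `pve2` and selected `/dev/nvme2n1` as its storage.
First wipe the disk, then attach it to the VM by its stable disk ID:
```bash
ls -l /dev/disk/by-id/ | grep nvme   # look up the disk ID
qm set 112 -scsi1 /dev/disk/by-id/nvme-ID   # nvme-ID stands for the ID found above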
```
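Next, log in to the VM and run `lsblk`; the disk shows up as `/dev/sdb`. Create a filesystem on it, mount it, and add it to `/etc/fstab`. Then create the datastore: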
```bash
proxmox-backup-manager datastore create backupstore /mnt/datastore/backupstore
proxmox-backup-manager datastore list   # check that the datastore exists
```
Then, under Datacenter -> Storage, add it as a Proxmox Backup Server storage, and add the backup rule (job) under Backup.
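
The filesystem and fstab step inside the PBS VM can be sketched as follows. The filesystem type `ext4` and the helper name `fstab_entry` are assumptions, not a record of what was actually used; the destructive commands are left as comments on purpose.

```bash
# Hypothetical helper: prints an /etc/fstab entry that mounts a filesystem by UUID.
fstab_entry() {   # usage: fstab_entry UUID MOUNTPOINT [FSTYPE]
  printf 'UUID=%s %s %s defaults 0 2\n' "$1" "$2" "${3:-ext4}"
}

# Steps inside the PBS VM (run as root, destructive for /dev/sdb):
#   mkfs.ext4 /dev/sdb
#   mkdir -p /mnt/datastore/backupstore
#   fstab_entry "$(blkid -s UUID -o value /dev/sdb)" /mnt/datastore/backupstore >> /etc/fstab
#   mount -a
```

Mounting by UUID rather than by `/dev/sdb` keeps the entry valid if the device name changes after a reboot.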

### PBS Overview

| Item | Value |
|---|---|
| Service | Proxmox Backup Server |
| Hostname | `backup` |
| IP | `172.16.127.112` |
| VM ID | `112` |
| VM location | `pve2` |
| HA | Not enabled |
| Datastore name | `backupstore` |
| Datastore path | `/mnt/datastore/backupstore` |
| Backup schedule | Daily at 21:00 |
| Retention policy | Keep backups from the last 7 days |
| Backup scope | All VMs |

### Design Decision

PBS is deployed as a VM on `pve2`.

Because PBS itself is the backup target, it is not stored on the Ceph RBD shared storage and is not managed by Proxmox HA. Instead, PBS uses a dedicated disk attached to the PBS VM as its datastore.

This design keeps backup storage separate from the Ceph shared VM storage layer.

However, PBS is a single point of failure in the current design. If `pve2` is unavailable, PBS will also be unavailable and scheduled backup jobs cannot run until `pve2` and the PBS VM are restored.

### Storage Design

PBS uses a dedicated disk attached to the PBS VM. The disk is mounted inside the PBS system and used as the backup datastore.

```text
pve2
 └── PBS VM: backup / 172.16.127.112
      └── dedicated disk
           └── /mnt/datastore/backupstore
                └── PBS datastore: backupstore
```
#### backup policy
| Item          | Value                             |
| ------------- | --------------------------------- |
| Backup target | PBS datastore `backupstore`       |
| Backup server | `172.16.127.112`                  |
| Schedule      | Daily at 21:00                    |
| Scope         | All VMs                           |
| Retention     | Keep backups from the last 7 days |
### flow
```
All VMs on Proxmox Cluster
        │
        v
Daily backup job at 21:00
        │
        v
PBS VM on pve2
        │
        v
PBS datastore: backupstore
        │
        v
/mnt/datastore/backupstore
```
### Restore Status

Restore has not been tested yet.

This is an important remaining task. A backup system should not be considered fully verified until at least one restore test has been completed successfully.

Recommended restore test:

- Select a non-critical VM or create a temporary test VM.
- Run a manual backup to PBS.
- Restore the VM from PBS.
- Boot the restored VM.
- Verify network connectivity and service availability.
- Document the restore result.
## Log Collection

The logging system is built with Grafana Loki, Grafana, and Grafana Alloy.

Loki is used as the log storage backend, Grafana is used for querying and visualizing logs, and Alloy is deployed on Proxmox nodes as the log collection agent.

### Deployment Notes
Create an Ubuntu VM, then install Loki and Grafana on it:
```bash
sudo mkdir -p /etc/apt/keyrings

wget -q -O - https://apt.grafana.com/gpg.key \
  | gpg --dearmor \
  | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null

echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" \
  | sudo tee /etc/apt/sources.list.d/grafana.list
  
sudo apt update
sudo apt install loki

sudo apt install grafana
sudo systemctl enable grafana-server
sudo systemctl start grafana-server
```
Open `http://172.16.127.115:3000`, log in as admin, and change the password. Then go to Connections -> Add new connection and add Loki with the address `127.0.0.1:3100`.
#### install alloy on pve
```bash
sudo mkdir -p /etc/apt/keyrings
sudo wget -O /etc/apt/keyrings/grafana.asc https://apt.grafana.com/gpg-full.key
sudo chmod 644 /etc/apt/keyrings/grafana.asc
echo "deb [signed-by=/etc/apt/keyrings/grafana.asc] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt-get update
sudo apt-get install alloy
```
- ```/etc/alloy/config.alloy```
```
logging {
  level  = "info"
  format = "logfmt"
}

loki.source.journal "proxmox_journal" {
  max_age       = "12h"
  relabel_rules = discovery.relabel.journal.rules
  forward_to    = [loki.write.default.receiver]

  labels = {
    job     = "systemd-journal",
    cluster = "proxmox-cluster",
  }
}

discovery.relabel "journal" {
  targets = []

  rule {
    source_labels = ["__journal__hostname"]
    target_label  = "host"
  }

  rule {
    source_labels = ["__journal__systemd_unit"]
    target_label  = "unit"
  }

  rule {
    source_labels = ["__journal__transport"]
    target_label  = "transport"
  }

  rule {
    source_labels = ["__journal__priority"]
    target_label  = "priority"
  }
}

loki.write "default" {
  endpoint {
    url = "http://172.16.127.115:3100/loki/api/v1/push"
  }
}
```
Test the pipeline with `logger "test from $(hostname) at $(date)"`.
On the logging server, `/etc/loki/config.yml` is changed to:
```
auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096
  log_level: debug
  grpc_server_max_concurrent_streams: 1000

common:
  instance_addr: 127.0.0.1
  path_prefix: /var/lib/loki
  storage:
    filesystem:
      chunks_directory: /var/lib/loki/chunks
      rules_directory: /var/lib/loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

query_range:
  results_cache:
    cache:
      embedded_cache:
        enabled: true
        max_size_mb: 100

limits_config:
  metric_aggregation_enabled: true
  enable_multi_variant_queries: true
  retention_period: 14d

compactor:
  working_directory: /var/lib/loki/compactor
  retention_enabled: true
  delete_request_store: filesystem

schema_config:
  configs:
    - from: 2020-10-24
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

pattern_ingester:
  enabled: true
  metric_aggregation:
    loki_address: localhost:3100

ruler:
  alertmanager_url: http://localhost:9093

frontend:
  encoding: protobuf


# By default, Loki will send anonymous, but uniquely-identifiable usage and configuration
# analytics to Grafana Labs. These statistics are sent to https://stats.grafana.org/
#
# Statistics help us better understand how Loki is used, and they show us performance
# levels for most users. This helps us prioritize features and documentation.
# For more information on what's sent, look at
# https://github.com/grafana/loki/blob/main/pkg/analytics/stats.go
# Refer to the buildReport method to see what goes into a report.
#
# If you would like to disable reporting, uncomment the following lines:
#analytics:
#  reporting_enabled: false
```
```bash
# enable HTTPS for Grafana: create a self-signed certificate for nginx
sudo mkdir -p /etc/nginx/ssl
sudo openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
  -keyout /etc/nginx/ssl/nasa3log.csie.org.key \
  -out /etc/nginx/ssl/nasa3log.csie.org.crt \
  -subj "/CN=nasa3log.csie.org"
```
```
#/etc/nginx/sites-available/grafana
server {
    listen 80;
    server_name nasa3log.csie.org;

    return 301 https://$host$request_uri;
}

server {
    listen 443 ssl http2;
    server_name nasa3log.csie.org;

    ssl_certificate     /etc/nginx/ssl/nasa3log.csie.org.crt;
    ssl_certificate_key /etc/nginx/ssl/nasa3log.csie.org.key;

    location / {
        proxy_pass http://127.0.0.1:3000;

        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

        proxy_set_header X-Forwarded-Proto https;
        proxy_set_header X-Forwarded-Host $host;

        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}
```
### Overview

| Component | Host | IP | Role |
|---|---|---:|---|
| Grafana | `logging` | `172.16.127.115` | Log visualization and dashboard UI |
| Loki | `logging` | `172.16.127.115` | Log storage backend |
| Grafana Alloy | `pve1`, `pve2`, `pve3` | Proxmox nodes | Collects systemd journal logs and forwards them to Loki |

### Purpose

The logging system is used to:

- collect system logs from Proxmox nodes
- centralize logs in Loki
- query logs from Grafana
- help debug Proxmox, Ceph, HA, and node-level issues
- provide a foundation for future application log integration

### Current Scope

Currently, Alloy is only installed on the three Proxmox nodes:

- `pve1`
- `pve2`
- `pve3`

The current logging scope is limited to Proxmox node systemd journal logs.

| Source | Status |
|---|---|
| Proxmox node systemd journal logs | Enabled |
| Ceph-related logs through journal | Available if written to systemd journal |
| VM application logs | Not yet integrated |
| External application logs | Not yet integrated |

### Architecture

```text
Proxmox Nodes
 pve1 / pve2 / pve3
        │
        │ systemd journal logs
        v
Grafana Alloy
        │
        │ push logs
        v
Loki
172.16.127.115:3100
        │
        │ query
        v
Grafana
172.16.127.115:3000
```
### logging server
| Item           | Value            |
| -------------- | ---------------- |
| Hostname       | `logging`        |
| IP             | `172.16.127.115` |
| OS             | Ubuntu           |
| Grafana port   | `3000`           |
| Loki port      | `3100`           |
| Loki retention | `14 days`        |
### grafana
| Item                      | Status              |
| ------------------------- | ------------------- |
| Loki data source          | Configured          |
| Log query through Explore | Available           |
| Log dashboard             | Not finalized yet   |
| Application log dashboard | Not implemented yet |
### Grafana Alloy

Grafana Alloy is installed on each Proxmox node.

Alloy reads logs from the systemd journal and forwards them to Loki. In this setup, Alloy collects logs from Proxmox nodes and attaches useful labels such as hostname, systemd unit, transport, and priority.

Alloy is not currently installed on project VMs or application servers.
### label
| Label       | Source                    | Meaning                                      |
| ----------- | ------------------------- | -------------------------------------------- |
| `job`       | static label              | Log source type, currently `systemd-journal` |
| `cluster`   | static label              | Cluster name, currently `proxmox-cluster`    |
| `host`      | `__journal__hostname`     | Hostname of the Proxmox node                 |
| `unit`      | `__journal__systemd_unit` | systemd unit name                            |
| `transport` | `__journal__transport`    | Journal transport type                       |
| `priority`  | `__journal__priority`     | Syslog priority level                        |
### Example Queries
In Grafana, open Explore and select the Loki data source.
- Query all Proxmox journal logs: ```{job="systemd-journal"}```
- Search logs containing a keyword: ```{job="systemd-journal"} |= "error"```
- Guide: [query](https://grafana.com/docs/grafana/latest/visualizations/panels-visualizations/query-transform-data/expression-queries/)
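
The same queries can also be run from a shell through Loki's HTTP API, which is convenient when Grafana is not at hand. This is a sketch that assumes Loki is reachable at `172.16.127.115:3100` and that the `host` label from the label table is present; the function names are hypothetical.

```bash
LOKI_URL=${LOKI_URL:-http://172.16.127.115:3100}

# Builds a LogQL selector for one host, optionally filtered by a keyword.
logql_for_host() {   # usage: logql_for_host HOST [KEYWORD]
  if [ -n "${2:-}" ]; then
    printf '{job="systemd-journal", host="%s"} |= "%s"\n' "$1" "$2"
  else
    printf '{job="systemd-journal", host="%s"}\n' "$1"
  fi
}

# Runs the query against Loki and prints the raw JSON response.
loki_query() {   # usage: loki_query HOST [KEYWORD]
  curl -sG "$LOKI_URL/loki/api/v1/query_range" \
    --data-urlencode "query=$(logql_for_host "$@")" \
    --data-urlencode "limit=20"
}
```

For example, `loki_query pve1 error` asks for up to 20 recent journal lines from `pve1` that contain "error".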

### usage
Application logs should be written in JSON format, one object per line, for example `{"time":"2026-05-09 12:00:00","level":"INFO","logger":"app","message":"hello endpoint called"}`.
The `level` field may be `INFO`, `WARNING`, or `ERROR`.
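
As a minimal sketch of the suggested format, a shell script can emit such lines with a small helper. The function name `log_json` is hypothetical, and only backslashes and double quotes in the message are escaped.

```bash
# Hypothetical helper: prints one JSON log line in the suggested format.
# Other control characters (such as newlines) are not handled in this sketch.
log_json() {   # usage: log_json LEVEL LOGGER MESSAGE
  # Escape backslashes first, then double quotes, so the JSON stays valid.
  msg=$(printf '%s' "$3" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
  printf '{"time":"%s","level":"%s","logger":"%s","message":"%s"}\n' \
    "$(date '+%Y-%m-%d %H:%M:%S')" "$1" "$2" "$msg"
}
```

A script would then append to its log file with something like `log_json INFO app "hello endpoint called" >> <path to log file>`.
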
```bash
# Install Grafana Alloy on the host that runs the application
sudo mkdir -p /etc/apt/keyrings

wget -q -O - https://apt.grafana.com/gpg.key \
  | gpg --dearmor \
  | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null

echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" \
  | sudo tee /etc/apt/sources.list.d/grafana.list

# refresh the package index so the new repository is visible
sudo apt update
sudo apt install alloy
```
Then configure Alloy in `/etc/alloy/config.alloy`:
```
logging {
  level  = "info"
  format = "logfmt"
}

loki.source.file "django_app" {
  targets = [
    {
      __path__ = "<path to log file>",
      job      = "<job name>",
      source   = "<source>",
      app      = "<application>",
      env      = "<env>",
      host     = "<host>",
    },
  ]

  forward_to = [loki.write.default.receiver]
}

loki.write "default" {
  endpoint {
    url = "http://172.16.127.115:3100/loki/api/v1/push"
  }
}
```
Then enable and restart Alloy:
```bash
sudo systemctl enable alloy
sudo systemctl restart alloy
journalctl -u alloy -n 100 --no-pager   # check that there are no errors
```
#### Troubleshooting
Alloy may report errors when it does not have permission to read your log file. Make sure the log file is readable by the user that Alloy runs as.
## Monitoring System: Prometheus + node-exporter + Grafana

The monitoring system is built with Prometheus, node-exporter, and Grafana.

Prometheus is used to collect metrics, node-exporter is installed on Proxmox nodes to expose host-level metrics, and Grafana is used to visualize the metrics through dashboards.
### Deployment Notes
On the Ubuntu logging server:
```
sudo apt install -y prometheus
```
Then change `/etc/prometheus/prometheus.yml` to:
```
# Sample config for Prometheus.

global:
  scrape_interval:     15s
  evaluation_interval: 15s

  external_labels:
      monitor: 'example'

alerting:
  alertmanagers:
  - static_configs:
    - targets: ['localhost:9093']

rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

scrape_configs:
  - job_name: 'prometheus'
    scrape_interval: 5s
    scrape_timeout: 5s
    static_configs:
      - targets: ['localhost:9090']

  - job_name: node
    static_configs:
      - targets: ['localhost:9100']

  - job_name: 'proxmox-nodes'
    static_configs:
      - targets:
          - '172.16.127.16:9100'
          - '172.16.127.17:9100'
          - '172.16.127.18:9100'
  - job_name: 'ceph'
    static_configs:
      - targets:
          - '172.16.127.16:9283'
          - '172.16.127.17:9283'
          - '172.16.127.18:9283'
```
On each Proxmox node:
```bash
sudo apt install -y prometheus-node-exporter
# the Ceph mgr module is cluster-wide, so enabling it once is enough
ceph mgr module enable prometheus
```
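
Before looking at Grafana, it can help to check that the scrape targets from the Prometheus config above actually answer. The sketch below assumes the exporters serve metrics under `/metrics` on the listed ports; the function names are hypothetical.

```bash
# Prints the metrics URLs that Prometheus scrapes on the Proxmox nodes:
# node-exporter (9100) and the Ceph mgr prometheus module (9283).
exporter_urls() {
  for ip in 172.16.127.16 172.16.127.17 172.16.127.18; do
    for port in 9100 9283; do
      printf 'http://%s:%s/metrics\n' "$ip" "$port"
    done
  done
}

# Reports which endpoints answer within 3 seconds.
check_exporters() {
  exporter_urls | while read -r url; do
    if curl -sf -m 3 -o /dev/null "$url"; then
      echo "OK   $url"
    else
      echo "FAIL $url"
    fi
  done
}
```

Running `check_exporters` on the logging server prints one `OK` or `FAIL` line per endpoint.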

### Overview

| Component | Host | IP | Role |
|---|---|---:|---|
| Prometheus | `logging` | `172.16.127.115` | Metrics collection and storage |
| Grafana | `logging` | `172.16.127.115` | Metrics dashboard and visualization |
| node-exporter | `pve1`, `pve2`, `pve3` | Proxmox nodes | Exposes host metrics |

### Purpose

The monitoring system is used to:

- monitor Proxmox node status
- collect CPU, memory, disk, and network metrics
- check whether Proxmox nodes are reachable
- provide dashboards for infrastructure status
- support future alerting and troubleshooting

### Current Scope

Currently, node-exporter is installed on the Proxmox nodes.

| Source | Status |
|---|---|
| Proxmox node metrics | Enabled |
| VM metrics | Not yet integrated |
| Application metrics | Not yet integrated |
| Ceph metrics | Not fully integrated yet |
| Alerting | Not yet configured |

### Architecture

```text
Proxmox Nodes
 pve1 / pve2 / pve3
        │
        │ expose metrics on :9100
        v
node-exporter
        │
        │ scraped by Prometheus
        v
Prometheus
172.16.127.115:9090
        │
        │ queried by Grafana
        v
Grafana
172.16.127.115:3000
```
### monitoring server
| Item               | Value            |
| ------------------ | ---------------- |
| Hostname           | `logging`        |
| IP                 | `172.16.127.115` |
| Prometheus port    | `9090`           |
| Grafana port       | `3000`           |
| node-exporter port | `9100`           |
### Check it in Grafana
Dashboards -> system info

## Remote Backup
> We currently do not have remote storage, so these settings will be applied as soon as we get one (planned: Google Drive). The configuration below was tested on my machine.
### architecture
- Use the existing Proxmox Backup Server and `rclone` to sync the encrypted backup files to the remote once every week.
- rclone handles encryption and decryption, and retries automatically if an upload fails.
- Schedule: `prune` at 1:00, `gc` (garbage collection) at 2:00, `rclone` at 3:00 every day.

### configuration
#### google api
```
To be added
Obtain:
client ID
client password
The application needs to be published to avoid token expiry
```

#### pbs

```
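curl https://rclone.org/install.sh | bash
rclone config
# create the Google Drive remote
# at the (verify) prompt answer N, then verify on another machine that has a browser

# create an encrypted (crypt) remote on top of the Google Drive remote
# back up the rclone config file!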
```

#### timer
Edit the crontab with `crontab -e` and add:
```
0 3 * * * /usr/bin/rclone sync /mnt/datastore/backupstore gdrive:pbs_backup --config ~/.config/rclone/rclone.conf --fast-list --checkers 64 --transfers 16 --delete-after --retries 3 --retries-sleep 10s >> /var/log/rclone-cron.log 2>&1
```
- Currently uses `--delete-after`. Be aware that the remote may run out of space, because old files are deleted only after the new ones have been uploaded.
- `--checkers 64 --transfers 16` does not matter much based on local testing

### Known issue
- Be aware of the token expiry problem.