How I Implemented Observability in my Homelab
For a very long time, I have had a single homeserver running important services. In 2020, I added DNS resolution to the list of services it provided, allowing domains with nonexistent TLDs to resolve, as well as blackholing DNS requests tied to advertising and tracking.
This setup worked well for a while. Its shortcoming became apparent whenever I performed maintenance on the homeserver itself: the blackhole DNS server going offline led to a degraded experience for every device connected to my home LAN and WAN.
Therefore, I decided to expand my homelab with 4 new RockPi nodes.
I quickly learned that running docker container logs from the command line just doesn't scale well with swarm deployments, so I searched for a better way to do this.
Part 1: Basic Observability
At first, I needed something minimally intrusive just so that I could check on the status of nodes and deployments. I was already using Grafana in a single-node setup, but that node was not configured to receive logs from other nodes.
I tried Swarmpit and it worked, so I stuck with it for a few days to see if it was good enough to be an interim replacement for Grafana. It was definitely useful for looking at the individual nodes that services ran on, but it was hard to get a view of failures affecting the same service across multiple nodes.
Unsatisfied with the user experience of Swarmpit, I relegated it to being used only while debugging the Grafana setup.
Part 2: Enabling Queryable Logs
I had already set up Grafana, Promtail, and Loki once before, so I was able to reuse the existing configuration. I used the following daemon.json configuration to have Docker write all container logs to file:
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3",
    "tag": "{{.ImageName}}|{{.Name}}|{{.ImageFullID}}|{{.FullID}}"
  }
}
Then, I used the following Promtail configuration to parse the container logs:
scrape_configs:
  - job_name: containers
    static_configs:
      - targets:
          - localhost
        labels:
          job: containerlogs
          __path__: /var/lib/docker/containers/*/*log
    pipeline_stages:
      - json:
          expressions:
            stream: stream
            attrs: attrs
            tag: attrs.tag
            log: log
      - regex:
          expression: (?P<image_name>(?:[^|@]*[^|@]))@(?P<image_sha>(?:[^|]*[^|]))\|(?P<stack_name>(?:[^_]*[^_]))_(?P<service_name>(?:[^\.]*[^\.]))\.(?P<replica>(?:[^|]*[^|]))\.(?P<task_id>(?:[^|]*[^|]))\|(?P<image_id>(?:[^|]*[^|]))
          source: "tag"
      - labels:
          stack_name:
          service_name:
          image_name:
          task_id:
          replica:
          stream:
      - template:
          source: message
          template: '{{ omit (mustFromJson .Entry) "attrs" "filename" "stream" | mustToJson }}'
      - labeldrop:
          - filename
          - log
      - output:
          source: message
      - pack:
          labels:
            - image_name
            - task_id
            - replica
            - stream
The distinction between which labels are passed to the pack stage is important. The fields in the pack section have unbounded cardinality; due to how Loki manages logs, the overall number of streams would be very high if they were labels instead. This is still useful information to have (e.g. for differentiating task instances on the same node), so it's packed into the log line instead.
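The packed fields remain usable at query time. As a minimal sketch (the stack name, service name, and replica value here are illustrative, not real services from my setup), a query can unpack them and filter on a specific task replica:
{stack_name="media", service_name="downloader"} | unpack | replica = "2"
After unpack, the embedded fields behave like labels for the duration of the query without contributing to stream cardinality.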
This solution worked well, but there was still one point of friction: it was hard to tell, at a glance, which logs were coming from which nodes. I anticipated a need to view logs by node (in the case of a single node experiencing a failure), so I added the hostname label to them with the following config:
clients:
  - url: <loki_host_url>/loki/api/v1/push
    external_labels:
      hostname: ${HOSTNAME}
After all of this, the logging behaved as expected. Below is a sample query.
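Something along these lines (with an illustrative hostname value) pulls the container logs for a single node and narrows them with a line filter:
{hostname="rock01", job="containerlogs"} |= "error"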
Part 3: Automating Setup
Satisfied with the behavior of the log collection, I now had to figure out a way to easily implement this on all nodes, with minimal setup. After some research, I decided that Ansible would be my best bet.
The setup steps that I determined were necessary are as follows:
- change the default hostname
- enable ssh login via key
- mount the SSD
- update resolv.conf to use 192.168.1.1 as the nameserver
- install basic packages for the workloads (wget, youtube-dl, etc)
- install Docker
- update Docker's daemon.json to write to file in the format outlined in part 2
Luckily, Ansible has built-in modules that can handle all of these things.
I created separate playbooks for each logical grouping of operations, and a config file specifying the inventory_hostname that each node would take, along with their passwords.
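A minimal sketch of one such playbook, assuming a host group named rockpis and showing only the hostname and resolv.conf steps (task details are illustrative, not copied from my repo):
# illustrative playbook: set hostname from inventory and point DNS at the blackhole
- hosts: rockpis
  become: true
  tasks:
    - name: Set the hostname to the inventory name
      ansible.builtin.hostname:
        name: "{{ inventory_hostname }}"

    - name: Use the homeserver as the nameserver
      ansible.builtin.copy:
        dest: /etc/resolv.conf
        content: "nameserver 192.168.1.1\n"
Each playbook is then applied with ansible-playbook -i <inventory> <playbook>.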
One benefit that I did not anticipate was that, by using RockPis, I could very quickly re-flash the entire OS and then re-run an Ansible playbook. This was very useful for verifying that the full series of playbooks worked in tandem.
Part 4: Collecting Metrics
Collecting logs is great, but when it's the only point of visibility, the only information that can be gleaned at a glance is whether a node is up or not. After some more research, I settled on cAdvisor and node_exporter to expose metrics, and Prometheus to collect them. Conveniently, prebuilt Grafana dashboards exist for both.
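Exporters like these are typically deployed as global Swarm services so that every node runs exactly one task of each. A minimal sketch of such a stack file, with assumed image tags and mounts rather than my exact deployment:
# illustrative monitoring stack: one cAdvisor and one node_exporter task per node
version: "3.8"
services:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    deploy:
      mode: global
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
  node-exporter:
    image: prom/node-exporter:latest
    deploy:
      mode: global
    command:
      - "--path.procfs=/host/proc"
      - "--path.sysfs=/host/sys"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
Deploying it with docker stack deploy -c <stack_file> monitoring yields service names like monitoring_node-exporter, which is what the scrape configuration below matches on; Prometheus just needs to share an overlay network with these tasks in order to reach them.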
I then created scrape_configs in order to collect the metrics. In this case, solar is a node with its Docker socket exposed over HTTP, and it has manager status within the cluster.
This configuration allows the instance label to be filled with the node hostname, increasing ease of use within the dashboards.
- job_name: 'cadvisor'
  dockerswarm_sd_configs:
    - host: "http://solar.infra.lab.qq:2375"
      role: "tasks"
      port: 8080
  relabel_configs:
    - source_labels: [__meta_dockerswarm_service_name]
      regex: ".*cadvisor.*"
      action: "keep"
    - source_labels: [__meta_dockerswarm_node_hostname]
      target_label: "instance"
- job_name: 'node-exporter'
  dockerswarm_sd_configs:
    - host: "http://solar.infra.lab.qq:2375"
      role: "tasks"
      port: 9100
  relabel_configs:
    - source_labels: [__meta_dockerswarm_service_name]
      regex: ".*monitoring_node-exporter.*"
      action: "keep"
    - source_labels: [__meta_dockerswarm_node_hostname]
      target_label: "instance"
Part 5: Alerting
With metrics being scraped, I now needed a mechanism to send alerts. I settled on a simple check of the up metric on the node-exporter scrape job.
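A minimal sketch of that check as a Prometheus alerting rule, with an assumed group name, hold duration, and severity label:
groups:
  - name: node-alerts
    rules:
      - alert: NodeDown
        # fires when a node-exporter target has been unreachable for the hold duration
        expr: up{job="node-exporter"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "node_exporter target {{ $labels.instance }} is down"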
I defined alerts.node-down.title and connected it to a Discord webhook.
{{ define "alerts.node-down.title" }}
{{- if .Alerts.Firing }}Node{{ if gt (len .Alerts.Firing) 1 }}s{{ end }} down! [{{- range $index, $alert := .Alerts.Firing -}}
{{ if gt $index 0 }}, {{ end }}{{ .Labels.instance }}
{{- end }}]
{{- end }}
{{ end }}
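A minimal sketch of wiring the template into a Discord webhook, assuming Alertmanager's built-in Discord receiver (v0.25+) rather than a Grafana contact point, and with a placeholder webhook URL:
receivers:
  - name: discord
    discord_configs:
      # webhook URL is a placeholder; the title reuses the template defined above
      - webhook_url: <discord_webhook_url>
        title: '{{ template "alerts.node-down.title" . }}'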
The resulting webhook notification looked like this:
Summary
To test the entire setup, end-to-end, I manually took a node offline (by unplugging it) and received an alert within one minute. With that, I considered the observability setup to be successful!
The overall observability architecture ended up looking like this:
graph LR
  promtail
  prometheus
  loki
  mimir
  minio
  prometheus ---> mimir --/mimir--> minio
  promtail --> loki --/loki--> minio
  grafana
  loki --> grafana
  prometheus --> grafana