How I Implemented Observability in my Homelab

For a very long time, I have had a single homeserver running important services. In 2020, I added DNS resolution to the list of services it provided, allowing domains with nonexistent TLDs to resolve, as well as blackholing DNS requests tied to advertising and tracking.

This setup worked well for a while. Its shortcoming became apparent whenever I performed maintenance on the homeserver itself: the blackhole DNS server going offline led to a degradation in user experience for all devices connected to my home LAN and WAN.

Therefore, I decided to expand my homelab with 4 new RockPi nodes. I quickly learned that running docker container logs from the command line just doesn't scale well with swarm deployments, so I searched for a better way to do this.

Part 1: Basic Observability

At first, I needed something minimally intrusive just so that I could check on the status of nodes and deployments. I was already using Grafana in a single-node setup, but that node was not configured to receive logs from other nodes.

I tried Swarmpit and it worked, so I stuck with it for a few days to see if it was good enough to be an interim replacement for Grafana. It was definitely useful for looking at the individual nodes that services were running on, but it was hard to get a view of failures happening with the same service across multiple nodes.

Unsatisfied with the user experience of Swarmpit, I relegated it to being used only while debugging the Grafana setup.

Part 2: Enabling Queryable Logs

I had already set up Grafana, Promtail, and Loki once before, so I was able to re-use the existing configuration. I used the following Docker daemon configuration (daemon.json) to write all container logs to file, with a tag that encodes image and container metadata:

{
    "log-driver": "json-file",
    "log-opts": {
      "max-size": "10m",
      "max-file": "3",
      "tag": "{{.ImageName}}|{{.Name}}|{{.ImageFullID}}|{{.FullID}}"
    }  
}
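
With this in place, every line in /var/lib/docker/containers/&lt;container-id&gt;/&lt;container-id&gt;-json.log is a JSON object along these lines (the values here are illustrative, not real output); the tag field is what the Promtail pipeline below takes apart:

{
  "log": "[INFO] example log line\n",
  "stream": "stdout",
  "time": "2024-01-01T00:00:00.000000000Z",
  "attrs": {
    "tag": "myimage:latest@sha256:0a1b2c|mystack_myservice.1.abc123|sha256:3d4e5f|6a7b8c9d"
  }
}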

Then, I used the following Promtail scrape configuration to parse the container logs and turn the tag into labels:

scrape_configs:

- job_name: containers
  static_configs:
  - targets:
      - localhost
    labels:
      job: containerlogs
      __path__: /var/lib/docker/containers/*/*log
  pipeline_stages:
    # Parse the json-file output and pull out the stream, the raw log text, and
    # the tag written by the log driver.
    - json:
        expressions:
          stream: stream
          attrs: attrs
          tag: attrs.tag
          log: log
    # Split the tag ({{.ImageName}}|{{.Name}}|{{.ImageFullID}}|{{.FullID}}) back
    # into its components.
    - regex:
        expression: (?P<image_name>(?:[^|@]*[^|@]))@(?P<image_sha>(?:[^|]*[^|]))\|(?P<stack_name>(?:[^_]*[^_]))_(?P<service_name>(?:[^\.]*[^\.]))\.(?P<replica>(?:[^|]*[^|]))\.(?P<task_id>(?:[^|]*[^|]))\|(?P<image_id>(?:[^|]*[^|]))
        source: "tag"
    # Promote the extracted fields to labels.
    - labels:
        stack_name:
        service_name:
        image_name:
        task_id:
        replica:
        stream:
    # Rebuild the log line without the fields that have already been extracted.
    - template:
        source: message
        template: '{{ omit (mustFromJson .Entry) "attrs" "filename" "stream" | mustToJson }}'
    - labeldrop:
        - filename
        - log
    - output:
        source: message
    # Move the unbounded-cardinality fields into the log line itself (see below).
    - pack:
        labels:
          - image_name
          - task_id
          - replica
          - stream

The distinction between which fields stay as labels and which are passed to the pack stage is important. The fields in the pack section have unbounded cardinality; because Loki creates a separate stream for every unique label combination, keeping them as labels would explode the overall number of streams. The information is still useful to have (e.g. for differentiating task instances on the same node), so it's embedded in the log line instead.
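
Concretely, the pack stage turns each stored line into a small JSON object with the original entry under _entry. An illustrative (made-up) example of what ends up in Loki:

{
  "_entry": "{\"log\":\"[INFO] example log line\\n\",\"time\":\"2024-01-01T00:00:00.000000000Z\"}",
  "image_name": "myimage:latest",
  "replica": "1",
  "stream": "stdout",
  "task_id": "abc123"
}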

This solution worked well, but there was still one point of friction: it was hard to tell, at a glance, which logs were coming from which node. Since I anticipated a need to view logs by node (for instance, when a single node experiences a failure), I added a hostname label to every log line via Promtail's client configuration:

clients:
  - url: <loki_host_url>/loki/api/v1/push
    external_labels:
      # ${HOSTNAME} is expanded at startup; Promtail must be run with
      # -config.expand-env=true for this substitution to happen.
      hostname: ${HOSTNAME}

After all of this, the logging behaved as expected, and container logs from every node were queryable in Grafana.
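
As an illustration (the label values here are made up rather than copied from my dashboards), a query like the following pulls one service's logs from a single node and recovers the packed fields with LogQL's unpack parser:

{job="containerlogs", service_name="dns", hostname="node1"} | unpack | replica="1"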

Part 3: Automating Setup

Satisfied with the behavior of the log collection, I now had to figure out a way to easily implement this on all nodes, with minimal setup. After some research, I decided that Ansible would be my best bet.

The setup steps that I determined were necessary are as follows:

Luckily, Ansible has builtin modules that can handle all of these things. I created separate playbooks for each logical grouping of operations, and a config file specifying the inventory_hostname that the nodes would take, along with their passwords.
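
A minimal sketch of what one of these playbooks can look like (not the exact playbook used here; the source path is illustrative), distributing the daemon.json from Part 2 using only builtin modules:

- name: Configure Docker json-file logging
  hosts: all
  become: true
  tasks:
    - name: Install the daemon.json from Part 2
      ansible.builtin.copy:
        src: files/daemon.json        # illustrative path within the repo
        dest: /etc/docker/daemon.json
        mode: "0644"
      notify: Restart docker

  handlers:
    - name: Restart docker
      ansible.builtin.systemd:
        name: docker
        state: restarted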

One benefit that I did not anticipate was that, with the RockPis, I could very quickly re-flash the entire OS and then re-run an Ansible playbook. This was very useful for verifying that the full series of playbooks worked in tandem.

Part 4: Collecting Metrics

Collecting logs is great, but when they are the only source of visibility, the only information that can be gleaned at a glance is whether a node is up or not. After some more research, I settled on cAdvisor and node_exporter to expose container and host metrics, and Prometheus to collect them. Conveniently, there are ready-made Grafana dashboards for both.
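
Both exporters need one instance on every node, which in swarm terms means global services. A sketch of such a stack file follows; the actual file isn't reproduced here, and the stack name (monitoring) is assumed from the node-exporter regex in the scrape config below.

version: "3.8"
services:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    deploy:
      mode: global              # one task per node
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
  node-exporter:
    image: prom/node-exporter:latest
    command:
      - "--path.procfs=/host/proc"
      - "--path.sysfs=/host/sys"
    deploy:
      mode: global
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro

Deployed with docker stack deploy -c monitoring.yml monitoring, the services show up in swarm service discovery as monitoring_cadvisor and monitoring_node-exporter, which is what the keep rules below match on; Prometheus has to share an overlay network with them to reach the task IPs.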

I then created scrape_configs to collect the metrics. In this case, solar is a manager node with its Docker socket exposed over HTTP, which is what dockerswarm_sd_configs needs for swarm service discovery. The relabeling fills the instance label with the node hostname, which makes the dashboards much easier to navigate.

scrape_configs:
  - job_name: 'cadvisor'
    dockerswarm_sd_configs:
      - host: "http://solar.infra.lab.qq:2375"
        role: "tasks"
        port: 8080
    relabel_configs:
      # Keep only the cAdvisor tasks discovered in the swarm.
      - source_labels: [__meta_dockerswarm_service_name]
        regex: ".*cadvisor.*"
        action: "keep"
      # Use the node hostname as the instance label.
      - source_labels: [__meta_dockerswarm_node_hostname]
        target_label: "instance"

  - job_name: 'node-exporter'
    dockerswarm_sd_configs:
      - host: "http://solar.infra.lab.qq:2375"
        role: "tasks"
        port: 9100
    relabel_configs:
      - source_labels: [__meta_dockerswarm_service_name]
        regex: ".*monitoring_node-exporter.*"
        action: "keep"
      - source_labels: [__meta_dockerswarm_node_hostname]
        target_label: "instance"
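
The dockerswarm_sd_configs entries assume that solar's Docker API is reachable over plain TCP on port 2375. A sketch of how that can be enabled in solar's daemon.json (the socket is unauthenticated, so it should only ever be reachable from the local network; on systemd distros the unit's -H flag may also need adjusting):

{
  "hosts": ["unix:///var/run/docker.sock", "tcp://0.0.0.0:2375"]
}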


Part 5: Alerting

With metrics being scraped, I now needed a mechanism to send alerts. I settled on a simple check of the up metric on the node-exporter scrape job, defined a notification title template, alerts.node-down.title, and connected it to a Discord webhook.
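
Expressed as a Prometheus-style alerting rule, the check looks roughly like this (the rule name, severity label, and for duration are illustrative; the equivalent can also be defined as a Grafana-managed alert):

groups:
  - name: node-health
    rules:
      - alert: NodeDown
        expr: up{job="node-exporter"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.instance }} is down"

The notification title for the resulting alert is rendered by the template below: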

{{ define "alerts.node-down.title" }}
{{- if .Alerts.Firing }}Node{{ if gt (len .Alerts.Firing) 1 }}s{{ end }} down! [{{- range $index, $alert := .Alerts.Firing -}}

{{ if gt $index 0 }}, {{ end }}{{ .Labels.instance }}
{{- end }}]
{{- end }}
{{ end }}

This resulted in a Discord notification naming each node that was down.

Summary

To test the entire setup, end-to-end, I manually took a node offline (by unplugging it) and received an alert within one minute. With that, I considered the observability setup to be successful!

The overall observability architecture ended up looking like this:

graph LR

promtail
prometheus
loki
mimir
minio
grafana

prometheus ---> mimir --/mimir--> minio
promtail --> loki --/loki--> minio

loki --> grafana
prometheus --> grafana