A short howto

What do we mean by “controlling a Hetzner failover IP from Kubernetes”? We run a Pod that checks whether the failover IP responds to pings and, if it doesn’t, triggers a failover of the IP to another host, which then receives the traffic for that failover IP.

Let’s do this!

Create the docker image for the failover-ip-manager pod:

$ cat Dockerfile
# Usage:
#
#     docker run --volume config:/heartbeat/config \
#                --volume logs:/heartbeat/log \
#                -e HEARTBEAT_LOG=STDOUT


# https://hub.docker.com/_/ruby/
FROM ruby:3.0-alpine

# MAINTAINER is deprecated; a label carries the same information
LABEL maintainer="Tomáš Pospíšek <tpo_deb@sourcepole.ch>"

RUN echo "force update 0" && \
    apk update && \
    apk upgrade && \
    apk --update add git

RUN git clone https://github.com/mrkamel/heartbeat

# the clone above lands in /heartbeat
WORKDIR /heartbeat

# install heartbeat dependencies
RUN bundle

COPY heartbeat-api-health-check /

ENTRYPOINT ["sh", "-c"]
CMD ["cd /heartbeat && bin/heartbeat"]

We are using Benjamin Vetter’s nice heartbeat application to do the IP monitoring and the failover for us.
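
Building and publishing the image could look roughly like the sketch below. The image name `registry.example.com/failover-ip-manager` and the helper name `build_and_push_image` are placeholders and not part of the original setup. The sketch also assumes that `heartbeat-api-health-check` sits next to the Dockerfile.

```shell
#!/bin/sh
# Sketch: build the failover-ip-manager image and push it to a registry.
# IMAGE is a hypothetical placeholder; substitute your own registry path.
IMAGE="${IMAGE:-registry.example.com/failover-ip-manager:latest}"

build_and_push_image() {
  # COPY keeps the file mode, so the probe script has to be
  # executable before the build.
  chmod +x heartbeat-api-health-check || return 1
  docker build -t "$IMAGE" . || return 1
  docker push "$IMAGE"
}
```

Run `build_and_push_image` from the directory that contains the Dockerfile, then use the same image name in place of YOUR_IMAGE_HERE in the Deployment.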

Next let’s create the k8s deployment.

One problem with Hetzner’s failover API is that you first have to register all the IPs that are allowed to access Hetzner’s Robot API. The latter is used to trigger the IP failover.
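
A quick way to find out whether a given machine is allowed to talk to the Robot API is to query it by hand. The following is only a sketch: it assumes the `/failover/<ip>` endpoint of the Robot webservice, and `ROBOT_USER`, `ROBOT_PASSWORD`, `FAILOVER_IP` and the two function names are placeholders you have to fill in yourself.

```shell
#!/bin/sh
# Sketch: ask the Robot API about the failover IP and report the HTTP status.
# ROBOT_USER, ROBOT_PASSWORD and FAILOVER_IP are placeholders for your values.
ROBOT_BASE_URL="${ROBOT_BASE_URL:-https://robot-ws.your-server.de}"

robot_api_status() {
  # -s: quiet, -o /dev/null: drop the body, -w: print only the status code
  curl -s -o /dev/null -w '%{http_code}' \
    -u "${ROBOT_USER}:${ROBOT_PASSWORD}" \
    "${ROBOT_BASE_URL}/failover/${FAILOVER_IP}"
}

robot_api_reachable() {
  # Succeeds only when the API answered with HTTP 200
  [ "$(robot_api_status)" = "200" ]
}
```

Run `robot_api_status` from the machine whose IP you registered with Hetzner. Any status other than 200 deserves a closer look.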

Kubernetes clusters are inherently dynamic, so the pod that controls the IP might get restarted on a different node with a different IP, or a new node might get created with a different IP. Thus we need to make sure that:

  1. the failover-ip-manager pod won’t switch nodes randomly:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: failover-ip-manager
      namespace: [...]
      labels:
        app: failover-ip-manager
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: failover-ip-manager
      template:
        metadata:
          name: failover-ip-manager
          labels:
            app: failover-ip-manager
        spec:
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  - key: kubernetes.io/hostname
                    operator: In
                    values:
                    - THE_NODE_WHERE_YOU_WANT_TO_PIN_THE_POD_TO
    
  2. we notice if the failover-ip-manager pod, for whatever reason, gets a different IP that may not be registered with Hetzner.

    So let’s make a readinessProbe for k8s that periodically checks whether we can access the Hetzner Robot API:

     spec:
       affinity:
         [...]
       containers:
       - image: YOUR_IMAGE_HERE
         name: failover-ip-manager
         env:
           - name: HEARTBEAT_LOG
             value: "STDOUT"
         volumeMounts:
         - name: heartbeat-config
           mountPath: /heartbeat/config/heartbeat.yml
           # mount single file
           subPath: heartbeat.yml
         - name: heartbeat-api-health-check-config
           mountPath: /heartbeat/heartbeat-api-health-check.yml
           # mount single file
           subPath: heartbeat-api-health-check.yml
         resources:
           requests:
             memory: 45M # this should suffice
         readinessProbe:
           failureThreshold: 1
           exec:
             command:
             - /heartbeat-api-health-check
           # periodSeconds seems to be broken (see the issue linked in the
           # heartbeat-api-health-check script); the delay is implemented
           # in that script instead
           #periodSeconds: 21600 # every 6h
           periodSeconds: 10
           timeoutSeconds: 10
    

    Next we need the script that implements the readinessProbe (the Dockerfile above copies it into the image):

    #!/bin/sh
    #
    # This is meant as a readinessProbe for k8s. It
    # checks whether heartbeat can access the
    # Hetzner Failover API
    #
    
    # We only want to check once every 6 hours whether the API is available.
    # This should be done via a k8s readinessProbe, which however doesn't
    # work, see https://github.com/kubernetes/kubernetes/issues/99979
    # below is a workaround for the readinessProbe bug, in that we implement
    # the 6h interval here
    check_interval=360 # in minutes (for find -mmin), i.e. 6h
    
    cd /heartbeat
    
    check_and_touch_last_check_result() {
      # you want to mount heartbeat-api-health-check.yml into the image!
      if HEARTBEAT_LOG=STDOUT bin/heartbeat --config heartbeat-api-health-check.yml | grep -q "Unable to retrieve the active server ip"; then
        echo 1 > /tmp/last_check_result
        exit 1
      else
        echo 0 > /tmp/last_check_result
        exit 0
      fi
    }
    
    if [ -e /tmp/last_check_result ]; then
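      # [[ ]] is a bashism; use the POSIX test since the script runs under /bin/sh
      if [ -n "$(find /tmp/last_check_result -mmin +"$check_interval" -print)" ]; then
        # the cached result is older than check_interval minutes: check again
        check_and_touch_last_check_result
      else
        # otherwise report the cached result
        exit "$(cat /tmp/last_check_result)"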
      fi
    else
      check_and_touch_last_check_result
    fi
    

    Finally, prepare the heartbeat-api-health-check.yml config file for heartbeat. With it, heartbeat pings an unreachable IP address once and then accesses Hetzner’s Robot API. If that access is not allowed, the readinessProbe fails and the Pod gets marked as “Ready: False”, which you can get alerted on via your preferred monitoring tool or just check manually:

    base_url: https://robot-ws.your-server.de
    basic_auth:
      username: username
      password: password
    failover_ip: 0.0.0.0
    
    ping_ip: 0.0.0.1 # invalid IP!!!
    
    ips:
      - ping: 1.1.1.1
        target: 1.1.1.1
      - ping: 2.2.2.2
        target: 2.2.2.2
    
    timeout: 1
    tries: 1
    only_once: true
    dry: true
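
Once everything is deployed, the manual check of the pod’s placement and readiness can be sketched as follows. The function names are made up for illustration, and the namespace is passed as an argument because it depends on your cluster. The label selector is the one used in the Deployment above.

```shell
#!/bin/sh
# Sketch: inspect the failover-ip-manager pod after deploying.
# $1 is the namespace the Deployment was created in.

# Show the pod together with the node it was scheduled on.
show_pod_placement() {
  kubectl get pods -n "$1" -l app=failover-ip-manager -o wide
}

# Print the status of the pod's Ready condition ("True" or "False").
pod_ready_status() {
  kubectl get pods -n "$1" -l app=failover-ip-manager \
    -o jsonpath='{.items[*].status.conditions[?(@.type=="Ready")].status}'
}
```

If `pod_ready_status` prints `False`, the probe script could not reach the Robot API, and the IP the pod is using is the first thing to check.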