Configuring Consul DNS Forwarding in Ubuntu 16.04

29 May 2019

One of the advantages of using Consul for service discovery is that besides an HTTP API, you can also query it by DNS.

The DNS server is listening on port 8600 by default, and you can query both A records and SRV records from it. SRV records are useful as they contain additional properties (priority, weight and port), and you can get multiple records back from a single query, letting you do client-side load balancing:

$ dig @localhost -p 8600 consul.service.consul SRV +short

1 10 8300 vagrant1.node.dc1.consul.
1 14 8300 vagrant2.node.dc1.consul.
2 100 8300 vagrant3.node.dc1.consul.
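As a sketch of what a client could do with this output (real SRV handling picks the lowest priority and then makes a weighted *random* choice among those records; simply taking the largest weight is a simplification for illustration):

```shell
# The three SRV records from the dig output above: "priority weight port target"
records='1 10 8300 vagrant1.node.dc1.consul.
1 14 8300 vagrant2.node.dc1.consul.
2 100 8300 vagrant3.node.dc1.consul.'

# Sort by priority ascending, then weight descending, and take the first record
selected=$(echo "$records" | sort -k1,1n -k2,2nr | head -n1 | awk '{print $4 ":" $3}')
echo "$selected"   # vagrant2.node.dc1.consul.:8300
```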

A records are also useful, as they mean we should be able to treat services registered in Consul like any other domain - but it doesn’t work:

$ curl http://consul.service.consul:8500
curl: (6) Could not resolve host: consul.service.consul

The reason for this is that the system’s built-in DNS resolver doesn’t know how to query Consul. We can, however, configure it to forward any *.consul requests to Consul.

Solution - Forward DNS queries to Consul

As I usually target Ubuntu based machines, this means configuring systemd-resolved to forward to Consul. However, we want to keep Consul listening on its default port (8600), and systemd-resolved can only forward requests to port 53, so we also need to configure iptables to redirect the requests.

The steps are as follows:

  1. Configure systemd-resolved to forward .consul TLD queries to the local consul agent
  2. Configure iptables to redirect 53 to 8600

So let’s get to it!

1. Make iptables persistent

iptables configuration changes don’t persist through reboots, so the easiest way to solve this is with the iptables-persistent package.

Typically I am scripting machines (using [Packer] or [Vagrant]), so I configure the install to be non-interactive:

echo iptables-persistent iptables-persistent/autosave_v4 boolean false | sudo debconf-set-selections
echo iptables-persistent iptables-persistent/autosave_v6 boolean false | sudo debconf-set-selections

sudo DEBIAN_FRONTEND=noninteractive apt install -yq iptables-persistent

2. Update Systemd-Resolved

The file to change is /etc/systemd/resolved.conf. By default it looks like this:

[Resolve]
#DNS=
#FallbackDNS= 2001:4860:4860::8888 2001:4860:4860::8844
#Domains=

We need to change the DNS and Domains lines - either editing the file by hand, or scripting a replacement with sed:

sudo sed -i -e 's/#DNS=/DNS=' -e 's/#Domains=/Domains=~consul/' /etc/systemd/resolved.conf

The result of which is the file now reading like this:

[Resolve]
FallbackDNS= 2001:4860:4860::8888 2001:4860:4860::8844

By specifying the Domains as ~consul, we are telling systemd-resolved to forward requests for the consul TLD to the server specified in the DNS line.

3. Configure Resolvconf too

For compatibility with some applications (e.g. curl and ping), we also need to update /etc/resolv.conf to specify our local nameserver - but not by editing the file directly!

Instead, we need to add a nameserver line to /etc/resolvconf/resolv.conf.d/head. Again, I will script this; as we need sudo to write to the file, the easiest way is to use tee to append the line, and then run resolvconf -u to apply the change:

echo "nameserver" | sudo tee --append /etc/resolvconf/resolv.conf.d/head
sudo resolvconf -u

4. Configure iptables

Finally, we need to configure iptables so that when systemd-resolved sends a DNS query to localhost on port 53, it gets redirected to port 8600. We’ll do this for both TCP and UDP requests, and then use netfilter-persistent to make the rules persistent:

sudo iptables -t nat -A OUTPUT -d localhost -p udp -m udp --dport 53 -j REDIRECT --to-ports 8600
sudo iptables -t nat -A OUTPUT -d localhost -p tcp -m tcp --dport 53 -j REDIRECT --to-ports 8600

sudo netfilter-persistent save


First, we can test that both Consul and Systemd-Resolved return an address for a consul service:

$ dig @localhost -p 8600 consul.service.consul +short

$ dig @localhost consul.service.consul +short

And now we can use curl to verify that we can still resolve both consul domains and normal domains:

$ curl -s -o /dev/null -w "%{http_code}\n" http://consul.service.consul:8500/ui/

$ curl -s -o /dev/null -w "%{http_code}\n"


There are also guides available on how to do this on Hashicorp’s website, covering other DNS resolvers too (such as BIND, Dnsmasq, Unbound).

infrastructure, consul


Running a Secure RabbitMQ Cluster in Nomad

06 Apr 2019

Last time I wrote about running a RabbitMQ cluster in Nomad, one of the main pieces of feedback I received was about the (lack of) security of the setup, so I decided to revisit this, and write about how to launch a secure RabbitMQ node in Nomad.

The things I want to cover are:

  • Username and Password for the management UI
  • Secure value for the Erlang Cookie
  • SSL for Management and AMQP

As usual, the demo repository with all the code is available if you’d rather just jump into that.

Configure Nomad To Integrate With Vault

To manage the certificates and credentials I will use another Hashicorp tool called Vault, which provides Secrets As A Service. It can be configured for High Availability, but for the demo, we will just use a single instance on one of our Nomad machines.


We’ll update the Vagrant script used in the last post about Nomad Rabbitmq Clustering to add in a single Vault node. This is not suitable for using Vault in production; for that there should be a separate Vault cluster running somewhere, but as this post is focusing on how to integrate with Vault, a single node will suffice.

Once we have Vault installed (see the script), we need to set up a few parts. First is a PKI (public key infrastructure), better known as a Certificate Authority (CA). We will generate a single root certificate and have our client machines (and optionally the host machine) trust that one certificate.

As the machines are running in Hyper-V with the Default Switch, we can use the inbuilt domain name and provide our own certificates. This script is run as part of the Server (nomad1) provisioning script, but in a production environment it would be outside of this scope.

vault secrets enable pki
vault secrets tune -max-lease-ttl=87600h pki

vault write -field=certificate pki/root/generate/internal common_name="$domain" ttl=87600h \
    > /vagrant/vault/mshome.crt

vault write pki/config/urls \
    issuing_certificates="$VAULT_ADDR/v1/pki/ca"

vault write pki/roles/rabbit \
    allowed_domains="$domain" \
    allow_subdomains=true \
    generate_lease=true

sudo cp /vagrant/vault/mshome.crt /usr/local/share/ca-certificates/mshome.crt
sudo update-ca-certificates

If you don’t want scary warning screens in Firefox and Chrome, you’ll need to install the mshome.crt certificate into your host’s trust store too.

Next up, we have some policies we need in Vault. The first deals with what the Nomad Server(s) are allowed to do - namely to handle tokens for itself, and anything in the nomad-cluster role. A full commented version of this policy is available here.

path "auth/token/create/nomad-cluster" {
  capabilities = ["update"]

path "auth/token/roles/nomad-cluster" {
  capabilities = ["read"]

path "auth/token/lookup-self" {
  capabilities = ["read"]

path "auth/token/lookup" {
  capabilities = ["update"]

path "auth/token/revoke-accessor" {
  capabilities = ["update"]

path "sys/capabilities-self" {
  capabilities = ["update"]

path "auth/token/renew-self" {
  capabilities = ["update"]

As this policy mentions the nomad-cluster role a few times, let’s have a look at that also:

  "disallowed_policies": "nomad-server",
  "explicit_max_ttl": 0,
  "name": "nomad-cluster",
  "orphan": true,
  "period": 259200,
  "renewable": true

This allows a fairly long-lived token to be created, which can be renewed. It is also limiting what the tokens are allowed to do, which can be done as either a block list (disallowed_policies) or an allow list (allowed_policies). In this case, I am letting the Clients access any policies except the nomad-server policy.
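For reference, the period above is expressed in seconds - a quick check shows it works out to a three-day lifetime between renewals:

```shell
# 259200 seconds converted to hours
hours=$((259200 / 60 / 60))
echo "$hours hours"   # 72 hours, i.e. 3 days
```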

We can install both of these into Vault:

vault policy write nomad-server /vagrant/vault/nomad-server-policy.hcl
vault write auth/token/roles/nomad-cluster @/vagrant/vault/nomad-cluster-role.json


Now that Vault is up and running, we should configure Nomad to talk to it. This is done in two places - the Server configuration, and the Client configuration.

To configure the Nomad Server, we update its configuration file to include a vault block, which contains a role name it will use to generate tokens (for itself and for the Nomad Clients), and an initial token.

vault {
    enabled = true
    address = "http://localhost:8200"
    task_token_ttl = "1h"
    create_from_role = "nomad-cluster"
    token = "some_token_here"
}

The initial token is generated by the ./ script - how you go about doing this in production will vary greatly depending on how you are managing your machines.

The Nomad Clients also need the Vault integration enabled, but in their case, it only needs the location of Vault, as the Server node(s) will provide tokens for the clients to use.

vault {
    enabled = true
    address = ""
}

Job Requirements

Before we go about changing the job itself, we need to write some data into Vault for the job to use:

  • Credentials: Username and password for the RabbitMQ Management UI, and the RABBITMQ_ERLANG_COOKIE
  • A policy for the job allowing Certificate Generation and Credentials access


First off, we need to create a username and password to use with the Management UI. This can be done via the Vault CLI:

vault kv put secret/rabbit/admin \
    username=administrator \
    password=$(cat /proc/sys/kernel/random/uuid)
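The password here is nothing more than a freshly read kernel UUID; each read of that file yields a new random (version 4) value, which makes a reasonable throwaway secret (assuming a Linux machine with procfs):

```shell
# Each read of this file produces a new random v4 UUID
password=$(cat /proc/sys/kernel/random/uuid)
echo "${#password}"   # 36 characters: 32 hex digits plus 4 hyphens
```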

For the Erlang Cookie, we will also generate a GUID, but this time we will store it under a separate path in Vault so that it can be locked down separately from the admin username and password if needed:

vault kv put secret/rabbit/cookie \
    cookie=$(cat /proc/sys/kernel/random/uuid)

Job Policy

Following the principle of Least Privilege, we will create a policy for our rabbit job which only allows certificates to be generated, and rabbit credentials to be read.

path "pki/issue/rabbit" {
  capabilities = [ "create", "read", "update", "delete", "list" ]

path "secret/data/rabbit/*" {
  capabilities = [ "read" ]

This is written into Vault in the same way as the other policies were:

vault policy write rabbit /vagrant/vault/rabbit-policy.hcl

Rabbit Job Configuration

The first thing we need to do to the job is specify what policies we want to use with Vault, and what to do when a token or credential expires:

task "rabbit" {
  driver = "docker"

  vault {
    policies = ["default", "rabbit"]
    change_mode = "restart"


To configure RabbitMQ to use SSL, we need to provide it with values for 3 environment variables:

  • RABBITMQ_SSL_CACERTFILE - The CA certificate
  • RABBITMQ_SSL_CERTFILE - The Certificate for RabbitMQ to use
  • RABBITMQ_SSL_KEYFILE - the private key for the RabbitMQ certificate

So let’s add a template block to the job to generate and write out a certificate. It’s worth noting that line endings matter. You either need your .nomad file to use LF line endings, or make the template a single line and use \n to add the correct line endings in. I prefer to have the file with LF line endings.

template {
  data = <<EOH
{{ $host := printf "" (env "attr.unique.hostname") }}
{{ with secret "pki/issue/rabbit" $host "format=pem" }}
{{ .Data.certificate }}
{{ .Data.private_key }}{{ end }}
EOH
  destination = "secrets/rabbit.pem"
  change_mode = "restart"
}

As we want to use the Nomad node’s hostname within the common_name parameter of the secret, we need to use a variable to fetch and format the value:

{{ $host := printf "" (env "attr.unique.hostname") }}

This can then be used by the with secret block to fetch a certificate for the current host:

{{ with secret "pki/issue/rabbit" $host "format=pem" }}

Now that we have a certificate in the ./secrets/ directory, we can add a couple of volume mounts to the container, and set the environment variables with the container paths to the certificates. Note how the root certificate is coming from the /vagrant directory, not from Vault itself. Depending on how you are provisioning your machines to trust your CA, you will have a different path here!

config {
  image = "pondidum/rabbitmq:consul"
  # ...
  volumes = [
    # ...
  ]
}

env {
  RABBITMQ_SSL_CACERTFILE = "/etc/ssl/certs/mshome.crt"
  RABBITMQ_SSL_CERTFILE = "/etc/ssl/certs/rabbit.pem"
  RABBITMQ_SSL_KEYFILE = "/etc/ssl/certs/rabbit.pem"
}

You should also notice that we are writing the secrets/rabbit.pem file into the container twice: The second write is to a file in /tmp as a workaround for the script. If we don’t create this file ourselves, the container script will create it by combining the RABBITMQ_SSL_CERTFILE file and RABBITMQ_SSL_KEYFILE file, which will result in an invalid certificate, and a nightmare to figure out…

If the Vault integration in Nomad could write a single generated secret to two separate files, we wouldn’t need this workaround. Alternatively, you could make a custom container with a customised startup script to deal with this for you.

You can see the version of this file with only these changes here


Now that we have things running with a certificate, it would be a great idea to start using the Erlang Cookie value and Management UI credentials we stored in Vault earlier. This is a super easy change to support in the Nomad file - we need to add another template block, but this time set env = true which will instruct nomad that the key-values in the template should be loaded as environment variables:

template {
    data = <<EOH
    {{ with secret "secret/data/rabbit/cookie" }}
    {{ end }}
    {{ with secret "secret/data/rabbit/admin" }}
    {{ end }}
EOH
    destination = "secrets/rabbit.env"
    env = true
}

The complete nomad file with both certificates and credentials can be seen here.


Now, all we need to do is start our new secure cluster:

nomad job run rabbit/secure.nomad

Client Libraries

Now that you have a secure version of RabbitMQ running, there are some interesting things which can be done with the client libraries. While you can just use the secure port, RabbitMQ also supports Peer Verification, which means that the client has to present a certificate for itself, and RabbitMQ will validate that both certificates are signed by a common CA.

This process can be controlled with two environment variables:

  • RABBITMQ_SSL_VERIFY set to either verify_peer or verify_none
  • RABBITMQ_SSL_FAIL_IF_NO_PEER_CERT set to true to require client certificates, false to make them optional
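As a sketch (these lines are my assumption of how it would look, not taken from the demo repository), the two variables could be set in the job's env block alongside the certificate paths:

```hcl
env {
  # require clients to present a certificate signed by the same CA
  RABBITMQ_SSL_VERIFY               = "verify_peer"
  RABBITMQ_SSL_FAIL_IF_NO_PEER_CERT = "true"
}
```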

In .net land, if you are using MassTransit, the configuration looks like this:

var bus = Bus.Factory.CreateUsingRabbitMq(c =>
{
    c.Host("rabbitmq://", r =>
    {
        r.UseSsl(ssl =>
        {
            ssl.CertificatePath = @"secrets/app.crt";
        });
    });
});

There are also lots of other interesting things you can do with SSL and RabbitMQ, such as using the certificate as authentication rather than needing a username and password per app. But you should be generating your app credentials dynamically with Vault too…

Wrapping Up

Finding all the small parts to make this work was quite a challenge. The Nomad gitter was useful when trying to figure out the certificates issue, and being able to read the source code of the Docker image for RabbitMQ was invaluable in getting the certificates working.

If anyone sees anything I’ve done wrong, or could be improved, I’m happy to hear it!

infrastructure, vagrant, nomad, consul, rabbitmq, vault


Hyper-V, Docker, and Networking Drama

22 Mar 2019

I had a major problem a few hours before giving my Nomad: Kubernetes Without the Complexity talk this morning: the demo stopped working.

Now, the first thing to note is the entire setup of the demo is scripted, and the scripts hadn’t changed. The only thing I had done was restart the machine, and now things were breaking.

The Symptoms

A docker container started inside the guest VMs with a port mapped to the machine’s public IP wasn’t reachable from outside the host.

For example, using a machine based off the bento/ubuntu-16.04 base box, provisioned with docker, running this from inside an SSH connection to the machine would work:

vagrant ssh

# launch a container which can respond to a http get
docker run -d --rm -p registry:latest

# curl by public ip
curl --silent -w "%{http_code}"   # 200

But running the same curl command on the host would fail:

# container is still running
curl --silent -w "%{http_code}"   # timeout


So it’s 5 hours before the demo (thankfully it’s not 10 minutes before), so let’s start digging into what could be causing this.

Docker Networking

I was also searching for Nomad and Docker networking issues - as I figured I could change the Nomad job to bind the container to all interfaces (e.g. -p 5000:5000) instead of just the one IP. This reply mentioned the docker0 network, and when I checked the guest machines, I saw that this network is also in the 172.* range.

So my guest machines had public addresses which happened to fall in the same range as a separate network adaptor on that machine.

Hyper-V IP Addresses

While I was checking the Windows Firewall to see if anything was weird in there, I stumbled across a rule I’d added to allow exposure of a NodeJS service from my host to Hyper-V guests (but not anywhere else). I noticed that the IP range it defined was 192.168.*, and I now had machines with 172.* addresses.

So the IP address range for guest machines had changed.

The Solution

Luckily, there is a straightforward solution to this:

Reboot until you get the range you want


The other solution is to use an External Switch in Hyper-V and bridge it with your host’s internet connection, which doesn’t really help me, as I am on a laptop, on different WiFi networks, and sometimes I use a thunderbolt based network adaptor too. And having to update/rebuild machines on every network change would be an absolute pain.

So I rebooted — a lot.

So if anyone from Microsoft is reading this: Please let us configure the Default Switch. Or have a way to recreate it without rebooting at least.

vagrant, docker, hyperv, networking


RabbitMQ clustering with Consul in Nomad

28 Jan 2019

Update If you want a secure version of this cluster, see Running a Secure RabbitMQ Cluster in Nomad.

RabbitMQ is the centre of a lot of microservice architectures, and while you can cluster it manually, it is a lot easier to use some of the auto clustering plugins, such as AWS (EC2), Consul, Etcd, or Kubernetes. As I like to use Nomad for container orchestration, I thought it would be a good idea to show how to cluster RabbitMQ when it is running in a Docker container, on an unknown host (i.e. one picked by Nomad.)

I ran into a few problems trying to get this working, but a lot of searching and some help from the RabbitMQ mailing list (thanks Luke!) got me through all the issues, so hopefully, this will be easier next time and for other people too.

It is also worth noting that this is only going to be covering how to make a cluster work, not how to make it secure (setting up TLS etc.) for production usage. There is a lot of documentation on the RabbitMQ website for further reading on this!

The full repository with all of the demo code is available on my Github.

Nomad Cluster

As this post is mostly about running RabbitMQ on Nomad, and not setting up Nomad, I’ll give the basics here - the full scripts are in the repository, and linked below too.

Vagrant is used to create a three-node cluster. As I use Hyper-V for VMs, I can’t set static IPs from the Vagrant file, so I have used another Vagrant feature: triggers.

Triggers let us specify scripts to run after Vagrant actions, so in this case, we run a script after machine1 comes up which writes out its IP to the /vagrant share. The other machines can then read this same file to join the cluster:

Vagrant.configure(2) do |config|
  config.vm.box = "bento/ubuntu-16.04"

  config.vm.provision "shell", path: "./", privileged: false

  config.vm.define "n1" do |n1|
    n1.vm.provision "shell", path: "./", privileged: false

    n1.trigger.after :up do |trigger|
      trigger.run_remote = { inline: "ip route get 1 | awk '{print $NF;exit}' > /vagrant/server_ip" }
    end
  end

  config.vm.define "n2" do |n2|
    n2.vm.hostname = "nomad2"
    n2.vm.provision "shell", path: "./", privileged: false
  end
end
The script downloads and installs both Consul and Nomad, and then the respective server and client provisioning scripts set up each service in the right way. The server machine also acts as a Nomad client, so that we don’t need 4 VMs running.


All that remains to be done is vagrant up…and wait. But while we’re waiting for the machines to provision, let’s have a look at RabbitMQ clustering, and how we’ll define the job in Nomad.

RabbitMQ Cluster

A few things to note about clustering RabbitMQ:

  • All nodes must be listening on the same port for clustering (4369 by default)
  • The ERL_EPMD_PORT variable doesn’t work on rabbitmq < 3.7.9
  • The latest Docker image for rabbitmq is 3.7.8
  • The rabbit node names must be DNS resolvable
  • The RABBITMQ_ERLANG_COOKIE must have the same value

The rabbitmq_peer_discovery_consul plugin we will use is shipped with RabbitMQ by default but is disabled. The easiest way to get everything up and running is to create your own docker container, with the plugin enabled, and a small configuration file to set a few options:

FROM rabbitmq:management-alpine
COPY rabbitmq.conf /etc/rabbitmq
RUN rabbitmq-plugins enable --offline rabbitmq_peer_discovery_consul

The rabbitmq.conf only needs a few lines:

cluster_formation.peer_discovery_backend = rabbit_peer_discovery_consul
cluster_formation.consul.svc_addr_auto = true

With the Dockerfile and configuration file in place, we can build the image:

docker build -t rabbitmq:consul .

An image built from this is also available on docker hub.

Once we have a custom container built, it’s a good idea to test that it actually works, before we start trying to get Nomad to run it. We’ll do this by creating a network in Docker so that all the containers can talk directly to each other on their pre-defined ports. Don’t forget to change CONSUL_HOST to your machine’s IP address!

docker network create rabbit
docker run -d --rm --name consul -p 8500:8500 consul

docker run -d --rm --name rabbit1 -h rabbit1 --network rabbit -p 30001:15672 -e RABBITMQ_ERLANG_COOKIE='rabbit' -e 'RABBITMQ_DEFAULT_USER=test' -e 'RABBITMQ_DEFAULT_PASS=test' -e CONSUL_HOST='' rabbitmq:consul
docker run -d --rm --name rabbit2 -h rabbit2 --network rabbit -p 30002:15672 -e RABBITMQ_ERLANG_COOKIE='rabbit' -e 'RABBITMQ_DEFAULT_USER=test' -e 'RABBITMQ_DEFAULT_PASS=test' -e CONSUL_HOST='' rabbitmq:consul
docker run -d --rm --name rabbit3 -h rabbit3 --network rabbit -p 30003:15672 -e RABBITMQ_ERLANG_COOKIE='rabbit' -e 'RABBITMQ_DEFAULT_USER=test' -e 'RABBITMQ_DEFAULT_PASS=test' -e CONSUL_HOST='' rabbitmq:consul

You can now visit http://localhost:30001 (or 30002 or 30003) and see that we have a successful cluster running. Once you’re happy with it, you can kill it all off (as we started the containers with the --rm flag, Docker will delete them for us when they stop):

docker stop rabbit1 rabbit2 rabbit3 consul
docker network rm rabbit

Nomad Rabbit Job

Now that we know our container clusters successfully, we can create a Job definition to do the same thing in Nomad. Nomad jobs are defined in HCL, a JSON-like configuration language.

A job requires a name, which datacentre(s) it should run in, and what kind of job it is. In this case, our job is called rabbit (imaginative, I know), we’ll run it in dc1 (the default value Nomad starts with), and we’ll make this job a service, as opposed to a batch or system job:

job "rabbit" {

  datacenters = ["dc1"]
  type = "service"

  group "cluster" {
    # tasks ...

The group is used to hold a collection of tasks, and when allocating a job, Nomad will make sure that all tasks belonging to an instance of a group are on the same node.

So for example, if you had a 2 node Nomad cluster, and 3 instances of a group which contained 3 tasks (e.g. API, varnish, and nginx containers), Nomad might distribute the groups like so:

image of several nodes with groups of containers

Within the group, we can specify the number of instances we want with the count property, and we also specify that for both updates and migrations, only one group can be changed at a time. This means that if you decide to upgrade the container used by the job, Nomad won’t stop all instances at once, destroying your service’s availability!

We also specify that we want to use the health checks (defined later on), rather than the state of the task itself, to determine what is healthy; how long a task must be passing its checks before we consider it healthy; and the deadline by which it must have become healthy.

group "cluster" {
  count = 3

  update {
    max_parallel = 1

  migrate {
    max_parallel = 1
    health_check = "checks"
    min_healthy_time = "5s"
    healthy_deadline = "30s"

The task is our unit of work in Nomad. In this case, we are using the docker driver, but it also supports many other drivers including exec, rkt and lxc. We configure which image to use, and importantly that the hostname is the name from Nomad!

The port_map tells nomad which ports of the container we want to expose, and labels them. We can then refer to the ports by their labels in other parts of the configuration.

task "rabbit" {
  driver = "docker"

  config {
    image = "pondidum/rabbitmq:consul"
    hostname = "${attr.unique.hostname}"
    port_map {
      amqp = 5672
      ui = 15672
      epmd = 4369
      clustering =

The env section is pretty self-explanatory; these are environment variables to pass to the container. As Consul is running on the Nomad host, we use a Nomad interpolation attribute to specify the IP of the current host, and we also set the RABBITMQ_ERLANG_COOKIE to a specific value. In a production environment, you should set this value to something unguessable, possibly using the Vault integration in Nomad to fetch a token. We can also add other settings to pass to the container here, such as RABBITMQ_DEFAULT_USER and RABBITMQ_DEFAULT_PASS. As with the cookie generation, in a production-like environment you’d probably want to use the Vault integration to pull the values for these variables.

env {
  CONSUL_HOST = "${}"
}

The resources section lets us put constraints on things like CPU, memory, IOPS, and network. In our case, we are only specifying a set of ports to expose on the network, and that we want them to be bound to specific ports on the host:

resources {
  network {
    port "amqp" { static = 5672 }
    port "ui" { static = 15672 }
    port "epmd" { static = 4369 }
    port "clustering" { static = 25672 }
  }
}

We could select different ports to bind the container ports to, or leave out the static pair entirely to have Nomad map the ports to random unused ports on the host.
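For example, a dynamically mapped UI port would look something like this sketch - Nomad picks a free host port and maps it to whatever the port_map assigns:

```hcl
resources {
  network {
    # no "static" value: Nomad chooses an unused host port for "ui"
    port "ui" {}
  }
}
```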

Finally, the service block integrates with service discovery (so, Consul), and allows us to register ports and health checks for our service. In the case of our RabbitMQ cluster, we already have service discovery integration via the RabbitMQ Consul plugin, so this registration is only used for the check feature, which is what will also be used by the migrate block to see if a task is healthy:

service {
  check {
    name     = "alive"
    type     = "tcp"
    port     = "ui"
    interval = "10s"
    timeout  = "2s"
  }
}

The check is using the ui port defined earlier to check if the UI is alive. We could also change the health check to use the amqp port instead, as that might be a better indication that the actual service can do useful things. We can define multiple checks, and are not limited to TCP; grpc, http, and script are also supported.
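As a sketch, an additional HTTP check against the management UI could look like the following (the path here is an assumption on my part, not taken from the demo repository):

```hcl
check {
  name     = "management-ui"
  type     = "http"
  port     = "ui"
  path     = "/"
  interval = "10s"
  timeout  = "2s"
}
```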

The entire job definition is below, and is also available in the repository.

job "rabbit" {

  datacenters = ["dc1"]
  type = "service"

  group "cluster" {
    count = 3

    update {
      max_parallel = 1

    migrate {
      max_parallel = 1
      health_check = "checks"
      min_healthy_time = "5s"
      healthy_deadline = "30s"

    task "rabbit" {
      driver = "docker"

      config {
        image = "pondidum/rabbitmq:consul"
        hostname = "${attr.unique.hostname}"
        port_map {
          amqp = 5672
          ui = 15672
          epmd = 4369
          clustering = 25672

      env {
        RABBITMQ_ERLANG_COOKIE = "generate_a_guid_-_or_something_for_this"
        RABBITMQ_DEFAULT_USER = "test"
        RABBITMQ_DEFAULT_PASS = "test"
        CONSUL_HOST = "${}"

      resources {
        network {
          port "amqp" { static = 5672 }
          port "ui" { static = 15672 }
          port "epmd" { static = 4369 }
          port "clustering" { static = 25672 }

      service {
        name = "rabbitmq"
        port = "ui"
        check {
          name     = "alive"
          type     = "tcp"
          interval = "10s"
          timeout  = "2s"


Running The Job

First, make sure your console can talk to Nomad, which we can do by using the server_ip file again:

export NOMAD_ADDR="http://$(cat server_ip):4646"

Now it should be possible to run the job:

nomad job run rabbit/rabbit.nomad

After a few moments, we can visit any of the Nomad hosts, and log in to the RabbitMQ UI (http://SOME_SERVER_IP:15672) and you should be greeted with a running cluster with three nodes:

rabbitmq cluster dashboard


  • Kill a container on one of the nodes and see what happens (vagrant ssh n2 then docker stop <SOME_CONTAINER_ID>)
  • Create an Application which you deploy to Nomad which uses service discovery to talk to RabbitMQ
  • Create some more Nomad clients, and try making a bigger RabbitMQ cluster

If you have any questions, feel free to comment below or send me a tweet.

infrastructure, vagrant, nomad, consul, rabbitmq


Testing Immutable Infrastructure

01 Jan 2019

In my previous post, I glossed over one of the most important and useful parts of Immutable Infrastructure: testability. There are many kinds of tests we can write for our infrastructure, but they should all be focused on the machine/service and maybe its nearest dependencies, not the entire system.

While this post focuses on testing a full machine (both locally in a VM, and remotely as an Amazon EC2 instance), it is also possible to do most of the same kind of tests against a Docker container. In fact, one of the tools used in this post supports building Docker containers as an output in parallel to the AMIs, so this can also assist in providing a migration path to/from Docker.

As an example, I will show how I built and tested a LogStash machine, including how to verify that the script to create the production machine is valid, that the machine itself has been provisioned correctly, and that the services inside work as expected.

I have published all the source code to GitHub. The examples in this post are all taken from the repository but might have a few bits removed just for readability. Check the full source out if you are interested!

Repository Structure and Tools

When it comes to building anything that you will have lots of, consistency is key to making it manageable. To that end, I have a small selection of tools that I use, and a repository structure I try and stick to. They are the following:

Vagrant - This is a tool for building and managing virtual machines. It can be backed by many different providers such as Docker, Hyper-V and VirtualBox. We’ll use this to build a local Linux machine to develop and test LogStash in. I use the Hyper-V provider, as Docker For Windows also uses Hyper-V, and enabling Hyper-V disables other virtualisation tools.

Packer - This tool provides a way to build machine images. Where Vagrant builds running machines, Packer builds the base images for you to boot, and can build multiple different ones (in parallel) from one configuration. We’ll use this to create our AMIs (Amazon Machine Images.)

Jest - This is a testing framework written in (and for) JavaScript. Whatever testing tool works best for your environment is what you should be using, but I use Jest as it introduces minimal dependencies, is cross-platform, and has some useful libraries for doing things like diffing JSON.

The repository structure is pretty simple:

  • scripts/
  • src/
  • test/
  • logstash.json
  • package.json
  • vagrantfile

The src directory is where our application code will live. If the application is compiled, the output goes to the build directory (which is not tracked in source-control.) The test directory will contain all of our tests, and the scripts directory will contain everything needed for provisioning our machines.

We’ll describe what the use of each of these files is as we go through the next section.

Local Development

To create our virtual machine locally, we will use Vagrant. To tell Vagrant how to build our machine, we need to create a vagrantfile in our repository, which will contain the machine details and provisioning steps.

The machine itself has a name, CPU count, and memory specified. There is also a setting for Hyper-V which allows us to use a differencing disk, which reduces the startup time for the VM, and how much disk space it uses on the host machine.

For provisioning, we specify to run the relevant two files from the scripts directory.

Vagrant.configure("2") do |config|
    config.vm.box = "bento/ubuntu-16.04"

    config.vm.provider "hyperv" do |hv|
        hv.vmname = "LogStash"
        hv.cpus = 1
        hv.memory = 2048
        hv.linked_clone = true
    end

    config.vm.provision "shell", path: "./scripts/"
    config.vm.provision "shell", path: "./scripts/"
end

To keep things as similar as possible between our development machine and our output AMI, I keep as much of the setup as possible in one script: scripts/ In the case of our LogStash setup, this means installing Java, LogStash, some LogStash plugins, and enabling the service on reboots:

#! /bin/bash

# add elastic's package repository
wget -qO - | sudo apt-key add -
echo "deb stable main" | sudo tee -a /etc/apt/sources.list.d/elastic-6.x.list
sudo apt-get update

# install openjdk and set environment variable
sudo apt-get install openjdk-8-jre -y
JAVA=$(readlink -f $(which java) | sed "s:bin/java::")
echo "JAVA_HOME=$JAVA" | sudo tee --append /etc/environment

#install logstash and plugins
sudo apt-get install logstash -y
/usr/share/logstash/bin/logstash-plugin install logstash-filter-uuid
/usr/share/logstash/bin/logstash-plugin install logstash-filter-prune

sudo systemctl enable logstash.service

Vagrant will automatically mount its working directory into the VM under the path /vagrant. This means we can add a second provisioning script (scripts/ to link the /vagrant/src directory to the LogStash configuration directory (/etc/logstash/conf.d), meaning we can edit the files on the host machine, and then restart LogStash to pick up the changes.

#! /bin/bash
sudo rm -rf /etc/logstash/conf.d
sudo ln -s /vagrant/src /etc/logstash/conf.d

sudo systemctl start logstash.service

Now that we have a vagrantfile, we can start the virtual machine with a single command. Note, Hyper-V requires administrator privileges, so you need to run this command in an admin terminal:

vagrant up

After a while, your new LogStash machine will be up and running. If you want to log into the machine and check files and processes etc., you can run the following command:

vagrant ssh

An argument can also be provided to the ssh command to be executed inside the VM, which is how I usually trigger LogStash restarts (as it doesn’t seem to detect when I save the config files in the src directory):

vagrant ssh -c 'sudo systemctl restart logstash'


Building the AMI

To create the deployable machine image, I use Packer. The process is very similar to how Vagrant is used: select a base AMI, create a new EC2 machine, provision it, and save the result as a new AMI.

Packer is configured with a single json file, in this case, named logstash.json. The file is split into four parts: variables, builders, provisioners, and post-processors. I won’t include the post-processors section as it’s not needed when building AMIs.


Variables

The variables property is for all configuration that you can pass to Packer. Their values can come from Environment Variables, CLI parameters, Consul, Vault, and others. In the LogStash example, there are three variables:

  "variables": {
    "aws_access_key": "",
    "aws_secret_key": "",
    "ami_users": "{{env `AMI_ACCOUNTS`}}"
  }

The aws_access_key and aws_secret_key are known names - unless we specify some value, they will automatically be read from your AWS config (in ~/.aws/), or if running on EC2, from the EC2 machine profile.

The ami_users is a custom variable which will read the AMI_ACCOUNTS environment variable by default. This particular one is used so that I can grant access to the resulting AMI to multiple AWS accounts, which is useful if you’re running in an Organisation with multiple Accounts. For example, if the AMI is built in a common account, and will be deployed into dev, qa and prod accounts, then you would populate the AMI_ACCOUNTS as a CSV of account IDs.


Builders

Packer can build many different kinds of machine image, but for this, we only need one: amazon-ebs.

  "builders": [
    {
      "type": "amazon-ebs",
      "access_key": "{{user `aws_access_key`}}",
      "secret_key": "{{user `aws_secret_key`}}",
      "region": "eu-west-1",
      "instance_type": "t2.micro",
      "source_ami_filter": {
        "filters": {
          "virtualization-type": "hvm",
          "name": "ubuntu/images/*ubuntu-xenial-16.04-amd64-server-*",
          "root-device-type": "ebs"
        },
        "owners": ["099720109477"],
        "most_recent": true
      },
      "ssh_username": "ubuntu",
      "ami_name": "logstash {{timestamp}}",
      "ami_users": "{{user `ami_users`}}"
    }
  ]
The two most interesting properties of this are source_ami_filter and ami_users. The source_ami_filter works in a very similar manner to the AWS CLI’s describe-images --filters parameter, albeit in a more readable format. In this case, I am specifying that I want an ubuntu-xenial base, and I want it to be an official Canonical image, so I specify their Account ID as the owner. I also specify the most_recent property, as this filter would otherwise return all versions of this AMI which Canonical publish.

The ami_users is what lets me grant access to the AMI from other accounts (rather than just making it public). The property’s value should be an array, but Packer is smart enough to expand the CSV in the user variable into an array for us.


Provisioners

The provisioners array items are executed in the order they are specified. To set up the machine, I use the shell provisioner to create a temporary directory, then the file provisioner to upload the files in the src directory to that temporary directory. Finally, a second shell provisioner uploads and runs the scripts/ and scripts/ files.

  "provisioners": [
    {
      "type": "shell",
      "inline": ["mkdir -p /tmp/src"]
    },
    {
      "type": "file",
      "source": "./src/",
      "destination": "/tmp/src"
    },
    {
      "type": "shell",
      "scripts": ["./scripts/", "./scripts/"]
    }
  ]
The file is very small and does roughly the same thing as the script, but rather than symlinking the /vagrant directory, it moves the uploaded src directory into the right location for LogStash:

#! /bin/sh

sudo rm /etc/logstash/conf.d/*
sudo cp -r /tmp/src/* /etc/logstash/conf.d

Note that this doesn’t start the LogStash service - this gets done by the UserData when we launch a new instance, as often we need to pass in additional configuration parameters, and don’t want the service running until that has been done.


Running the Build

To create the AMI, we need to invoke Packer. If I am running Packer on a remote machine via SSH, I run it inside tmux, so that disconnects don’t fail the process:

packer build -var "ami_users=111,222,333" logstash.json

After a while, Packer will finish, leaving you with an output which will include the new AMI ID:

==> Builds finished. The artifacts of successful builds are:
--> amazon-ebs: AMIs were created:

eu-west-1: ami-123123123

We’ll get back to this output later when we create a build script that will also run our tests. Before we get to that, however, let’s look at how we can write tests which target both the local Vagrant machine and the AMI too.


Testing Tools

To test the machines, I am using Jest. There isn’t anything particularly interesting going on in the package.json, other than a few babel packages being installed so that I can use ES6 syntax:

{
  "scripts": {
    "watch": "jest --watch",
    "test": "jest"
  },
  "devDependencies": {
    "babel-core": "^6.26.3",
    "babel-jest": "^23.6.0",
    "babel-preset-env": "^1.7.0",
    "jest": "^23.6.0",
    "regenerator-runtime": "^0.13.1"
  }
}

Packer Configuration Testing

There are a number of tests we can do to make sure our Packer configuration is valid before running it. This includes things like checking the base AMI is from a whitelisted source (such as our accounts, Amazon and Canonical). The test has to handle the possibility of multiple builders, and that some builders might not have a source_ami_filter. It also handles if no owner has been specified at all, which we also consider a “bad thing”:

const config = require("../logstash.json"); // the parsed Packer configuration

const ourAccounts = [ "111111", "222222", "333333", "444444" ];
const otherOwners = [ "amazon", "099720109477" /*canonical*/ ];

// small helper; assumed to report a readable name for a builder
const builderName = builder => || builder.type;

describe("ami builder", () => {

  it("should be based on a whitelisted owner", () => {
    const allOwners = ourAccounts.concat(otherOwners);
    const invalidOwners = owners => owners.filter(owner => !allOwners.includes(owner));

    const amisWithInvalidOwners = config.builders
      .filter(builder => builder.source_ami_filter)
      .map(builder => ({
        name: builderName(builder),
        invalidOwners: invalidOwners(builder.source_ami_filter.owners || [ "NO OWNER SPECIFIED" ])
      }))
      .filter(builders => builders.invalidOwners.length > 0);

    expect(amisWithInvalidOwners).toEqual([]);
  });
});



I also test that certain variables (ami_users) have been defined, and have been used in the right place:

describe("variables", () => {
  it("should have a variable for who can use the ami", () => {
    expect(config.variables).toHaveProperty("ami_users");
  });

  it("should read ami_users from AMI_ACCOUNTS", () => {
    expect(config.variables.ami_users).toBe("{{env `AMI_ACCOUNTS`}}");
  });
});

describe("ami builder", () => {
  it("should set the ami_user", () => {

    const invalidUsers = config.builders
      .map(builder => ({
        name: builderName(builder),
        users: builder.ami_users || "NO USERS SPECIFIED"
      }))
      .filter(ami => !ami.users.match(/{{\s*user\s*`ami_users`\s*}}/));

    expect(invalidUsers).toEqual([]);
  });
});


Other tests you might want to add are that the base AMI is under a certain age, or that your AMI has certain tags included, or that it is named in a specific manner.
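A naming-convention check can be written in the same style as the owner and ami_users tests. This is a minimal sketch: the inline config object stands in for the parsed logstash.json, and the exact convention regex is an assumption to adjust to your own naming scheme:

```javascript
// Sketch: enforce an AMI naming convention across all builders.
// `config` here is a stand-in for the parsed logstash.json.
const config = {
  builders: [{ type: "amazon-ebs", ami_name: "logstash {{timestamp}}" }]
};

// assumed convention: "logstash {{timestamp}}"
const badlyNamed = config.builders
  .map(builder => builder.ami_name || "NO NAME SPECIFIED")
  .filter(name => !/^logstash \{\{timestamp\}\}$/.test(name));

// in a Jest test this would be: expect(badlyNamed).toEqual([]);
console.log(badlyNamed);
```

The same shape works for tag checks: map each builder to its missing tags, filter to the non-empty results, and expect an empty array.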

Machine Testing

Machine testing is for checking that our provisioning worked successfully. This is very useful, as subtle bugs can creep in when you don’t verify what happens.

For example, a machine I built copied a configuration directory to a target location, but was missing the -r flag, so when I later added a subdirectory, the machine failed as the referenced files didn’t exist.

So that the tests work with both the Vagrant and Packer built versions, we take the machine’s address and SSH details from the environment:

import { spawnSync } from "child_process";
import { createConnection } from "net";

const host = process.env.LOGSTASH_ADDRESS; // e.g. "";
const user = process.env.LOGSTASH_USER; // the SSH username, e.g. "vagrant" or "ubuntu"
const keyPath = process.env.LOGSTASH_KEYPATH; // ".vagrant/machines/default/hyperv/private_key";

We also define two helper methods: one to check if a TCP port is open, and one which uses SSH to execute a command and read the response from the machine:

const execute = command => {
  const args = [`${user}@${host}`, `-i`, keyPath, command];
  const ssh = spawnSync("ssh", args, { encoding: "utf8" });
  const lines = ssh.stdout.split("\n");

  // drop the trailing empty line caused by the final newline
  if (lines[lines.length - 1] === "") {
    return lines.slice(0, lines.length - 1);
  }
  return lines;
};

const testPort = port => new Promise((resolve, reject) => {
  const client = createConnection({ host: host, port: port });

  client.on("error", err => reject(err));
  client.on("connect", () => {
    client.end();
    resolve();
  });
});

We can then add some tests which check the files were written to the right place, that port 5044 is open, and port 9600 is closed:

describe("the machine", () => {

  it("should have the correct configuration", () => {
    const files = execute("find /etc/logstash/conf.d/* -type f");

    // adjust this list to match the pipeline files kept in ./src
    expect(files).toEqual(["/etc/logstash/conf.d/logstash.conf"]);
  });

  it("should be listening on 5044 for beats", () => testPort(5044));
  it("should not be listening on 9600", () => expect(testPort(9600)).rejects.toThrow("ECONNREFUSED"));
});

Of course, as we can execute any command inside the machine, we can check pretty much anything:

  • tail the LogStash log and see if it’s got the right contents
  • check if the service is started
  • check the service is enabled on boot
  • check the environment variables have been written to the right files
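As a sketch of that last style of check, here is how asserting on command output might look. The execute helper is stubbed with a canned response so the snippet stands alone; in the real test it runs the command over SSH as shown earlier:

```javascript
// Stub of the SSH `execute` helper, returning a canned response;
// the real helper runs the command inside the machine and returns stdout lines.
const execute = command => ["enabled"];

const status = execute("systemctl is-enabled logstash");

// in a Jest test: expect(status).toEqual(["enabled"]);
if (status[0] !== "enabled") {
  throw new Error(`logstash is not enabled on boot: ${status.join("\n")}`);
}
console.log("boot-enable check passed");
```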

Application Testing

There are two styles of Application Testing: white-box and black-box. White-box will be tests run on the application inside the machine, using minimal external dependencies (preferably none at all), and Black-box will be run on the application from outside the machine, either using direct dependencies, or fakes.

It’s worth noting that both white-box and black-box tests are slow, mostly down to how slow LogStash is at starting up, although only giving it 1 CPU and 2 GB of RAM probably doesn’t help.

Whitebox Testing LogStash

To white-box test LogStash, I use a technique partially based on the Agolo LogStash Test Runner. The process for the tests is to run LogStash interactively (rather than as a service), send it a single event, record the output events, and compare them to an expected output.

The test cases are kept in separate folders, with two files. First is the input file, imaginatively called input.log, which will contain one json encoded event per line. The format needs to match what the result of FileBeat sending an event to LogStash would be. In this case, it means a few extra fields, and a message property containing a string of json. Formatted for readability, the object looks like this:

{
  "@timestamp": "2018-12-27T14:08:24.753Z",
  "beat": { "hostname": "Spectre", "name": "Spectre", "version": "5.3.0" },
  "fields": { "environment": "local", "log_type": "application" },
  "input_type": "log",
  "message": "{\"Timestamp\": \"2018-12-18T17:06:27.7112297+02:00\",\"Level\": \"Information\",\"MessageTemplate\": \"This is the {count} message\",\"Properties\": {\"count\": 4,\"SourceContext\": \"LogLines.GetOpenPurchasesHandler\",\"ApplicationName\": \"FileBeatTest\",\"CorrelationId\": \"8f341e8e-6b9c-4ebf-816d-d89c014bad90\",\"TimedOperationElapsedInMs\": 1000}}",
  "offset": 318,
  "source": "D:\\tmp\\logs\\single.log",
  "type": "applicationlog"
}

I also define an output.log, which contains the expected result(s), again one json encoded event per line. The example pipeline in the repository will emit two events for a given input, so this file contains two lines of json (again, newlines added for readability here):

{
  "source": "D:\\tmp\\logs\\single.log",
  "@version": "1",
  "fields": { "log_type": "application", "environment": "local" },
  "@timestamp": "2018-12-18T15:06:27.711Z",
  "offset": 318,
  "ApplicationName": "FileBeatTest",
  "host": "ubuntu-16",
  "type": "applicationlog",
  "CorrelationId": "8f341e8e-6b9c-4ebf-816d-d89c014bad90",
  "MessageTemplate": "This is the {count} message",
  "Level": "Information",
  "Context": "LogLines.GetOpenPurchasesHandler",
  "TimeElapsed": 1000,
  "Properties": { "count": 4 }
}
{
  "duration": 1000000,
  "timestamp": 1545145586711000,
  "id": "<generated>",
  "traceid": "8f341e8e6b9c4ebf816dd89c014bad90",
  "name": "LogLines.GetOpenPurchasesHandler",
  "localEndpoint": { "serviceName": "FileBeatTest" }
}

To enable sending the lines directly to LogStash (rather than needing to use FileBeat), we define an input.conf file, which configures LogStash to read json from stdin:

input {
  stdin { codec => "json_lines" }
}

And an output.conf file which configures LogStash to write the output as json lines to a known file path:

output {
  file {
    path => "/tmp/test/output.log"
    codec => "json_lines"
  }
}

The tests need to be run inside the machine itself, so I created a script in the ./scripts directory which will do all the work, and can be run by the execute method in a Jest test. The script stops the LogStash service, copies the current configuration from the ./src directory and the replacement input.conf and output.conf files to a temporary location, and then runs LogStash once per test case, copying the result file to the test case’s directory.

#! /bin/bash

sudo systemctl stop logstash

temp_path="/tmp/test"                  # must match the path in output.conf
test_source="/vagrant/test/acceptance" # where the test case directories live

sudo rm -rf "$temp_path/*"
sudo mkdir -p $temp_path
sudo cp -r /vagrant/src/* $temp_path
sudo cp $test_source/*.conf $temp_path

find $test_source/* -type d | while read test_path; do
    echo "Running $(basename $test_path) tests..."

    sudo /usr/share/logstash/bin/logstash \
        "--path.settings" "/etc/logstash" \
        "--path.config" "$temp_path" \
        < "$test_path/input.log"

    sudo touch "$temp_path/output.log"   # create it if it doesn't exist (dropped logs etc.)
    sudo rm -f "$test_path/result.log"
    sudo mv "$temp_path/output.log" "$test_path/result.log"

    echo "$(basename $test_path) tests done"
done

sudo systemctl start logstash

To execute this, we use the beforeAll function to run it once - we also pass in Number.MAX_SAFE_INTEGER as the timeout, as by default beforeAll will time out after 5 seconds, and this script is slow as hell (LogStash takes ages to start up).

Once the script has finished running, we load each test’s output.log and result.log files, parse each line as json, compare the objects, and print out the delta if the objects are not considered equal:

import fs from "fs";
import path from "path";
import { DiffPatcher, formatters } from "jsondiffpatch";

const source = "./test/acceptance";
const isDirectory = p => fs.lstatSync(p).isDirectory();

// read a file, parsing each line as a json object
const readFile = filepath => fs
  .readFileSync(filepath, "utf8")
  .split("\n")
  .filter(line => line.length > 0)
  .map(line => JSON.parse(line));

const cases = fs
  .readdirSync(source)
  .map(name => path.join(source, name))
  .filter(isDirectory);

describe("logstash", () => {
  // `execute` is the SSH helper shown earlier
  beforeAll(
    () => execute("/vagrant/scripts/"),
    Number.MAX_SAFE_INTEGER);

  test.each(cases)("%s", directoryPath => {
    const expected = readFile(path.join(directoryPath, "output.log"));
    const actual = readFile(path.join(directoryPath, "result.log"));

    const diffpatch = new DiffPatcher({
      // ignore the "id" property when the expected value is "<generated>"
      propertyFilter: (name, context) => {
        if (name !== "id") {
          return true;
        }
        return context.left.id !== "<generated>";
      }
    });

    const delta = diffpatch.diff(expected, actual);
    const output = formatters.console.format(delta);

    if (output.length) {
      console.log(output);
    }
    expect(delta).toBeUndefined();
  });
});

Blackbox Testing LogStash

As the machine has ports open for FileBeat and will send its output to ElasticSearch, we can set up a fake HTTP server, send some log events via FileBeat to the VM and check we receive the right HTTP calls to our fake server.

While looking at how to do this, I came across the lumberjack-protocol package on NPM, but unfortunately, it only supports lumberjack v1, and FileBeat and LogStash are now using v2, so you would have to use a local copy of FileBeat to do the sending.

Due to the complexity of implementing this, and the diminished return on investment (the other tests should be sufficient), I have skipped creating the Blackbox tests for the time being.

AMI Testing

The final phase! Now that we are reasonably sure everything works locally, we need to build our AMI and test that everything works there too, as it would be a shame to update an Auto Scale Group with the new image which doesn’t work!

All that needs to happen to run the tests against an EC2 instance is to set the three environment variables we used with Vagrant, to values for communicating with the EC2 instance. To do this, we’ll need the EC2 IP Address, the username for SSH, and the private key for SSH authentication.

The first thing our build script needs to do is create the AMI. This is done in the same way as mentioned earlier, but with the slight difference of also piping the output to tee:

packer_log=$(packer build logstash.json | tee /dev/tty)
ami_id=$(echo "$packer_log" | tail -n 1 | sed 's/.*\(ami.*\)/\1/')

By using tee, we can pipe the build log from Packer to both the real terminal (/dev/tty), and to a variable called packer_log. The script then takes the last line and uses some regex to grab the AMI ID.
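If you would rather do that extraction somewhere more testable than a sed one-liner, the same match is easy to express in the test tooling's language. The sample line below is taken from the Packer output shown earlier:

```javascript
// Extract the AMI ID from the last line of Packer's build output,
// mirroring the sed expression above.
const lastLine = "eu-west-1: ami-123123123"; // sample last line from a Packer run

const match = lastLine.match(/(ami-[a-z0-9]+)/);
const amiId = match ? match[1] : null;

console.log(amiId); // ami-123123123
```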

Next up, the script uses the AWS CLI to launch an EC2 instance based on the AMI, and store its IP Address and Instance ID:

json=$(aws ec2 run-instances \
  --image-id "$ami_id" \
  --instance-type t2.small \
  --key-name "$keypair_name" \
  --region eu-west-1 \
  --subnet-id "$subnet_id" \
  --security-group-ids "$security_group_id" \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=logstash-verification}]' \
  --user-data "$userdata")

instance_id=$(echo "$json" | jq -r .Instances[0].InstanceId)
private_ip=$(echo "$json" | jq -r .Instances[0].PrivateIpAddress)

The IP Address is then used to set up the environment variables which the node test scripts use to locate the machine:

LOGSTASH_ADDRESS="$private_ip" \
LOGSTASH_USER="ubuntu" \
LOGSTASH_KEYPATH="~/.ssh/id_rsa" \
npm run test

Finally, the script uses the Instance ID to terminate the instance:

aws ec2 terminate-instances \
  --instance-ids "$instance_id"

Wrapping Up

Hopefully, this (rather long) post is a useful introduction (!) to how I tackle testing Immutable Infrastructure. All of these techniques for testing the machine and application can be used for testing things like Docker containers too (and handily, Packer can be used to create Docker containers also).

As mentioned earlier, the repository is available on GitHub.

logstash, microservices, infrastructure, vagrant, packer, aws, testing