Feature Toggles: Reducing Coupling

11 Jun 2019

One of the points I make in my Feature Toggles talk is that you shouldn’t be querying a toggle’s status all over your codebase. Ideally, each toggle gets checked in as few places as possible - preferably only one place. The advantage of doing this is that very little of your codebase needs to be coupled to the toggles (either to the toggle itself, or to the library/system used to manage the toggles).

This post will go over several situations when that seems hard to do, namely: multiple services, multiple distinct areas of a codebase, and multiple times in a complex class or method. As in the previous post on this, we will be using Branch By Abstraction to do most of the heavy lifting.

Multiple Services

Multiple services interacting with the same feature toggle is a problematic situation to deal with, especially if multiple teams own the different services.

One of the main issues with this is trying to coordinate the two (or more) services. For example, if one team needs to switch off their implementation due to a problem, should the other services be turned off too? To compound the problem, what happens if one system can react to the toggle change faster than the other?

Services changing configuration at different speeds can also cause issues with in-flight requests: if the message format is different when the toggle is on, will the receiving system be able to process a message that was produced while the toggle was in one state but consumed while it was in the other?

We can solve some of this by giving each service its own toggle (and not allowing either service to query the other’s toggle state), and by writing the services so that they can handle both the old and new request formats at the same time.

For example, say we have a sending system which, when its toggle is off, sends this DTO:

public class PurchaseOptions
{
    public Address Address { get; set; }
}

And when the toggle is enabled, it will send the following DTO instead:

public class PurchaseOptions
{
    public BillingAddress BillingAddress { get; set; }
    public DeliveryAddress DeliveryAddress { get; set; }
}

To make the receiving system handle this, we deserialize the request into a DTO which contains all possible versions of the address, and then use the best version based on our own toggle state:

public class PurchaseOptionsRequest
{
    public Address Address { get; set; }
    public BillingAddress BillingAddress { get; set; }
    public DeliveryAddress DeliveryAddress { get; set; }
}

public class PurchaseController
{
    public async Task<PurchaseOptionsResponse> Post(PurchaseOptionsRequest request)
    {
        if (separateAddresses.Enabled)
        {
            var deliveryAddress = request.DeliveryAddress ?? request.Address;
            var billingAddress = request.BillingAddress ?? request.Address;

            ConfigureDelivery(deliveryAddress);
            CreateInvoice(billingAddress, deliveryAddress);
        }
        else
        {
            var address = request.Address ?? request.DeliveryAddress ?? request.BillingAddress;

            ConfigureDelivery(address);
            CreateInvoice(address, address);
        }

        return new PurchaseOptionsResponse(/* ... */); // build the response as normal (elided)
    }
}

Note how both sides of the toggle check read all three possible address fields, but prefer different fields. This means that no matter whether the sending service has its toggle on or not, we will use the correct address.

Multiple Areas of the Codebase

To continue using the address example, we might have a UI, Controller and Handler, which all need to act differently based on the same toggle:

  • The UI needs to display either one or two address editors
  • The controller needs to have different validation logic for multiple addresses
  • The Command Handler will need to dispatch different values

We can solve all of this by utilising Branch By Abstraction and Dependency Injection to make most of the codebase unaware that a feature toggle exists. Even the implementations won’t need to know about the toggles.

public class Startup
{
    public void ConfigureContainer(ServiceRegistry services)
    {
        if (separateAddresses.Enabled) {
            services.Add<IAddressEditor, MultiAddressEditor>();
            services.Add<IRequestValidator, MultiAddressValidator>();
            services.Add<IDeliveryHandler, MultiAddressDeliveryHandler>();
        }
        else {
            services.Add<IAddressEditor, SingleAddressEditor>();
            services.Add<IRequestValidator, SingleAddressValidator>();
            services.Add<IDeliveryHandler, SingleAddressDeliveryHandler>();
        }
    }
}

Let’s look at how one of these might work. The IRequestValidator has a definition like so:

public interface IRequestValidator<TRequest>
{
    IEnumerable<string> Validate(TRequest request);
}

There is a middleware in the API request pipeline which picks the right validator out of the container, based on the request type being processed (a sketch of this is shown after the validators below). We implement two validators, one for the single address, and one for the multi-address version:

public class SingleAddressValidator : IRequestValidator<SingleAddressRequest>
{
    public IEnumerable<string> Validate(SingleAddressRequest request)
    {
        //complex validation logic..
        if (request.Address == null)
        {
            yield return "No Address specified";
            yield break;
        }

        if (PostCode.Validate(request.Address.PostCode) == false)
            yield return "Invalid Postcode";
    }
}

public class MultiAddressValidator : IRequestValidator<MultiAddressRequest>
{
    public IEnumerable<string> Validate(MultiAddressRequest request)
    {
        var billingMessages = ValidateAddress(request.BillingAddress);

        if (billingMessages.Any())
            return billingMessages;

        if (request.DifferentDeliveryAddress)
            return ValidateAddress(request.DeliveryAddress);

        return Enumerable.Empty<string>();
    }
}

The implementations themselves don’t need to know about the state of the toggle, as the container and middleware take care of picking the right implementation to use.
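
The middleware itself is out of scope for this post, but a minimal sketch of the interesting part - resolving the right validator for a request - could look something like the following. The class name and the use of dynamic dispatch here are illustrative only, and how the request object reaches this point depends on your pipeline:

public class RequestValidationDispatcher
{
    private readonly IServiceProvider _container;

    public RequestValidationDispatcher(IServiceProvider container)
    {
        _container = container;
    }

    public IEnumerable<string> Validate(object request)
    {
        // close IRequestValidator<> over the concrete request type, e.g.
        // IRequestValidator<MultiAddressRequest> for a MultiAddressRequest
        var validatorType = typeof(IRequestValidator<>).MakeGenericType(request.GetType());

        dynamic validator = _container.GetService(validatorType);

        if (validator == null)
            return Enumerable.Empty<string>();

        return validator.Validate((dynamic)request);
    }
}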

Multiple Places in a Class/Method

If you have a single method (or class) which needs to check the toggle state in multiple places, you can use the same Branch by Abstraction technique as above: create a custom interface and a pair of implementations which contain all the functionality that changes.

For example, say we have a method for finding an offer for a customer’s basket, which checks the toggle state in a few separate places:

public SuggestedBasket CreateOffer(CreateOfferCommand command)
{
    if (newFeature.Enabled) {
        ExtraPreValidation(command).Throw();
    } else {
        StandardPreValidation(command).Throw();
    }

    var offer = SelectBestOffer(command.Items);

    if (offer == null && newFeature.Enabled) {
        offer = FindAlternativeOffer(command.Customer, command.Items);
    }

    return SuggestedBasket
        .From(command)
        .With(offer);
}

We can extract an interface for this, and replace the toggle-specific parts with calls to the interface instead:

public interface ICreateOfferStrategy
{
    IThrowable PreValidate(CreateOfferCommand command);
    Offer AlternativeOffer(CreateOfferCommand command, Offer existingOffer);
}

public class DefaultOfferStrategy : ICreateOfferStrategy
{
    public IThrowable PreValidate(CreateOfferCommand command)
    {
        return StandardPreValidation(command);
    }

    public Offer AlternativeOffer(CreateOfferCommand command, Offer existingOffer)
    {
        return existingOffer;
    }
}

public class NewFeatureOfferStrategy : ICreateOfferStrategy
{
    public IThrowable PreValidate(CreateOfferCommand command)
    {
        return ExtraPreValidation(command);
    }

    public Offer AlternativeOffer(CreateOfferCommand command, Offer existingOffer)
    {
        if (existingOffer != null)
            return existingOffer;

        return FindAlternativeOffer(command.Customer, command.Items);
    }
}

public class OfferBuilder
{
    private readonly ICreateOfferStrategy _strategy;

    public OfferBuilder(ICreateOfferStrategy strategy)
    {
        _strategy = strategy;
    }

    public SuggestedBasket CreateOffer(CreateOfferCommand command)
    {
        _strategy.PreValidate(command).Throw();

        var offer = SelectBestOffer(command.Items);

        offer = _strategy.AlternativeOffer(command, offer);

        return SuggestedBasket
            .From(command)
            .With(offer);
    }
}

Now that we have done this, our CreateOffer method has shrunk dramatically and no longer needs to know about the toggle state: as with the rest of our DI examples, the toggle can be queried once at startup and the correct ICreateOfferStrategy implementation registered into the container.
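
The registration follows the same pattern as the earlier Startup example (NewFeatureOfferStrategy being the name given to the toggled-on implementation above):

public class Startup
{
    public void ConfigureContainer(ServiceRegistry services)
    {
        if (newFeature.Enabled)
            services.Add<ICreateOfferStrategy, NewFeatureOfferStrategy>();
        else
            services.Add<ICreateOfferStrategy, DefaultOfferStrategy>();
    }
}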

End

Hopefully, this post gives a few insights into different ways of reducing the number of calls to your feature toggling library, and helps prevent you from scattering lots of if statements around the codebase!

featuretoggles, c#, di, microservices

---

Feature Toggles: Branch by Abstraction

03 Jun 2019

Recently, I was asked if I could provide an example of Branch By Abstraction when dealing with feature toggles. As this has come up a few times, I thought a blog post would be a good idea so I can refer others to it later too.

The Context

As usual, this is some kind of backend (micro)service, and it will send email messages somehow. We will start with two implementations of message sending: the “current” version, which is synchronous, and a “new” version, which is asynchronous.

We’ll do a bit of setup to show how feature toggling can be done in three ways for this feature:

  1. Static: Configured on startup
  2. Dynamic: Check the toggle state on each send
  3. Dynamic: Check the toggle for a given message

Abstractions and Implementations

We have an interface called IMessageDispatcher which defines a single Send method, returning a Task (or Promise, Future, etc., depending on your language).

public interface IMessageDispatcher
{
    Task<SendResult> Send(Message message);
}

The two message sending implementations don’t matter, but we need the types to show the other code examples. Fill in the blanks if you want!

public class HttpMessageDispatcher : IMessageDispatcher
{
    // ...
}

public class QueueMessageDispatcher : IMessageDispatcher
{
    // ...
}

1. Static Configuration

The word static in this context means that we check the feature toggle’s state once on startup and pick an implementation. We don’t recheck the toggle state unless the service is restarted.

For instance, in an ASP.Net core application, you could change which service is registered into the container at startup like so:

public void ConfigureServices(IServiceCollection services)
{
    var toggleSource = new ToggleSource(/* ... */);

    if (toggleSource.IsActive(Toggles.AsyncMessageDispatch))
        services.AddTransient<IMessageDispatcher, QueueMessageDispatcher>();
    else
        services.AddTransient<IMessageDispatcher, HttpMessageDispatcher>();
}

This means any class which takes in an instance of IMessageDispatcher doesn’t need to check the toggle state or worry about which implementation to use.
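
For example, a (hypothetical) class which sends order notifications just takes the interface and calls Send, without caring which implementation the container resolved:

public class OrderNotifier
{
    private readonly IMessageDispatcher _dispatcher;

    public OrderNotifier(IMessageDispatcher dispatcher)
    {
        _dispatcher = dispatcher;
    }

    public Task<SendResult> NotifyShipped(Message message)
    {
        // no toggle check needed here: the container has already
        // picked the Http or Queue implementation at startup
        return _dispatcher.Send(message);
    }
}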

2. Dynamic Configuration

We can build on this abstraction to enable more flexibility, if we want to be able to change the toggle state while the service is running, without needing to restart it. To do this, we can implement another version of the IMessageDispatcher interface which will check the toggle state on each invocation of Send():

public class ToggleDispatcher : IMessageDispatcher
{
    private readonly Func<bool> _isToggleActive;
    private readonly IMessageDispatcher _queueSender;
    private readonly IMessageDispatcher _httpSender;

    public ToggleDispatcher(Func<bool> isToggleActive, IMessageDispatcher queueSender, IMessageDispatcher httpSender)
    {
        _isToggleActive = isToggleActive;
        _queueSender = queueSender;
        _httpSender = httpSender;
    }

    public Task<SendResult> Send(Message message)
    {
        var chosen = _isToggleActive()
            ? _queueSender
            : _httpSender;

        return chosen.Send(message);
    }
}

And in our startup class, we can change the service registration to use the new version. Note how we are now registering the two concrete versions into the container so that they can be resolved later by the ToggleDispatcher registration:

public void ConfigureServices(IServiceCollection services)
{
    var toggleSource = new ToggleSource(/* ... */);

    services.AddTransient<HttpMessageDispatcher>();
    services.AddTransient<QueueMessageDispatcher>();

    services.AddTransient<IMessageDispatcher>(context => new ToggleDispatcher(
        () => toggleSource.IsActive(Toggles.AsyncMessageDispatch),
        context.GetService<QueueMessageDispatcher>(),
        context.GetService<HttpMessageDispatcher>())
    );
}

3. Dynamic(er) Configuration

We can take this a step further if we want a phased rollout of the new QueueMessageDispatcher, for example based on the sender address. In this case, we can create another decorator which uses the individual message to make the decision. The only difference to the original ToggleDispatcher is that the first argument now also provides a Message object:

public class MessageBasedToggleDispatcher : IMessageDispatcher
{
    private readonly Func<Message, bool> _isToggleActive;
    private readonly IMessageDispatcher _queueSender;
    private readonly IMessageDispatcher _httpSender;

    public MessageBasedToggleDispatcher(Func<Message, bool> isToggleActive, IMessageDispatcher queueSender, IMessageDispatcher httpSender)
    {
        _isToggleActive = isToggleActive;
        _queueSender = queueSender;
        _httpSender = httpSender;
    }

    public Task<SendResult> Send(Message message)
    {
        var chosen = _isToggleActive(message)
            ? _queueSender
            : _httpSender;

        return chosen.Send(message);
    }
}

The startup registration is modified to pass the message property we care about to the ToggleSource; the toggleSource.IsActive() call is responsible for deciding what to do with the key we pass in. Perhaps it takes a consistent hash of the address and compares it against a rollout threshold, or maybe it queries a whitelist of people the toggle is enabled for (a sketch of the hashing approach is shown after the registration below).

public void ConfigureServices(IServiceCollection services)
{
    var toggleSource = new ToggleSource(/* ... */);

    services.AddTransient<HttpMessageDispatcher>();
    services.AddTransient<QueueMessageDispatcher>();

    services.AddTransient<IMessageDispatcher>(context => new MessageBasedToggleDispatcher(
        message => toggleSource.IsActive(Toggles.AsyncMessageDispatch, message.SenderAddress),
        context.GetService<QueueMessageDispatcher>(),
        context.GetService<HttpMessageDispatcher>())
    );
}
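
As a rough sketch, the consistent-hash approach inside ToggleSource.IsActive might look something like this - the Toggle type and the GetRolloutPercentage lookup are assumptions for illustration, and any stable hash would do:

public bool IsActive(Toggle toggle, string key)
{
    // hypothetical lookup: what percentage of senders should get the new behaviour (0-100)
    var rolloutPercentage = GetRolloutPercentage(toggle);

    using (var md5 = MD5.Create())
    {
        var hash = md5.ComputeHash(Encoding.UTF8.GetBytes(key));
        var bucket = hash[0] % 100; // the same key always lands in the same bucket

        return bucket < rolloutPercentage;
    }
}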

Conclusion

This method of branching is extremely flexible, as it allows us to use toggles to replace feature implementations, but also gives us lots of places where we can add other decorators to add functionality to the pipeline. For example, we could add an auditing decorator or one which implements the outbox pattern - and the calling code, which depends only on IMessageDispatcher, doesn’t need to care.
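
For example, an auditing decorator is just another IMessageDispatcher wrapping whichever dispatcher the container hands it (the IAuditLog interface here is illustrative):

public class AuditingDispatcher : IMessageDispatcher
{
    private readonly IMessageDispatcher _inner;
    private readonly IAuditLog _audit;

    public AuditingDispatcher(IMessageDispatcher inner, IAuditLog audit)
    {
        _inner = inner;
        _audit = audit;
    }

    public async Task<SendResult> Send(Message message)
    {
        var result = await _inner.Send(message);

        // record what was sent and what happened, regardless of which
        // underlying dispatcher actually handled the message
        _audit.MessageSent(message, result);

        return result;
    }
}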

featuretoggles, c#, di, microservices

---

Configuring Consul DNS Forwarding in Alpine Linux

31 May 2019

Following on from the post the other day on setting up DNS forwarding to Consul with SystemD, I also wanted to show how to get Consul up and running under Alpine Linux, as it’s a little more awkward in some respects.

To start with, I am going to set up Consul as a service - I didn’t do this in the Ubuntu version, as there are plenty of useful articles about that already, but that is not the case with Alpine.

Run Consul

First, we need to get a version of Consul and install it into our system. This script downloads 1.5.1 from Hashicorp’s releases site, installs it to /usr/bin/consul, and creates a consul user and group to run the daemon with:

CONSUL_VERSION=1.5.1

curl -sSL https://releases.hashicorp.com/consul/${CONSUL_VERSION}/consul_${CONSUL_VERSION}_linux_amd64.zip -o /tmp/consul.zip

unzip /tmp/consul.zip
sudo install consul /usr/bin/consul

sudo addgroup -S consul
sudo adduser -S -D -h /var/consul -s /sbin/nologin -G consul -g consul consul

Next, we need to create the directories for the configuration and data to live in, and copy the init script and configuration file to those directories:

consul_dir=/etc/consul
data_dir=/srv/consul

sudo mkdir $consul_dir
sudo mkdir $data_dir
sudo chown consul:consul $data_dir

sudo mv /tmp/consul.sh /etc/init.d/consul
sudo chmod +x /etc/init.d/consul

sudo mv /tmp/consul.json $consul_dir/consul.json

The init script is pretty straightforward, but note that in this example I am running the agent in dev mode; don’t do this in production:

#!/sbin/openrc-run
CONSUL_LOG_FILE="/var/log/${SVCNAME}.log"

name=consul
description="A tool for service discovery, monitoring and configuration"
description_checkconfig="Verify configuration file"
daemon=/usr/bin/$name
daemon_user=$name
daemon_group=$name
consul_dir=/etc/consul
extra_commands="checkconfig"

start_pre() {
    checkpath -f -m 0644 -o ${SVCNAME}:${SVCNAME} "$CONSUL_LOG_FILE"
}

depend() {
    need net
    after firewall
}

checkconfig() {
    consul validate $consul_dir
}

start() {
    checkconfig || return 1

    ebegin "Starting ${name}"
        start-stop-daemon --start --quiet \
            -m --pidfile /var/run/${name}.pid \
            --user ${daemon_user} --group ${daemon_group} \
            -b --stdout $CONSUL_LOG_FILE --stderr $CONSUL_LOG_FILE \
            -k 027 --exec ${daemon} -- agent -dev -config-dir=$consul_dir
    eend $?
}

stop() {
    ebegin "Stopping ${name}"
        start-stop-daemon --stop --quiet \
            --pidfile /var/run/${name}.pid \
            --exec ${daemon}
    eend $?
}

Finally, a basic config file to launch consul is as follows:

{
    "data_dir": "/srv/consul/data",
    "client_addr": "0.0.0.0"
}

Now that all our scripts are in place, we can register Consul into the service manager, and start it:

sudo rc-update add consul
sudo rc-service consul start

You can check consul is up and running by using dig to get the address of the consul service itself:

dig @localhost -p 8600 consul.service.consul

Setup Local DNS with Unbound

Now that Consul is running, we need to configure a local DNS resolver to forward requests for the .consul domain to Consul. We will use Unbound as it works nicely on Alpine. It also has the wonderful feature of being able to send queries to a specific port, so no iptables rules are needed this time!

The config file (/etc/unbound/unbound.conf) is all default values, with the exception of the last 5 lines, which let us forward DNS requests to a custom, and insecure, location:

#! /bin/bash

sudo apk add unbound

(
cat <<-EOF
server:
    verbosity: 1
    root-hints: /etc/unbound/root.hints
    trust-anchor-file: "/usr/share/dnssec-root/trusted-key.key"
    do-not-query-localhost: no
    domain-insecure: "consul"
stub-zone:
    name: "consul"
    stub-addr: 127.0.0.1@8600
EOF
) | sudo tee /etc/unbound/unbound.conf

sudo rc-update add unbound
sudo rc-service unbound start

We can validate this works again by using dig, but this time removing the port specification to hit 53 instead:

dig @localhost consul.service.consul

Configure DNS Resolution

Finally, we need to update /etc/resolv.conf so that other system tools such as ping and curl can resolve .consul addresses. This is a little more hassle on Alpine, as there are no head files we can push our nameserver entry into. Instead, we use dhclient which will let us prepend a custom nameserver (or multiple) when the interface is brought up, even when using DHCP:

#! /bin/bash

sudo apk add dhclient

(
cat <<-EOF
option rfc3442-classless-static-routes code 121 = array of unsigned integer 8;
send host-name = gethostname();
request subnet-mask, broadcast-address, time-offset, routers,
        domain-name, domain-name-servers, domain-search, host-name,
        dhcp6.name-servers, dhcp6.domain-search, dhcp6.fqdn, dhcp6.sntp-servers,
        netbios-name-servers, netbios-scope, interface-mtu,
        rfc3442-classless-static-routes, ntp-servers;
prepend domain-name-servers 127.0.0.1;
EOF
) | sudo tee /etc/dhcp/dhclient.conf

sudo rm /etc/resolv.conf # hack due to dhclient making an invalid `chown` call.
sudo rc-service networking restart

The only thing of interest here is the little hack: we delete /etc/resolv.conf before restarting the networking service; if you don’t, you get errors about “chmod invalid option resource=…”.

We can verify everything works in the same way we did on Ubuntu: curl to both a .consul address and a public address:

$ curl -s -o /dev/null -w "%{http_code}\n" http://consul.service.consul:8500/ui/
200

$ curl -s -o /dev/null -w "%{http_code}\n" http://google.com
301

End

This was a bit easier to get started with than the Ubuntu version, as I knew what I was trying to accomplish this time - however, making a good init.d script was a bit more hassle, and the error from chmod took some time to track down.

infrastructure, consul, alpine

---

Configuring Consul DNS Forwarding in Ubuntu 16.04

29 May 2019

One of the advantages of using Consul for service discovery is that besides an HTTP API, you can also query it by DNS.

The DNS server listens on port 8600 by default, and you can query both A records and SRV records from it. SRV records are useful as they contain additional properties (priority, weight and port), and you can get multiple records back from a single query, letting you do client-side load balancing:

$ dig @localhost -p 8600 consul.service.consul SRV +short

1 10 8300 vagrant1.node.dc1.consul.
1 14 8300 vagrant2.node.dc1.consul.
2 100 8300 vagrant3.node.dc1.consul.

A records are also useful, as they mean we should be able to treat services registered in Consul like any other domain - but it doesn’t work:

$ curl http://consul.service.consul:8500
curl: (6) Could not resolve host: consul.service.consul

The reason for this is that the system’s built-in DNS resolver doesn’t know how to query Consul. We can, however, configure it to forward any *.consul requests to Consul.

Solution - Forward DNS queries to Consul

As I usually target Ubuntu based machines, this means configuring systemd-resolved to forward to Consul. However, we want to keep Consul listening on its default port (8600), and systemd-resolved can only forward requests to port 53, so we also need to configure iptables to redirect the requests.

The steps are as follows:

  1. Configure systemd-resolved to forward .consul TLD queries to the local consul agent
  2. Configure iptables to redirect 53 to 8600

So let’s get to it!

1. Make iptables persistent

IPTables configuration changes don’t persist through reboots, so the easiest way to solve this is with the iptables-persistent package.

Typically I am scripting machines (using Packer or Vagrant), so I configure the install to be non-interactive:

echo iptables-persistent iptables-persistent/autosave_v4 boolean false | sudo debconf-set-selections
echo iptables-persistent iptables-persistent/autosave_v6 boolean false | sudo debconf-set-selections

sudo DEBIAN_FRONTEND=noninteractive apt install -yq iptables-persistent

2. Update Systemd-Resolved

The file to change is /etc/systemd/resolved.conf. By default it looks like this:

[Resolve]
#DNS=
#FallbackDNS=8.8.8.8 8.8.4.4 2001:4860:4860::8888 2001:4860:4860::8844
#Domains=
#LLMNR=yes
#DNSSEC=no

We need to change the DNS and Domains lines - either editing the file by hand, or scripting a replacement with sed:

sudo sed -i 's/#DNS=/DNS=127.0.0.1/g; s/#Domains=/Domains=~consul/g' /etc/systemd/resolved.conf

The result of which is the file now reading like this:

[Resolve]
DNS=127.0.0.1
#FallbackDNS=8.8.8.8 8.8.4.4 2001:4860:4860::8888 2001:4860:4860::8844
Domains=~consul
#LLMNR=yes
#DNSSEC=no

By specifying Domains as ~consul, we are telling systemd-resolved to forward requests for the consul TLD to the server specified in the DNS line.

3. Configure Resolvconf too

For compatibility with some applications (e.g. curl and ping), we also need to update /etc/resolv.conf to specify our local nameserver. This is not done by editing the file directly!

Instead, we need to add nameserver 127.0.0.1 to /etc/resolvconf/resolv.conf.d/head. Again, I will script this, and as we need sudo to write to the file, the easiest way is to use tee to append the line and then run resolvconf -u to apply the change:

echo "nameserver 127.0.0.1" | sudo tee --append /etc/resolvconf/resolv.conf.d/head
sudo resolvconf -u

Configure iptables

Finally, we need to configure iptables so that when systemd-resolved sends a DNS query to localhost on port 53, it gets redirected to port 8600. We’ll do this for both TCP and UDP requests, and then use netfilter-persistent to make the rules persistent:

sudo iptables -t nat -A OUTPUT -d localhost -p udp -m udp --dport 53 -j REDIRECT --to-ports 8600
sudo iptables -t nat -A OUTPUT -d localhost -p tcp -m tcp --dport 53 -j REDIRECT --to-ports 8600

sudo netfilter-persistent save

Verification

First, we can test that both Consul and Systemd-Resolved return an address for a consul service:

$ dig @localhost -p 8600 consul.service.consul +short
10.0.2.15

$ dig @localhost consul.service.consul +short
10.0.2.15

And now we can try using curl to verify that we can resolve consul domains and normal domains still:

$ curl -s -o /dev/null -w "%{http_code}\n" http://consul.service.consul:8500/ui/
200

$ curl -s -o /dev/null -w "%{http_code}\n" http://google.com
301

End

There are also guides available on how to do this on Hashicorp’s website, covering other DNS resolvers too (such as BIND, Dnsmasq, Unbound).

infrastructure, consul

---

Running a Secure RabbitMQ Cluster in Nomad

06 Apr 2019

Last time I wrote about running a RabbitMQ cluster in Nomad, one of the main pieces of feedback I received was about the (lack of) security of the setup, so I decided to revisit this and write about how to launch a secure RabbitMQ node in Nomad.

The things I want to cover are:

  • Username and Password for the management UI
  • Secure value for the Erlang Cookie
  • SSL for Management and AMQP

As usual, the demo repository with all the code is available if you’d rather just jump into that.

Configure Nomad To Integrate With Vault

To manage the certificates and credentials I will use another Hashicorp tool called Vault, which provides Secrets As A Service. It can be configured for High Availability, but for the demo, we will just use a single instance on one of our Nomad machines.

Vault

We’ll update the Vagrant script used in the last post about Nomad Rabbitmq Clustering to add in a single Vault node. This is not suitable for using Vault in production; for that there should be a separate Vault cluster running somewhere, but as this post is focusing on how to integrate with Vault, a single node will suffice.

Once we have Vault installed (see the provision.sh script), we need to set up a few parts. First is a PKI (public key infrastructure), better known as a Certificate Authority (CA). We will generate a single root certificate and have our client machines (and optionally the host machine) trust that one certificate.

As the machines are running in Hyper-V with the Default Switch, we can use the built-in domain name, mshome.net, and provide our own certificates. This script is run as part of the Server (nomad1) provisioning script, but in a production environment it would be outside of this scope.

domain="mshome.net"
vault secrets enable pki
vault secrets tune -max-lease-ttl=87600h pki

vault write -field=certificate pki/root/generate/internal common_name="$domain" ttl=87600h \
    > /vagrant/vault/mshome.crt

vault write pki/config/urls \
    issuing_certificates="$VAULT_ADDR/v1/pki/ca" \
    crl_distribution_points="$VAULT_ADDR/v1/pki/crl"

vault write pki/roles/rabbit \
    allowed_domains="$domain" \
    allow_subdomains=true \
    generate_lease=true \
    max_ttl="720h"

sudo cp /vagrant/vault/mshome.crt /usr/local/share/ca-certificates/mshome.crt
sudo update-ca-certificates

If you don’t want scary warning screens in Firefox and Chrome, you’ll need to install the mshome.crt certificate into your trust store.

Next up, we have some policies we need in Vault. The first deals with what the Nomad Server(s) are allowed to do - namely handle tokens for themselves, and anything in the nomad-cluster role. A full commented version of this policy is available here.

path "auth/token/create/nomad-cluster" {
  capabilities = ["update"]
}

path "auth/token/roles/nomad-cluster" {
  capabilities = ["read"]
}

path "auth/token/lookup-self" {
  capabilities = ["read"]
}

path "auth/token/lookup" {
  capabilities = ["update"]
}

path "auth/token/revoke-accessor" {
  capabilities = ["update"]
}

path "sys/capabilities-self" {
  capabilities = ["update"]
}

path "auth/token/renew-self" {
  capabilities = ["update"]
}

As this policy mentions the nomad-cluster role a few times, let’s have a look at that also:

{
  "disallowed_policies": "nomad-server",
  "explicit_max_ttl": 0,
  "name": "nomad-cluster",
  "orphan": true,
  "period": 259200,
  "renewable": true
}

This allows a fairly long-lived, renewable token to be created. It also limits what the tokens are allowed to do, which can be expressed as either a block list (disallowed_policies) or an allow list (allowed_policies). In this case, I am letting the Clients access any policies except the nomad-server policy.

We can install both of these into Vault:

vault policy write nomad-server /vagrant/vault/nomad-server-policy.hcl
vault write auth/token/roles/nomad-cluster @/vagrant/vault/nomad-cluster-role.json

Nomad

Now that Vault is up and running, we should configure Nomad to talk to it. This is done in two places - the Server configuration, and the Client configuration.

To configure the Nomad Server, we update its configuration file to include a vault block, which contains a role name it will use to generate tokens (for itself and for the Nomad Clients), and an initial token.

vault {
    enabled = true
    address = "http://localhost:8200"
    task_token_ttl = "1h"
    create_from_role = "nomad-cluster"
    token = "some_token_here"
}

The initial token is generated by the ./server.sh script - how you go about doing this in production will vary greatly depending on how you are managing your machines.

The Nomad Clients also need the Vault integration enabled, but in their case only the location of Vault is needed, as the Server node(s) will provide tokens for the Clients to use.

vault {
    enabled = true
    address = "http://nomad1.mshome.net:8200"
}

Job Requirements

Before we go about changing the job itself, we need to write some data into Vault for the job to use:

  • Credentials: Username and password for the RabbitMQ Management UI, and the RABBITMQ_ERLANG_COOKIE
  • A policy for the job allowing Certificate Generation and Credentials access

Credentials

First off, we need to create a username and password to use with the Management UI. This can be done via the Vault CLI:

vault kv put secret/rabbit/admin \
    username=administrator \
    password=$(cat /proc/sys/kernel/random/uuid)

For the Erlang Cookie, we will also generate a GUID, but this time we will store it under a separate path in Vault so that it can be locked down separately from the admin username and password if needed:

vault kv put secret/rabbit/cookie \
    cookie=$(cat /proc/sys/kernel/random/uuid)

Job Policy

Following the principle of Least Privilege, we will create a policy for our rabbit job which only allows certificates to be generated, and rabbit credentials to be read.

path "pki/issue/rabbit" {
  capabilities = [ "create", "read", "update", "delete", "list" ]
}

path "secret/data/rabbit/*" {
  capabilities = [ "read" ]
}

This is written into Vault in the same way as the other policies were:

vault policy write rabbit /vagrant/vault/rabbit-policy.hcl

Rabbit Job Configuration

The first thing we need to do to the job is specify what policies we want to use with Vault, and what to do when a token or credential expires:

task "rabbit" {
  driver = "docker"

  vault {
    policies = ["default", "rabbit"]
    change_mode = "restart"
  }
  #...
}

Certificates

To configure RabbitMQ to use SSL, we need to provide it with values for 3 environment variables:

  • RABBITMQ_SSL_CACERTFILE - The CA certificate
  • RABBITMQ_SSL_CERTFILE - The Certificate for RabbitMQ to use
  • RABBITMQ_SSL_KEYFILE - the PrivateKey for the RabbitMQ certificate

So let’s add a template block to the job to generate and write out a certificate. It’s worth noting that line endings matter. You either need your .nomad file to use LF line endings, or make the template a single line and use \n to add the correct line endings in. I prefer to have the file with LF line endings.

template {
  data = <<EOH
{{ $host := printf "common_name=%s.mshome.net" (env "attr.unique.hostname") }}
{{ with secret "pki/issue/rabbit" $host "format=pem" }}
{{ .Data.certificate }}
{{ .Data.private_key }}{{ end }}
EOH
  destination = "secrets/rabbit.pem"
  change_mode = "restart"
}

As we want to use the Nomad node’s hostname within the common_name parameter of the secret, we need to use a variable to fetch and format the value:

{{ $host := printf "common_name=%s.mshome.net" (env "attr.unique.hostname") }}

This can then be used by the with secret block to fetch a certificate for the current host:

{{ with secret "pki/issue/rabbit" $host "format=pem" }}

Now that we have a certificate in the ./secrets/ directory, we can add a couple of volume mounts to the container, and set the environment variables with the container paths to the certificates. Note how the root certificate is coming from the /vagrant directory, not from Vault itself. Depending on how you are provisioning your machines to trust your CA, you will have a different path here!

config {
  image = "pondidum/rabbitmq:consul"
  # ...
  volumes = [
    "/vagrant/vault/mshome.crt:/etc/ssl/certs/mshome.crt",
    "secrets/rabbit.pem:/etc/ssl/certs/rabbit.pem",
    "secrets/rabbit.pem:/tmp/rabbitmq-ssl/combined.pem"
  ]
}

env {
  RABBITMQ_SSL_CACERTFILE = "/etc/ssl/certs/mshome.crt"
  RABBITMQ_SSL_CERTFILE = "/etc/ssl/certs/rabbit.pem"
  RABBITMQ_SSL_KEYFILE = "/etc/ssl/certs/rabbit.pem"
  #...
}

You should also notice that we are writing the secrets/rabbit.pem file into the container twice: The second write is to a file in /tmp as a workaround for the docker-entrypoint.sh script. If we don’t create this file ourselves, the container script will create it by combining the RABBITMQ_SSL_CERTFILE file and RABBITMQ_SSL_KEYFILE file, which will result in an invalid certificate, and a nightmare to figure out…

If the Vault integration in Nomad could write a single generated secret to two separate files, we wouldn’t need this workaround. Alternatively, you could make a custom container with a customised startup script to deal with this for you.

You can see the version of this file with only these changes here

Credentials

Now that we have things running with a certificate, it would be a great idea to start using the Erlang Cookie value and Management UI credentials we stored in Vault earlier. This is a super easy change to support in the Nomad file - we need to add another template block, but this time set env = true which will instruct nomad that the key-values in the template should be loaded as environment variables:

template {
    data = <<EOH
    {{ with secret "secret/data/rabbit/cookie" }}
    RABBITMQ_ERLANG_COOKIE="{{ .Data.data.cookie }}"
    {{ end }}
    {{ with secret "secret/data/rabbit/admin" }}
    RABBITMQ_DEFAULT_USER={{ .Data.data.username }}
    RABBITMQ_DEFAULT_PASS={{ .Data.data.password }}
    {{ end }}
EOH
    destination = "secrets/rabbit.env"
    env = true
}

The complete nomad file with both certificates and credentials can be seen here.

Running!

Now, all we need to do is start our new secure cluster:

nomad job run rabbit/secure.nomad

Client Libraries

Now that you have a secure version of RabbitMQ running, there are some interesting things which can be done with the client libraries. While you can just use the secure port, RabbitMQ also supports Peer Verification, which means that the client has to present a certificate for itself, and RabbitMQ will validate that both certificates are signed by a common CA.

This process can be controlled with two environment variables:

  • RABBITMQ_SSL_VERIFY set to either verify_peer or verify_none
  • RABBITMQ_SSL_FAIL_IF_NO_PEER_CERT set to true to require client certificates, false to make them optional

In .net land, if you are using MassTransit, the configuration looks like this:

var bus = Bus.Factory.CreateUsingRabbitMq(c =>
{
    c.UseSerilog(logger);
    c.Host("rabbitmq://nomad1.mshome.net:5671", r =>
    {
        r.Username("some_application");
        r.Password("some_password");
        r.UseSsl(ssl =>
        {
            ssl.CertificatePath = @"secrets/app.crt";
        });
    });
});

There are also lots of other interesting things you can do with SSL and RabbitMQ, such as using the certificate as authentication rather than needing a username and password per app. But you should be generating your app credentials dynamically with Vault too…

Wrapping Up

Finding all the small parts to make this work was quite a challenge. The Nomad gitter was useful when trying to figure out the certificates issue, and being able to read the source code of the Docker image for RabbitMQ was invaluable in making the certificates work.

If anyone sees anything I’ve done wrong, or could be improved, I’m happy to hear it!

infrastructure, vagrant, nomad, consul, rabbitmq, vault

---