I wanted to implement canary routing for some HTTP services deployed via Nomad the other day, but rather than splitting the traffic across containers by weight, I wanted to direct it based on a header.
My first choice of tech was Fabio, but it only supports routing by URL prefix, optionally combined with a route weight. While I was at JustDevOps in Poland, I heard about another router/load balancer which works in a similar way to Fabio: Traefik.
While Traefik doesn't directly support canary routing either, it is much more flexible than Fabio, allowing (among other things) request filtering based on HTTP headers. Traefik integrates with a number of container schedulers directly, but Nomad is not one of them. It does, however, support using the Consul service catalog, which makes it an almost drop-in replacement for Fabio.
So let’s get to the setup. As usual, there is a complete repository on GitHub: Nomad Traefik Canary Routing.
Nomad
As usual, I am using my Hashibox Vagrant base image, and provisioning it as a single Nomad server and client node, using this script. I won't dig into the setup in that script, as I've written about it a few times now.
Consul
Consul is already running on the Hashibox base, so we have no further configuration to do.
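If you want to double-check it is healthy before continuing, a couple of quick sanity checks can be run on the box (assuming the consul binary and dig are available there):

# list the cluster members Consul knows about
consul members

# confirm *.consul DNS resolution works
dig +short consul.service.consul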
Traefik
Traefik can be deployed as a Docker container, and either configured through a TOML file (yay, not yaml!) or with command line switches. As we only need a minimal configuration, I opted to use the command line.
The container exposes two ports we need to care about: 80 for incoming traffic to be routed, and 8080 for the UI; these are statically allocated to the host as 8000 and 8080 respectively for this demo.
The command line configuration used is as follows:
- --api - enables the UI.
- --consulcatalog - Traefik has two ways to use Consul: --consul uses the KV store for service definitions, while --consulcatalog uses Consul's service catalogue.
- --consulcatalog.endpoint=consul.service.consul:8500 - as Consul is not running in the same container as Traefik, we need to tell it where Consul is listening, and as we have DNS forwarding for *.consul domains, we can use the address consul.service.consul. If DNS forwarding were not available, you could use the Nomad variable ${attr.unique.network.ip-address} to get the IP of the host the task is running on.
- --consulcatalog.frontEndRule='' - disables the default rule; each service will need to specify a traefik.frontend.rule tag instead.
- --consulcatalog.exposedByDefault=false - lastly, we stop Traefik exposing all services registered in Consul; services will need the traefik.enable=true tag to be processed.
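For reference, the same configuration expressed as a TOML file would look something like the following. This is a sketch based on the Traefik 1.7 docs rather than something this demo uses, as we are sticking with command line switches:

# traefik.toml - roughly equivalent to the command line flags used below
[api]

[consulCatalog]
endpoint         = "consul.service.consul:8500"
frontEndRule     = ""
exposedByDefault = false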
The entire job file is listed below:
job "traefik" {
datacenters = ["dc1"]
type = "service"
group "loadbalancers" {
count = 1
task "traefik" {
driver = "docker"
config {
image = "traefik:1.7.12"
args = [
"--api",
"--consulcatalog",
"--consulcatalog.endpoint=consul.service.consul:8500",
"--consulcatalog.frontEndRule=''",
"--consulcatalog.exposedByDefault=false"
]
port_map {
http = 80
ui = 8080
}
}
resources {
network {
port "http" { static = 8000 }
port "ui" { static = 8080 }
}
memory = 50
}
}
}
}
We register the job into Nomad, and then start on the backend services we will route to:
nomad job run jobs/traefik.nomad
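Once the allocation is running, a quick way to confirm Traefik is up is to hit its health endpoint on the UI port (I believe /health is available in Traefik 1.7 when --api is enabled, returning some JSON stats):

curl http://traefik.service.consul:8080/health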
The Backend Services
To demonstrate the services can be routed to correctly, we can use the containersol/k8s-deployment-strategies Docker container. This image exposes an HTTP service which responds with the container's hostname and the content of the VERSION environment variable, something like this:
$ curl http://echo.service.consul:8080
# Host: 23351e48dc98, Version: 1.0.0
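If you want to poke at the image before involving Nomad, it can be run directly with Docker on your own machine; the container listens on 8080, and the host port 9000 here is arbitrary:

docker run --rm -d -p 9000:8080 -e VERSION=1.0.0 containersol/k8s-deployment-strategies
curl http://localhost:9000
# Host: <container id>, Version: 1.0.0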
We'll start by making a standard Nomad job for this container, and then update it to support canarying. The entire job is listed below:
job "echo" {
datacenters = ["dc1"]
type = "service"
group "apis" {
count = 3
task "echo" {
driver = "docker"
config {
image = "containersol/k8s-deployment-strategies"
port_map {
http = 8080
}
}
env {
VERSION = "1.0.0"
}
resources {
network {
port "http" { }
}
}
service {
name = "echo"
port = "http"
tags = [
"traefik.enable=true",
"traefik.frontend.rule=Host:api.localhost"
]
check {
type = "http"
path = "/"
interval = "5s"
timeout = "1s"
}
}
}
}
}
The only part of interest in this version of the job is the service stanza, which registers our echo service into Consul, with a few tags to control how it is routed by Traefik:
service {
  name = "echo"
  port = "http"

  tags = [
    "traefik.enable=true",
    "traefik.frontend.rule=Host:api.localhost"
  ]

  check {
    type     = "http"
    path     = "/"
    interval = "5s"
    timeout  = "1s"
  }
}
The traefik.enable=true tag allows this service to be handled by Traefik (as we set exposedByDefault=false in Traefik's configuration), and the traefik.frontend.rule=Host:api.localhost tag means that any traffic with the Host header set to api.localhost will be routed to the service.
We can now run the job in Nomad:
nomad job run jobs/echo.nomad
Once it is up and running, we'll get 3 instances of echo, which will be round-robin routed by Traefik:
$ curl http://traefik.service.consul:8000 -H 'Host: api.localhost'
# Host: 1ac8a49cbaee, Version: 1.0.0
$ curl http://traefik.service.consul:8000 -H 'Host: api.localhost'
# Host: 23351e48dc98, Version: 1.0.0
$ curl http://traefik.service.consul:8000 -H 'Host: api.localhost'
# Host: c2f8a9dcab95, Version: 1.0.0
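If the requests aren't routed as you expect, it helps to check what Traefik is actually seeing; the tags are readable straight out of Consul's catalog API (the jq part is optional, just for readability):

$ curl -s http://consul.service.consul:8500/v1/catalog/service/echo | jq '.[0].ServiceTags'

This should list the two traefik.* tags from the service stanza above.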
Now that we have working routing for the echo service, let's make it canaryable.
Canaries
To show canary routing, we will create a second version of the service to respond to HTTP traffic with a Canary header.
The first change to make is to add in the update stanza, which controls how the containers get updated when Nomad pushes a new version. The canary parameter controls how many instances of the task will be created for canary purposes (and must be less than the total number of containers). Likewise, the max_parallel parameter controls how many containers will be replaced at a time when a deployment happens.
group "apis" {
count = 3
+ update {
+ max_parallel = 1
+ canary = 1
+ }
task "echo" {
Next, we need to modify the service stanza to write different tags to Consul when a task is a canary instance, so that it does not get included in the "normal" backend routing group. Note that if we don't specify at least one value in canary_tags, Nomad will use the tags even for the canary version - an empty canary_tags = [] declaration is not enough!
    service {
      name = "echo"
      port = "http"

      tags = [
        "traefik.enable=true",
        "traefik.frontend.rule=Host:api.localhost"
      ]

+     canary_tags = [
+       "traefik.enable=false"
+     ]

      check {
Finally, we need to add a separate service stanza to create a second backend group, which will contain the canary versions. Note how this group has a different name and no tags, but does have a set of canary_tags.
service {
  name = "echo-canary"
  port = "http"

  tags = []

  canary_tags = [
    "traefik.enable=true",
    "traefik.frontend.rule=Host:api.localhost;Headers: Canary,true"
  ]

  check {
    type     = "http"
    path     = "/"
    interval = "5s"
    timeout  = "1s"
  }
}
The reason we need two service stanzas is that Traefik can only create backends based on the name of the service registered in Consul, and not from a tag in that registration. If we just used one service stanza, then the canary version of the container would be added to both the canary backend and the standard backend. I was hoping for traefik.backend=echo-canary to work, but alas, no.
The entire updated jobfile is as follows:
job "echo" {
datacenters = ["dc1"]
type = "service"
group "apis" {
count = 3
update {
max_parallel = 1
canary = 1
}
task "echo" {
driver = "docker"
config {
image = "containersol/k8s-deployment-strategies"
port_map {
http = 8080
}
}
env {
VERSION = "1.0.0"
}
resources {
network {
port "http" { }
}
memory = 50
}
service {
name = "echo-canary"
port = "http"
tags = []
canary_tags = [
"traefik.enable=true",
"traefik.frontend.rule=Host:api.localhost;Headers: Canary,true"
]
check {
type = "http"
path = "/"
interval = "5s"
timeout = "1s"
}
}
service {
name = "echo"
port = "http"
tags = [
"traefik.enable=true",
"traefik.frontend.rule=Host:api.localhost"
]
canary_tags = [
"traefik.enable=false"
]
check {
type = "http"
path = "/"
interval = "5s"
timeout = "1s"
}
}
}
}
}
Testing
First, we will change the VERSION environment variable so that Nomad sees the job as changed, and we get a different response from HTTP calls to the canary:
  env {
-   VERSION = "1.0.0"
+   VERSION = "2.0.0"
  }
Now we will update the job in Nomad:
nomad job run jobs/echo.nomad
If we run the status command, we can see that the deployment has started, and there is one canary instance running. Nothing further will happen until we promote it:
$ nomad status echo
ID     = echo
Status = running

Latest Deployment
ID          = 330216b9
Status      = running
Description = Deployment is running but requires promotion

Deployed
Task Group  Promoted  Desired  Canaries  Placed  Healthy  Unhealthy  Progress Deadline
apis        false     3        1         1       1        0          2019-06-19T11:19:31Z

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created    Modified
dcff2555  82f6ea8b  apis        1        run      running  18s ago    2s ago
5b2710ed  82f6ea8b  apis        0        run      running  6m52s ago  6m26s ago
698bd8a7  82f6ea8b  apis        0        run      running  6m52s ago  6m27s ago
b315bcd3  82f6ea8b  apis        0        run      running  6m52s ago  6m25s ago
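It is also worth confirming the canary registered itself under the echo-canary service name with the canary_tags applied; again, Consul's catalog API shows this directly (jq just for readability):

$ curl -s http://consul.service.consul:8500/v1/catalog/service/echo-canary | jq '.[0].ServiceTags'

The tags returned should be the canary_tags from the echo-canary service stanza, including the Headers: Canary,true rule.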
We can now test that the original containers still work, and that the canary version works:
$ curl http://traefik.service.consul:8000 -H 'Host: api.localhost'
# Host: 1ac8a49cbaee, Version: 1.0.0
$ curl http://traefik.service.consul:8000 -H 'Host: api.localhost'
# Host: 23351e48dc98, Version: 1.0.0
$ curl http://traefik.service.consul:8000 -H 'Host: api.localhost'
# Host: c2f8a9dcab95, Version: 1.0.0
$ curl http://traefik.service.consul:8000 -H 'Host: api.localhost' -H 'Canary: true'
# Host: 496840b438f2, Version: 2.0.0
Assuming we are happy with our new version, we can tell Nomad to promote the deployment, which will remove the canary and start a rolling update of the three tasks, one at a time:
nomad deployment promote 330216b9
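Had the canary been broken, we could have marked the deployment as failed instead, which stops the rollout and removes the canary instance. Either way, progress can be watched with the deployment status command:

# watch the rolling update progress
nomad deployment status 330216b9

# or mark the deployment as failed, stopping the rollout
nomad deployment fail 330216b9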
End
My hope is that the next version of Traefik will have better support for canary routing by header, meaning I could simplify the Nomad jobs a little. As it stands, though, this doesn't add much complexity to the jobs, and the approach can easily be captured in an Architecture Decision Record (or documented in a wiki page, never to be seen or read again!)