Testing Immutable Infrastructure

01 Jan 2019

In my previous post, I glossed over one of the most important and useful parts of Immutable Infrastructure: Testability. There are many kinds of tests we can write for our infrastructure, but they should all be focused on the machine/service and maybe its nearest dependencies, not the entire system.

While this post focuses on testing a full machine (both locally in a VM, and remotely as an Amazon EC2 instance), it is also possible to do most of the same kind of tests against a Docker container. In fact, one of the tools used in this post supports building Docker containers as an output in parallel to the AMIs, so this can also assist in providing a migration path to/from Docker.

As an example, I will show how I built and tested a LogStash machine, including how to verify that the script to create the production machine is valid, that the machine itself has been provisioned correctly, and that the services inside work as expected.

I have published all the source code to GitHub. The examples in this post are all taken from the repository but might have a few bits removed just for readability. Check the full source out if you are interested!

Repository Structure and Tools

When it comes to building anything that you will have lots of, consistency is key to making it manageable. To that end, I have a small selection of tools that I use, and a repository structure I try and stick to. They are the following:

Vagrant - This is a tool for building and managing virtual machines. It can be backed by many different providers such as Docker, Hyper-V and VirtualBox. We’ll use this to build a local Linux machine to develop and test LogStash in. I use the Hyper-V provider, as that is what Docker for Windows also uses, and enabling Hyper-V prevents other virtualisation tools from running.

Packer - This tool provides a way to build machine images. Where Vagrant builds running machines, Packer builds the base images for you to boot, and can build multiple different ones (in parallel) from one configuration. We’ll use this to create our AMIs (Amazon Machine Images.)

Jest - This is a testing framework written in (and for) JavaScript and NodeJS. Whatever testing tool works best for your environment is what you should use, but I use Jest as it introduces minimal dependencies, is cross-platform, and has some useful libraries for doing things like diffing json.

The repository structure is pretty simple:

  • scripts/
  • src/
  • test/
  • build.sh
  • logstash.json
  • package.json
  • vagrantfile

The src directory is where our application code will live. If the application is compiled, the output goes to the build directory (which is not tracked in source-control.) The test directory will contain all of our tests, and the scripts directory will contain everything needed for provisioning our machines.

We’ll cover how each of these files is used as we go through the next sections.

Local Development

To create our virtual machine locally, we will use Vagrant. To tell Vagrant how to build our machine, we need to create a vagrantfile in our repository, which will contain the machine details and provisioning steps.

The machine itself has a name, CPU count, and memory amount specified. There is also a Hyper-V specific setting which allows us to use a differencing disk, which reduces both the startup time for the VM and the amount of disk space it uses on the host machine.

For provisioning, we tell Vagrant to run the two relevant scripts from the scripts directory.

Vagrant.configure("2") do |config|
    config.vm.box = "bento/ubuntu-16.04"

    config.vm.provider "hyperv" do |hv|
        hv.vmname = "LogStash"
        hv.cpus = 1
        hv.memory = 2048
        hv.linked_clone = true
    end

    config.vm.provision "shell", path: "./scripts/provision.sh"
    config.vm.provision "shell", path: "./scripts/vagrant.sh"
end

To keep things as similar as possible between our development machine and our output AMI, I keep as much of the setup as possible in one script: scripts/provision.sh. In the case of our LogStash setup, this means installing Java, LogStash and some LogStash plugins, and enabling the service on reboots:

#! /bin/bash

# add elastic's package repository
wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -
echo "deb https://artifacts.elastic.co/packages/6.x/apt stable main" | sudo tee -a /etc/apt/sources.list.d/elastic-6.x.list
sudo apt-get update

# install openjdk and set environment variable
sudo apt-get install openjdk-8-jre -y
JAVA=$(readlink -f $(which java) | sed "s:bin/java::")
echo "JAVA_HOME=$JAVA" | sudo tee --append /etc/environment

#install logstash and plugins
sudo apt-get install logstash -y
sudo /usr/share/logstash/bin/logstash-plugin install logstash-filter-uuid
sudo /usr/share/logstash/bin/logstash-plugin install logstash-filter-prune

sudo systemctl enable logstash.service

Vagrant will automatically mount its working directory into the VM under the path /vagrant. This means we can add a second provisioning script (scripts/vagrant.sh) to link the /vagrant/src directory to the LogStash configuration directory (/etc/logstash/conf.d), meaning we can edit the files on the host machine, and then restart LogStash to pick up the changes.

#! /bin/bash
sudo rm -rf /etc/logstash/conf.d
sudo ln -s /vagrant/src /etc/logstash/conf.d

sudo systemctl start logstash.service

Now that we have a vagrantfile, we can start the virtual machine with a single command. Note, Hyper-V requires administrator privileges, so you need to run this command in an admin terminal:

vagrant up

After a while, your new LogStash machine will be up and running. If you want to log into the machine and check files and processes etc., you can run the following command:

vagrant ssh

A command can also be passed to vagrant ssh to be executed inside the VM, which is how I usually trigger LogStash restarts (as it doesn’t seem to detect when I save the config files in the src directory):

vagrant ssh -c 'sudo systemctl restart logstash'

Deployment

To create the deployable machine image, I use Packer. The process is very similar to how Vagrant is used: select a base AMI, create a new EC2 machine, provision it, and save the result as a new AMI.

Packer is configured with a single json file, in this case named logstash.json. The file is split into four parts: variables, builders, provisioners, and post-processors. I won’t include the post-processors section as it’s not needed when building AMIs.

Variables

The variables property is for all configuration that you can pass to Packer. Their values can come from Environment Variables, CLI parameters, Consul, Vault, and others. In the LogStash example, there are three variables:

{
  "variables": {
    "aws_access_key": "",
    "aws_secret_key": "",
    "ami_users": "{{env `AMI_ACCOUNTS`}}"
  }
}

The aws_access_key and aws_secret_key are known names - unless we specify a value, they will automatically be read from your AWS config (in ~/.aws/) or, if running on EC2, from the EC2 instance profile.

The ami_users is a custom variable which will read the AMI_ACCOUNTS environment variable by default. This particular one is used so that I can grant access to the resulting AMI to multiple AWS accounts, which is useful if you’re running in an Organisation with multiple Accounts. For example, if the AMI is built in a common account, and will be deployed into dev, qa and prod accounts, then you would populate the AMI_ACCOUNTS as a CSV of account IDs.
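
For example, a sketch of how the accounts might be supplied via the environment before a build (the account IDs here are placeholders):

export AMI_ACCOUNTS="111111111111,222222222222,333333333333"
packer build logstash.json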

Builders

Packer can build many different kinds of machine image, but for this, we only need one: amazon-ebs.

{
  "builders": [
    {
      "type": "amazon-ebs",
      "access_key": "{{user `aws_access_key`}}",
      "secret_key": "{{user `aws_secret_key`}}",
      "region": "eu-west-1",
      "instance_type": "t2.micro",
      "source_ami_filter": {
        "filters": {
          "virtualization-type": "hvm",
          "name": "ubuntu/images/*ubuntu-xenial-16.04-amd64-server-*",
          "root-device-type": "ebs"
        },
        "owners": ["099720109477"],
        "most_recent": true
      },
      "ssh_username": "ubuntu",
      "ami_name": "logstash {{timestamp}}",
      "ami_users": "{{user `ami_users`}}"
    }
  ]
}

The two most interesting properties here are source_ami_filter and ami_users. The source_ami_filter works in a very similar manner to the AWS CLI’s describe-images --filters parameter, albeit in a more readable format. In this case, I am specifying that I want an ubuntu-xenial base, and as I want it to be an official Canonical image, I specify their account ID as the owner. I also specify the most_recent property, as this filter will return every version of the AMI which Canonical publish.

The ami_users is what lets me grant access to the AMI from other accounts (rather than just making it public). The property’s value should be an array, but Packer is smart enough to expand the CSV in the user variable into an array for us.

Provisioners

The provisioners array items are executed in the order they are specified. To set up the machine, I use the shell provisioner to create a temporary directory, then the file provisioner to upload the files in the src directory to that temporary directory. Finally a second shell provisioner uploads and runs the scripts/provision.sh and scripts/aws.sh files.

{
  "provisioners": [
    {
      "type": "shell",
      "inline": "mkdir -p /tmp/src"
    },
    {
      "type": "file",
      "source": "./src/",
      "destination": "/tmp/src"
    },
    {
      "type": "shell",
      "scripts": ["./scripts/provision.sh", "./scripts/aws.sh"]
    }
  ]
}

The aws.sh file is very small and does roughly the same thing as the vagrant.sh script, but rather than symlinking the /vagrant directory, it moves the uploaded src directory into the right location for LogStash:

#! /bin/sh

sudo rm /etc/logstash/conf.d/*
sudo cp -r /tmp/src/* /etc/logstash/conf.d

Note that this doesn’t start the LogStash service - this gets done by the UserData when we launch a new instance, as often we need to pass in additional configuration parameters, and don’t want the service running until that has been done.
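
As a sketch, the UserData for a new instance might look something like the following; the environment file and variable here are hypothetical, as the real values depend on what the pipeline configuration in src expects:

#! /bin/bash
# hypothetical example: UserData runs as root, so no sudo needed here
echo "LOGSTASH_ENVIRONMENT=production" >> /etc/default/logstash
systemctl start logstash.service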

Running

To create the AMI, we need to invoke Packer. If I am running Packer on a remote machine via SSH, I run it inside tmux, so that a disconnect doesn’t kill the process:

packer build -var "ami_users=111,222,333" logstash.json
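
If you’re not already inside a tmux session, a minimal sketch of that looks like this (the session name is arbitrary):

tmux new-session -s packer-build
packer build -var "ami_users=111,222,333" logstash.json
# detach with Ctrl-b d if needed; re-attach later with: tmux attach -t packer-build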

After a while, Packer will finish, leaving you with an output which will include the new AMI ID:

==> Builds finished. The artifacts of successful builds are:
--> amazon-ebs: AMIs were created:

eu-west-1: ami-123123123

We’ll get back to this output later when we create a build script that will also run our tests. Before we get to that, however, let’s look at how we can write tests which target both the local Vagrant machine and the AMI too.

Testing

To test the machines, I am using Jest. There isn’t anything particularly interesting going on in the package.json, other than a few babel packages being installed so that I can use ES6 syntax:

{
  "scripts": {
    "watch": "jest --watch",
    "test": "jest "
  },
  "devDependencies": {
    "babel-core": "^6.26.3",
    "babel-jest": "^23.6.0",
    "babel-preset-env": "^1.7.0",
    "jest": "^23.6.0",
    "regenerator-runtime": "^0.13.1"
  }
}

Packer Configuration Testing

There are a number of tests we can run to make sure our Packer configuration is valid before using it. This includes things like checking that the base AMI comes from a whitelisted source (such as our own accounts, Amazon, or Canonical). The test has to handle the possibility of multiple builders, and that some builders might not have a source_ami_filter. It also handles the case where no owner has been specified at all, which we also consider a “bad thing”:

// a sketch of the setup the tests below assume (the repository's actual helpers
// may differ slightly): the Packer template is loaded directly, and builderName
// just labels failures by the builder's name or type
import packer from "../logstash.json";

const builderName = builder => builder.name || builder.type;

const ourAccounts = [ "111111", "222222", "333333", "444444" ];
const otherOwners = [ "amazon", "099720109477" /*canonical*/ ];

describe("ami builder", () => {

  it("should be based on a whitelisted owner", () => {
    const allOwners = ourAccounts.concat(otherOwners);
    const invalidOwners = owners => owners.filter(owner => !allOwners.includes(owner));

    const amisWithInvalidOwners = packer.builders
      .filter(builder => builder.source_ami_filter)
      .map(builder => ({
        name: builderName(builder),
        invalidOwners: invalidOwners(builder.source_ami_filter.owners || [ "NO OWNER SPECIFIED" ])
      }))
      .filter(builders => builders.invalidOwners.length > 0);

    expect(amisWithInvalidOwners).toEqual([]);
  });

});

I also test that certain variables (ami_users) have been defined, and have been used in the right place:

describe("variables", () => {
  it("should have a variable for who can use the ami", () => {
    expect(packer.variables).toHaveProperty("ami_users");
  });

  it("should read ami_users from AMI_ACCOUNTS", () => {
    expect(packer.variables.ami_users).toMatch(
      /{{\s*env\s*`AMI_ACCOUNTS`\s*}}/
    );
  });
});

describe("ami builder", () => {
  it("should set the ami_user", () => {

    const invalidUsers = packer.builders
      .map(builder => ({
        name: builderName(builder),
        users: builder.ami_users || "NO USERS SPECIFIED"
      }))
      .filter(ami => !ami.users.match(/{{\s*user\s*`ami_users`\s*}}/));

    expect(invalidUsers).toEqual([]);
  });
})

Other tests you might want to add are that the base AMI is under a certain age, or that your AMI has certain tags included, or that it is named in a specific manner.
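
Some of those checks are easier to make against the built image than against the configuration file. As a sketch (this isn’t part of the example repository), the AWS CLI can report an AMI’s creation date and tags:

aws ec2 describe-images \
  --image-ids "$ami_id" \
  --region eu-west-1 \
  --query 'Images[0].{Created:CreationDate,Tags:Tags}'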

Machine Testing

Machine testing is for checking that our provisioning worked successfully. This is very useful, as subtle bugs can creep in when you don’t verify what happens.

For example, a machine I built copied a configuration directory to a target location, but the copy command was missing the -r flag, so when I later added a subdirectory, the machine failed as the referenced files didn’t exist.

So that the tests work with both the Vagrant and Packer built versions, we take the machine’s address, SSH username, and key path from the environment:

import { spawnSync } from "child_process";
import { createConnection } from "net";

// connection details come from the environment, so the same tests can
// target either the local Vagrant VM or an EC2 instance
const host = process.env.LOGSTASH_ADDRESS; // e.g. "172.27.48.28";
const user = process.env.LOGSTASH_SSH; // e.g. "vagrant" or "ubuntu";
const keyPath = process.env.LOGSTASH_KEYPATH; // ".vagrant/machines/default/hyperv/private_key";

We also define two helper methods: one to check if a TCP port is open, and one which uses SSH to execute a command inside the machine and read the response:

const execute = command => {
  const args = [`${user}@${host}`, `-i`, keyPath, command];
  const ssh = spawnSync("ssh", args, { encoding: "utf8" });
  const lines = ssh.stdout.split("\n");

  if (lines[lines.length - 1] === "") {
    return lines.slice(0, lines.length - 1);
  }
  return lines;
};

const testPort = port => new Promise((resolve, reject) => {
  const client = createConnection({ host: host, port: port });

  client.on("error", err => reject(err));
  client.on("connect", () => {
    client.end();
    resolve();
  });
});

We can then add some tests which check the files were written to the right place, that port 5044 is open, and port 9600 is closed:

describe("the machine", () => {

  it("should have the correct configuration", () => {
    const files = execute("find /etc/logstash/conf.d/* -type f");

    expect(files).toEqual([
      "/etc/logstash/conf.d/beats.conf",
      "/etc/logstash/conf.d/patterns/custom.txt"
    ]);
  });

  it("should be listening on 5044 for beats", () => testPort(5044));
  it("should not be listening on 9600", () => expect(testPort(9600)).rejects.toThrow("ECONNREFUSED"));
});

Of course, as we can execute any command inside the machine, we can check pretty much anything:

  • tail the LogStash log and see if it’s got the right contents
  • check if the service is started
  • check the service is enabled on boot
  • check the environment variables have been written to the right files (a few of these checks are sketched below)
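
A few of those checks, expressed as the shell commands you would pass to the execute helper (a sketch):

systemctl is-active logstash                     # prints "active" if the service is running
systemctl is-enabled logstash                    # prints "enabled" if it starts on boot
grep JAVA_HOME /etc/environment                  # the environment variable was written to the right file
tail -n 20 /var/log/logstash/logstash-plain.log  # recent log contents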

Application Testing

There are two styles of Application Testing: white-box and black-box. White-box tests run against the application inside the machine, using minimal external dependencies (preferably none at all), and black-box tests run against the application from outside the machine, using either direct dependencies or fakes.

It’s worth noting that both white-box and black-box tests are slow, mostly down to how slow LogStash is at starting up, although only giving it 1 CPU and 2Gb of RAM probably doesn’t help.

Whitebox Testing LogStash

To white-box test LogStash, I use a technique partially based on the Agolo LogStash Test Runner. The process for the tests is to run LogStash interactively (rather than as a service), send it a single event, record the output events, and compare them to an expected output.

The test cases are kept in separate folders, each with two files. The first is the input file, imaginatively called input.log, which contains one json encoded event per line. The format needs to match what FileBeat would send to LogStash. In this case, that means a few extra fields, and a message property containing a string of json. Formatted for readability, the object looks like this:

{
  "@timestamp": "2018-12-27T14:08:24.753Z",
  "beat": { "hostname": "Spectre", "name": "Spectre", "version": "5.3.0" },
  "fields": { "environment": "local", "log_type": "application" },
  "input_type": "log",
  "message": "{\"Timestamp\": \"2018-12-18T17:06:27.7112297+02:00\",\"Level\": \"Information\",\"MessageTemplate\": \"This is the {count} message\",\"Properties\": {\"count\": 4,\"SourceContext\": \"LogLines.GetOpenPurchasesHandler\",\"ApplicationName\": \"FileBeatTest\",\"CorrelationId\": \"8f341e8e-6b9c-4ebf-816d-d89c014bad90\",\"TimedOperationElapsedInMs\": 1000}}",
  "offset": 318,
  "source": "D:\\tmp\\logs\\single.log",
  "type": "applicationlog"
}

I also define an output.log, which contains the expected result(s), again one json encoded event per line. The example pipeline in the repository will emit two events for a given input, so this file contains two lines of json (again, newlines added for readability here):

{
  "source": "D:\\tmp\\logs\\single.log",
  "@version": "1",
  "fields": { "log_type": "application", "environment": "local" },
  "@timestamp": "2018-12-18T15:06:27.711Z",
  "offset": 318,
  "ApplicationName": "FileBeatTest",
  "host": "ubuntu-16",
  "type": "applicationlog",
  "CorrelationId": "8f341e8e-6b9c-4ebf-816d-d89c014bad90",
  "MessageTemplate": "This is the {count} message",
  "Level": "Information",
  "Context": "LogLines.GetOpenPurchasesHandler",
  "TimeElapsed": 1000,
  "Properties": { "count": 4 }
}
{
  "duration": 1000000,
  "timestamp": 1545145586711000,
  "id": "<generated>",
  "traceid": "8f341e8e6b9c4ebf816dd89c014bad90",
  "name": "LogLines.GetOpenPurchasesHandler",
  "localEndpoint": { "serviceName": "FileBeatTest" }
}

To enable sending the lines directly to LogStash (rather than needing to use FileBeat), we define an input.conf file, which configures LogStash to read json from stdin:

input {
  stdin { codec => "json_lines" }
}

And an output.conf file, which configures LogStash to write the output as json lines to a known file path:

output {
  file {
    path => "/tmp/test/output.log"
    codec => "json_lines"
  }
}

The tests need to be run inside the machine itself, so I created a script in the ./scripts directory which will do all the work, and can be run by the execute method in a Jest test. The script stops the LogStash service, copies the current configuration from the ./src directory and the replacement input.conf and output.conf files to a temporary location, and then runs LogStash once per test case, copying the result file to the test case’s directory.

#! /bin/bash

sudo systemctl stop logstash

temp_path="/tmp/test"
test_source="/vagrant/test/acceptance"

sudo rm -rf "$temp_path"/*
sudo mkdir -p $temp_path
sudo cp -r /vagrant/src/* $temp_path
sudo cp $test_source/*.conf $temp_path

find $test_source/* -type d | while read test_path; do
    echo "Running $(basename $test_path) tests..."

    sudo /usr/share/logstash/bin/logstash \
        "--path.settings" "/etc/logstash" \
        "--path.config" "$temp_path" \
        < "$test_path/input.log"

    sudo touch "$temp_path/output.log"   # create it if it doesn't exist (dropped logs etc.)
    sudo rm -f "$test_path/result.log"
    sudo mv "$temp_path/output.log" "$test_path/result.log"

    echo "$(basename $test_path) tests done"
done

sudo systemctl start logstash

To execute this, we use the beforeAll function to run it once. We also pass in Number.MAX_SAFE_INTEGER as the timeout, because by default beforeAll will time out after 5 seconds, and test.sh is slow as hell (as LogStash takes ages to start up).

Once the test.sh script has finished running, we load each test’s output.log and result.log files, parse each line as json, compare the objects, and print out the delta if the objects are not considered equal:

// assumed imports and helpers (a sketch - the repository may differ slightly);
// execute is the SSH helper shown earlier, and the jsondiffpatch package
// provides the DiffPatcher and console formatter used below
import fs from "fs";
import path from "path";
import { DiffPatcher, formatters } from "jsondiffpatch";

// read a file of json lines into an array of objects
const readFile = filePath => fs
  .readFileSync(filePath, "utf8")
  .split("\n")
  .filter(line => line !== "")
  .map(line => JSON.parse(line));

const source = "./test/acceptance";
const isDirectory = p => fs.lstatSync(p).isDirectory();

const cases = fs
  .readdirSync(source)
  .map(name => path.join(source, name))
  .filter(isDirectory);

describe("logstash", () => {
  beforeAll(
    () => execute("/vagrant/scripts/test.sh"),
    Number.MAX_SAFE_INTEGER);

  test.each(cases)("%s", directoryPath => {
    const expected = readFile(path.join(directoryPath, "output.log"));
    const actual = readFile(path.join(directoryPath, "result.log"));

    const diffpatch = new DiffPatcher({
      propertyFilter: (name, context) => {
        if (name !== "id") {
          return true;
        }

        return context.left.id !== "<generated>";
      }
    });

    const delta = diffpatch.diff(expected, actual);
    const output = formatters.console.format(delta);

    if (output.length) {
      console.log(output);
    }

    expect(output.length).toBe(0);
  });
});

Blackbox Testing LogStash

As the machine has ports open for FileBeat and will send its output to ElasticSearch, we can set up a fake HTTP server, send some log events via FileBeat to the VM, and check we receive the right HTTP calls to our fake server.

While looking into how to do this, I came across the lumberjack-protocol package on NPM, but unfortunately it only supports lumberjack v1, while FileBeat and LogStash now use v2, so you would have to use a local copy of FileBeat to do the sending.

Due to the complexity of implementing this, and the diminished return on investment (the other tests should be sufficient), I have skipped creating the Blackbox tests for the time being.

AMI Testing

The final phase! Now that we are reasonably sure everything works locally, we need to build our AMI and test that everything works there too, as it would be a shame to update an Auto Scaling Group with a new image which doesn’t work!

All that needs to happen to run the tests against an EC2 instance is to set the three environment variables we used with Vagrant to values appropriate for the EC2 instance. To do this, we’ll need the instance’s IP address, the username for SSH, and the private key for SSH authentication.

The first thing our build script needs to do is create the AMI. This is done in the same way as mentioned earlier, but with the slight difference of also piping the output to tee:

packer_log=$(packer build logstash.json | tee /dev/tty)
ami_id=$(echo "$packer_log" | tail -n 1 | sed 's/.*\(ami.*\)/\1/')

By using tee, we can pipe the build log from Packer to both the real terminal (/dev/tty), and to a variable called packer_log. The script then takes the last line and uses some regex to grab the AMI ID.

Next up, the script uses the AWS CLI to launch an EC2 instance based on the AMI, and stores its IP address and instance ID:

json=$(aws ec2 run-instances \
  --image-id "$ami_id" \
  --instance-type t2.small \
  --key-name "$keypair_name" \
  --region eu-west-1 \
  --subnet-id "$subnet_id" \
  --security-group-ids "$security_group_id" \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=logstash-verification}]' \
  --user-data "$userdata")

instance_id=$(echo "$json" | jq -r .Instances[0].InstanceId)
private_ip=$(echo "$json" | jq -r .Instances[0].PrivateIpAddress)

The IP Address is then used to set up the environment variables which the node test scripts use to locate the machine:

export LOGSTASH_ADDRESS="$private_ip"
export LOGSTASH_SSH="ubuntu"
export LOGSTASH_KEYPATH="~/.ssh/id_rsa"

npm run test

Finally, the script uses the Instance ID to terminate the instance:

aws ec2 terminate-instances \
  --instance-ids "$instance_id"

Wrapping Up

Hopefully, this (rather long) post is a useful introduction (!) to how I tackle testing Immutable Infrastructure. All of these techniques for testing the machine and application can be used for testing things like Docker containers too (and handily, Packer can be used to create Docker containers also).

As mentioned earlier, the repository is available here.

logstash, microservices, infrastructure, vagrant, packer, aws, testing

---

Code-free tracing with LogStash and Jaeger

22 Dec 2018

I wanted to show request charts (similar to the network tab in firefox) for requests across our microservices but wanted to do so in the least invasive way possible.

Firefox network timeline

We already use LogStash to collect logs from multiple hosts (via FileBeat) and forward them on to ElasticSearch, so perhaps I can do something to also output from LogStash to a tracing service.

app to filebeat to logstash to elasticsearch and opentracing

There are a number of tracing services available (AppDash, Jaeger, Zipkin), but unfortunately LogStash doesn’t have plugins for any of them or for OpenTracing. I picked Jaeger, as it seemed to be the easiest to get up and running, and also had a “Zipkin compatible” API too if you didn’t want to use the Jaeger one.

Setting up and running Jaeger in a production environment is way out of scope for this post, but for testing, I am just using the all-in-one Docker container:

docker run -d --name jaeger \
  -e COLLECTOR_ZIPKIN_HTTP_PORT=9411 \
  -p 5775:5775/udp \
  -p 6831:6831/udp \
  -p 6832:6832/udp \
  -p 5778:5778 \
  -p 16686:16686 \
  -p 14268:14268 \
  -p 9411:9411 \
  jaegertracing/all-in-one

The tracing APIs work in roughly the same way: a trace is identified by a traceid and is sub-divided into spans; spans have a unique id and can have a parent (span) id set. The trouble with sending data from LogStash is that I don’t have any parent IDs available - at least not without adding them to all our applications, and I’m not sure how I would propagate them correctly either.

So the first question is: can I push multiple spans without any parent IDs, and have Jaeger plot them as I would expect?

Is span.parentid required?

As the HTTP API documentation seems to be non-existent (other than “there is one!”), and most of the other docs mention the Thrift API, I decided to use that to test the theory. The Jaeger C# client library has a nice API to use, but I wanted to manually specify things (e.g. traceid, spanid, etc.) and the API has no way to do that, as pretty much everything is marked as internal.

Instead, I cloned the client repository, and added a test (which had access to all the internals I wanted):

public class Scratch
{
    private readonly HttpSender _sender;
    private readonly Tracer _tracer;
    private readonly TraceId _id;

    public Scratch(ITestOutputHelper output)
    {
        _sender = new HttpSender("http://localhost:14268/api/traces");
        _tracer = new Tracer.Builder("FirstService").Build();
        _id = TraceId.NewUniqueId(false);

        output.WriteLine(_id.ToString());
    }

    [Fact]
    public async Task Send()
    {
        var start = DateTime.UtcNow;

        await Create(start, start.AddSeconds(5));
        await Create(start, start.AddSeconds(3));

        await Create(start.AddSeconds(5), start.AddSeconds(10));
        await Create(start.AddSeconds(10), start.AddSeconds(20));
        await Create(start, start.AddSeconds(20));

        await _sender.FlushAsync(CancellationToken.None);
    }

    private async Task Create(DateTime start, DateTimeOffset finish)
    {
        var context = new SpanContext(_id, SpanId.NewUniqueId(), new SpanId(0), SpanContextFlags.None);
        var span = new Span(_tracer, "test", context, start, new Dictionary<string, object>(), null);
        span.Finish(finish);

        await _sender.AppendAsync(span, CancellationToken.None);
    }
}

The graph rendered how I hoped, proving that I don’t need to set the span’s parentid for everything with the same traceid to work correctly!

How to generate a consistent TraceID?

All our services pass along or generate a correlationID for each request, so my idea here was to do some conversion to put it into the traceid, which is an int64. Our correlationIDs are guids, which are 128 bits internally, so I figured I would take the least significant 8 bytes and use those. This can be done in C# like so:

var correlationID = Guid.NewGuid();
var traceID = new TraceId(BitConverter.ToInt64(correlationID.ToByteArray(), 8));

I noticed, however, that the TraceID structure in the Jaeger client can take two values: int64 low and int64 high, so I can use the entire correlationid.

How to ship from LogStash?

After reading through the LogStash Output Plugins docs, it looked like there were three possible ways to ship the events to Jaeger: writing a command line app invoked through either the exec or pipe outputs; writing a LogStash plugin; or doing something with the http output.

CLI adaptor

The idea of this method would be to write a CLI app which takes an event from LogStash and then sends it to Jaeger using the Thrift API. The app itself could be invoked using the exec or pipe outputs. I didn’t want to use this method for a few reasons:

Development: The Jaeger client has things I need hidden from the public surface, such as being able to specify the traceid, rather than generating a new one each time, so I would either have to reimplement the classes I need, use reflection to instantiate things, or try opening Pull Requests to expose more internals of the library.

Maintenance: I’d rather not have yet another thing to maintain if I can avoid it, not to mention all the additional things like repository location, build agents, etc.

Deployment: Our LogStash box is a plain Ubuntu box with nothing other than Java and LogStash installed on it. As I would probably end up using C# to write the app, I’d need to add even more dependencies.

Throughput: Jaeger (and the others) support sending traces in batches for better efficiency. Writing an app which supported that would involve writing the batching and caching, and the likelihood is I would mess it up, and increase the chances of lost messages.

LogStash plugin

This basically has the same problems as above, with the addition of needing to learn enough about Java to build something reliable; as this is a test, and no one else in my team really knows Java either, I ruled this out.

Http?

The Jaeger collector process can also expose a Zipkin compatible API, and while the documentation for that API is also terrible, I was able to find a Pull Request on GitHub with some structure defined. I figured I could do most of this directly with LogStash’s configuration DSL, assuming I can generate a uuid (the logstash-filter-uuid plugin) and remove all the extra stuff from the post body (the logstash-filter-prune plugin).

Building the LogStash message

The first step of my configuration is based on what we need to do to Serilog messages to make them a bit nicer in Kibana:

Pre-processing

json {
    source => "message"
}

date {
    match => [ "Timestamp", "ISO8601", "yyyy-MM-dd'T'HH:mm:ss.SSSSSSSZZ"]
    target => "@timestamp"
}

mutate {
    rename => {
        "[Properties][ApplicationName]" => "ApplicationName"
        "[Properties][CorrelationId]" => "CorrelationId"
        "[Properties][SourceContext]" => "Context"
        "[Properties][TimedOperationElapsedInMs]" => "TimeElapsed"
    }
}

The json block tells LogStash that everything in the message property is a json object, rather than a string.

The date block extracts the time the message was recorded by the application and writes it to the special @timestamp field. This is a good idea as, due to queuing and processing times, the time the message was written and the time it was processed by LogStash will differ.

Lastly, the mutate block is moving a bunch of fields out of the message properties, and up to the top level of the object. We query by things like ApplicationName a lot, and it’s nice not to have to write Properties.ApplicationName:blahblah all the time.

Splitting

Our messages contain a lot of data which for the purposes of tracing, we just don’t care about. Not only is it a waste of bandwidth to send all the data to Jaeger, but also there could be clashing fields, and if we ever decide to use a hosted version of Jaeger, I don’t want to be sending them all the data either.

So how do we remove everything from the message, but still keep it in the message for the other outputs?

LogStash has an interesting filter called clone. What it does is copy the message as many times as you specify, and then runs the remaining part of the pipeline once per message, including the original message. In code, this just looks like this:

clone {
    clones => [ "jaeger" ]
}

If we wanted to split the message more times, we can just add more types to the clones array; there will be one extra message produced for each entry in the array, which will also be added to the message as the type property so we can filter on it later.

We don’t want to send the type field itself to Jaeger (or ElasticSeach), so we move it from the message to the @metadata object, which doesn’t get passed to outputs.

mutate {
    rename => { "type" => "[@metadata][type]" }
}

To handle the messages differently, we can just use an if statement on the @metadata.type property:

if [@metadata][type] == "jaeger" {
    # do stuff to the jaeger message
} else {
    # process the original message
}

Building the Http Payload

The hardest part of getting this to work was handling the dates. Jaeger specifies that timestamps should come in microseconds since epoch, and duration should be in microseconds also. We need to use the ruby filter so we can run some code to do this, rather than using an inbuilt filter primitive.

The duration is easy enough, as we report TimeElapsed in milliseconds:

ruby {
    code => "event.set('duration', (event.get('TimeElapsed') * 1000).floor)"
}

To handle the timestamp we need to convert it to microseconds since the epoch. We can get the fractional seconds since the epoch by calling to_f on the @timestamp field, and then multiply by 1000 a couple of times to express it as microseconds. We also floor it to remove any leftover fraction:

ruby {
    code => "event.set('timestamp', (event.get('@timestamp').to_f * 1000 * 1000).floor)"
}

This on its own will work, but it won’t be right. We report log entries when an operation completes, which means our timestamp is the finish of the event, not the start. To solve this, we subtract the duration from the timestamp:

ruby {
    code => "event.set('timestamp', (event.get('@timestamp').to_f * 1000 * 1000).floor - event.get('duration'))"
}
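
As a worked example: a @timestamp of 2018-12-18T15:06:27.711Z is 1545145587.711 seconds since the epoch, which becomes 1545145587711000 after multiplying up to microseconds; subtracting a duration of 1000000 microseconds (one second) gives a span start time of 1545145586711000.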

Next, we need to create our span id. We need to generate a random id, consisting of letters and numbers, so the easiest way is to generate a uuid, and then tweak it to fit the Jaeger/Zipkin format. To do that we need to install the logstash-filter-uuid plugin:

/usr/share/logstash/bin/logstash-plugin install logstash-filter-uuid

There are a few things we need to do to generate a valid span id (for Jaeger):

  1. generate a uuid
  2. remove the dashes
  3. make it lowercase
  4. trim it to 16 chars

We use four filters to achieve this:

uuid {
    target => "id"
}
mutate {
    gsub => [ "id", "-", "" ]
    lowercase => [ "id" ]
}
mutate {
    gsub => [ "id", "(^.{1,16}).*$", "\1"]
}

The reason we use two mutate blocks is that the order of the same operation (gsub in this case) is not guaranteed within a single mutate block.

Next up, we need to tweak the traceid from its guid format into the format Jaeger expects (lowercase hex with no dashes):

mutate {
    gsub => [ "traceid", "-", "" ]
    lowercase => [ "traceid" ]
}

Next, we can populate some other context into the message by renaming the other fields we want. I discovered most of the fields and their formats from a combination of the thrift dtos and this pull request.

mutate {
    rename => {
        "CorrelationId" => "traceid"
        "Context" => "name"
        "ApplicationName" => "[localEndpoint][serviceName]"
    }
}

Finally, we want to trim all the other properties from our message, which we can do by using the logstash-filter-prune plugin:

prune {
    whitelist_names => [ "id", "traceid", "name", "timestamp", "duration", "localEndpoint" ]
}
mutate {
    remove_field => [ "@timestamp" ]
}

We have to explicitly remove the @timestamp property as the prune plugin seems to ignore it (it also ignores the @metadata property).

The complete filter code can be made a little more efficient by combining several of the mutate blocks, provided we keep things in the right order, as follows:

ruby {
    code => "event.set('duration', (event.get('TimeElapsed') * 1000).floor)"
}

ruby {
    code => "event.set('timestamp', (event.get('@timestamp').to_f * 1000 * 1000).floor - event.get('duration'))"
}

uuid {
    target => "id"
}

mutate {
    rename => {
        "CorrelationId" => "traceid"
        "Context" => "name"
        "ApplicationName" => "[localEndpoint][serviceName]"
    }
    remove_field => [ "@timestamp" ]
}

mutate {
    gsub => [ "id", "-", "" ]
    gsub => [ "traceid", "-", "" ]
    lowercase => [ "traceid", "id" ]
}

mutate {
    gsub => [ "id", "(^.{1,16}).*$", "\1"]
}

prune {
    whitelist_names => [ "id", "traceid", "name", "timestamp", "duration", "localEndpoint" ]
}

Publishing to Jaeger

When publishing to the Jaeger API, you have to send it an array of json objects. To achieve this in the http output plugin, we can enable the format => json_batch, which will cause the HTTP client to buffer some events and then send them all at once. So more efficiency, and no extra code on our part.

To help make testing easier, I publish to two different HTTP endpoints: the Jaeger docker container running on my host machine, and a Request Logger Bin, where I can inspect the request body if something is reported as a failure by Jaeger, and replay it (via the Insomnia REST client) to check the error messages returned from Jaeger.

output {
    if [@metadata][type] == "jaeger" {
        http {
            url => "http://192.168.0.115:9411/api/v2/spans"
            format => "json_batch"
            http_method => "post"
        }

        http {
            url => "http://requestloggerbin.herokuapp.com/bin/ff3da5b2-ec57-4611-ada8-b360ab38830c"
            format => "json_batch"
            http_method => "post"
        }
    } else {
        # normal message handling: elasticsearch etc.
        stdout { codec => rubydebug }
    }
}

I also kept an eye on LogStash’s own logs by tailing the log file:

tail -f /var/log/logstash/logstash-plain.log

Eventually, I had a working LogStash configuration and can view application traces in Jaeger.

Problems

One thing I cannot figure out how to do is sending the tags property to Jaeger. The Zipkin (and OpenTracing) APIs specify tags to be an object, e.g.

{
  "traceid": "1ee98d32185e41faba146c151ce8e27d",
  "id": "8377a6dae87947d7",
  "timestamp": 1545321238784000,
  "duration": 409000,
  "localEndpoint": { "serviceName": "PersonInformationService" },
  "name": "SomeOperation",
  "tags": {
    "CorrelationID": "1ee98d32-185e-41fa-ba14-6c151ce8e27d",
    "db.name": "test-blahblah"
  }
}

But LogStash wants the tags property to be an array of strings, and even using the remove_field on tags and then manually populating it doesn’t work. Currently, I don’t need anything in the tags, but it would be nice to have it working. I should ask on the Elastic forums perhaps.

Wrapping Up

This post turned out to be a lot longer than I was expecting, but hopefully, it was interesting. Next time I’ll try and go through how I do testing of immutable infrastructure, as well as local development of the base images.

jaeger, logstash, microservices, infrastructure

---

Against SemVer

16 Dec 2018

Well, for Applications & Services at least. For libraries, SemVer is the way to go, assuming you can agree on what a breaking change is defined as.

But when it comes to Applications (or SaaS products, websites, etc.) SemVer starts to break down. The problem starts with the most obvious: What is a breaking change? How about a minor change?

What’s in a change?

For example, if we were to change the UI of a web application in a way that caused no backend changes, from the user’s perspective it is probably a breaking change, but not from the developer’s perspective. What about changing a backend process, for example, the way the service is billed? How about adding a new step in the middle of an existing process?

These are all hard questions to answer, and I imagine there are many, many edge cases where it is not clear what level of change they represent.

Clash of the Versions

The next problem stems from the fact that we don’t (often) do Trunk Based Development for applications. We have a long-lived (1-2 weeks) feature branch, which might get pushed to a test environment multiple times as tasks are completed. If we SemVer these deployments, when a bug fix happens on the master branch, we can end up with a version clash, and can’t deploy.

branching, showing a clash of SemVer by having concurrent branches

While this is a problem, we can solve it easily - we can use the -pre or -beta suffix for the feature branch, and then remove the suffix and increment the version number when the feature is fully deployed. This, however, is adding a little more complexity to the process - mostly on the human side of things this time.

I would rather avoid the complexity (machine and human) entirely so instead opt for a different solution: Dates.

How about a Date?

Date stamps to the rescue! Our builds now use the following format:

<year>.<month>.<day>.<build_number>
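
As a sketch, assuming the CI system exposes a build number environment variable (BUILD_NUMBER here), generating the version is a one-liner:

version="$(date +%Y.%m.%d).${BUILD_NUMBER}"
echo "$version"   # e.g. 2018.12.16.42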

We did consider using the time (represented as .hhmm) instead of the build_number, but two builds triggered in the same minute would have clashed, whereas the build_number is guaranteed to be unique. By using an automatic format, we gain a few advantages:

  • No confusion on whether a change is a major, minor, or patch
  • No clashes on multiple branches
  • No human input required (marking things as pre or beta)
  • Answers the “when was this built” question

What about the libraries?

The libraries have a very different set of requirements, and one versioning scheme doesn’t seem to fit both very well, so there is no point trying to force the matter.

code, ci, cd

---

Stopping Caring...

08 Dec 2018

…about GitHub open source commit streak.

This is, I think, partially triggered by Marc Gravell’s post. I currently have a GitHub commit streak which has been going for 1878 days. The other night I realised that I don’t care about it any more, and more so, I’m not sure why I did to start with.

I didn’t even mean to start doing it. I just noticed one day that I had done something every day for a couple of weeks, and vaguely wondered how long I could keep that up for. It turns out the answer is “quite a while”. For the most part, it was easy to do - just a small commit a day, it only takes a few minutes right? But more and more often, I noticed that it was having a negative effect on what I was writing.

For instance, I wanted to learn about Nomad, but had been putting it off, as it wouldn’t involve making any public commits. So each time I came to start, I couldn’t until I had done a public commit, and once that commit was done, so was my motivation. Which meant learning Nomad didn’t happen for a long time.

It also affected what I did with old repositories on GitHub - I wanted to remove some which serve no useful purpose any more, but I couldn’t do that either, as that would remove their commits from the streak. Likewise, I have a new version of my TwelveFactor repository, but I really should remove the old version. But doing so would remove the commits from the streak too. So, as you might have guessed, I haven’t done that either. Maybe soon. I have a lot of writing to go with it too.

On the subject of writing, that also suffered. It takes a lot of effort to sit down and start writing a post, but usually, the words flow pretty smoothly once I am going. But knowing I need to spare some energy for the commit of the day makes it even harder to start writing, so I just haven’t been. There are a lot of drafts with bullet-point lists waiting though, so maybe they will begin to see the light of day soon.

The irony is not lost on me that making this post will increase the commit count (GitHub pages based blog), so I am not committing (or finishing) this post until at least a day has passed.

I decided on the 7th December 2018 not to make any commits. After that had happened - the act of deciding mind you, not the day itself - I felt a lot better. I do not need to do anything. I can go and finish the book a friend recommended me. I can spend my evenings learning about things I want to. Or just do nothing some evenings.

No more worrying about public commits. It’s just not worth it.

code, books, health

---

Microservices or Components

28 Oct 2018

One of the reasons people list for using MicroServices is that it helps enforce separation of concerns. This is usually achieved by adding a network boundary between the services. While this is useful, it’s not without costs; namely that you’ve added a set of new failure modes: the network. We can achieve the same separation of concerns within the same codebase if we put our minds to it. In fact, this is what Simon Brown calls a Modular Monolith, and DHH calls the Majestic Monolith.

We recently needed to expand an existing service to have some new functionality. The current process looks something like this, where the user has done something which will eventually return them a URL which can be clicked to get to a web page to see the results.

api call does some work, returns a result_url which points to a web interface

The new functionality is an additional authentication challenge which the user will need to complete before they can get to the final results page. The new process looks like this:

api call does work, makes a request to challenge API, passing the result_url as an argument.  The challenge-response returns a challenge_url, which is returned to the user instead of the return_url

Design Decisions

Currently, the challenge functionality will only be used by this one service, but there is a high probability that we will need it for other services in the future too. At this point we have a decision to make: do we keep this functionality in-process, or make a separate microservice for it?

Time To Live

The first trade-off is time: it is slightly quicker to make it in-process, but if we do want to use this from somewhere else later, we’ll need to extract it, which is more work. The key here is “if” - we don’t know for sure that other services will need this exact functionality.

If we keep the new API and UI within the existing API and UI projects, we also get some code reuse: the data store, data access tooling, permissions, and styles can all be reused. Also, all of our infrastructure such as logging and monitoring is already in place, which will save us some time too.

API Risk

We want to avoid deploying a service which then needs to undergo a lot of rework in the future if the second and third users of it have slightly different requirements. If we build it as a separate service now, will we be sure we are making something which is generic and reusable by other services? Typically you only get the answer to this question after the second or third usage, so it seems unlikely that we would get our API design perfect on the first attempt.

Technical Risks

If we are to go the separate service route, we are introducing new failure modes to the existing API. What if the challenge API is down? What if the request times out? Are we using HTTP or a Message Broker to communicate with it?

If we keep the service in-process to start with we can eliminate all of these concerns. Luckily, we tend to have very thin controllers and make use of Mediatr, so the actual implementation of how the remote call is made can be hidden in the message handler to a certain extent.

Technical Decisions

As alluded to in the Time To Live point, we can reuse the existing data store and data access code, but this is a tradeoff in itself: what if the current storage tech is not quite ideal for the new requirements?

If the current service makes use of a complex Entity Framework model, but the new service is so simple that Dapper makes more sense, do we introduce the new dependency or not? What if we wanted to migrate away from one datastore to another (e.g. removing all MongoDB usage in favour of Postgres), but this is already using Mongo? We’d be increasing our dependency on a datastore we are explicitly trying to migrate away from.

All this assumes we want to write the service in the same programming language as the existing service! In our case we do, but it’s worth considering if you have multiple languages in use already.

Finally, on the data store front, if we decide to extract this as a separate service later, we will have to take into account data migrations, and how we can handle them with little, if any, downtime.

The Decision

After weighing up all these points (and a few others), we decided to keep the service inside the existing services. The Challenge API will live in its own area in the current API, and likewise, the Challenge UI will live in its own area in the existing UI.

How do we go about keeping it all separated though?

  • Communication: we discuss all changes we want to make anyway, so the first line of defence against the code becoming tightly coupled is these discussions.
  • Pull Requests: someone will notice you are doing something which is reducing the separation, and a discussion about how to avoid this will happen.
  • Naming Conventions: the Challenge API shares no naming of properties with the existing API. For example, the current API passes in a results_url and results_id, but the Challenge API stores and refers to these as the redirect_url and external_id.
  • Readme: it’ll go into the repository’s readme file, along with any other notes which developers will find useful. The sequence diagrams we drew (with much more detail) will also go in here.

Technical Debt?

The final question on this decision is “Isn’t this technical debt we are introducing?”. The answer, I feel, is “no”; it feels much closer to applying the YAGNI Principle (You Ain’t Gonna Need It). While there is work in the backlog which could use a Challenge API at the moment, that doesn’t necessarily mean it will still be there next week, or that it won’t be pushed further back or changed later.

In the end, the meeting where we came up with this and drew things on the whiteboard together was productive, and likely much shorter than it took me to write all this down. We were able to resist the “cool hip microservice” trend and come up with a design which is pretty contained and composable with other systems in the future.

If after all this discussion we decided to go the MicroService route, I would still be happy with the decision, as we would have all this material to look back on and justify our choice, rather than waving our hands about and shouting “but microservices” loudly.

How do you go about designing systems? Microservice all the things? Monolith all the things? Or something in between which makes the most sense for the situation at hand?

architecture, microservices, design

---