Isolated Docker Multistage Images

01 Nov 2020

Often when building applications, I will use a multistage docker build to keep the output container small and efficient, but I will run the build in two halves so that I can make use of the extra assets in the builder container. Something like this:

docker build \
  --target builder \
  -t builder:$GIT_COMMIT \
  .

docker run --rm \
  -v "$PWD/artefacts/tests:/artefacts/tests" \
  builder:$GIT_COMMIT \
  yarn ci:test

docker run --rm \
  -v "$PWD/artefacts/lint:/artefacts/lint" \
  builder:$GIT_COMMIT \
  yarn ci:lint

docker build \
  --cache-from builder:$GIT_COMMIT \
  --target output \
  -t app:$GIT_COMMIT \
  .

This usually works fine, but sometimes the .dockerignore file won’t have everything set correctly, and Docker will decide, when it runs the last build command, that it needs to rebuild the builder container too, which is pretty irritating.

The first solution is to try and figure out what you need to add to your .dockerignore file, which, depending on your repository structure and container usage, might be more hassle than it’s worth.
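
For this post’s example project, the entries would probably be something along these lines (a guess based on the snippets in this post, not a definitive list):

.git
node_modules
artefacts
dist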

The second solution is to prevent docker from invalidating the first layers at all, by splitting the build into separate files.

Splitting the Dockerfile

Let’s start with an example docker file, which is a generic yarn based application with multistage build configured:

FROM node:15.0.1-alpine3.12 as builder
WORKDIR /app

COPY . ./
RUN yarn install --frozen-lockfile && yarn cache clean

RUN yarn ci:build


FROM node:15.0.1-alpine3.12 as output
WORKDIR /app

COPY package.json yarn.lock /app/
RUN yarn install --frozen-lockfile --production && yarn cache clean

COPY --from=builder /app/dist /app

The first file will be our Dockerfile.builder, which is a direct copy-paste of the builder stage:

FROM node:15.0.1-alpine3.12 as builder
WORKDIR /app

COPY . ./
RUN yarn install --frozen-lockfile && yarn cache clean

RUN yarn ci:build

The second file can also be a direct copy paste, saved as Dockerfile.output, but it has a problem:

FROM node:15.0.1-alpine3.12 as output
WORKDIR /app

COPY package.json yarn.lock /app/
RUN yarn install --frozen-lockfile --production && yarn cache clean

COPY --from=builder /app/dist /app

We want to copy from a different container, not a different stage, and while the COPY command does let you specify another container in the --from parameter, we really want to specify which container to use at build time. The first attempt at solving this was using a build arg:

ARG builder_image
COPY --from=${builder_image} /app/dist /app

But alas, this doesn’t work either, as the --from parameter doesn’t support variables. The solution turns out to be that the FROM command does support parameterisation, so we can (ab)use that:

ARG builder_image
FROM ${builder_image} as builder

FROM node:15.0.1-alpine3.12 as output
WORKDIR /app

COPY package.json yarn.lock /app/
RUN yarn install --frozen-lockfile --production && yarn cache clean

COPY --from=builder /app/dist /app

Now our build script can use the --file and --build-arg parameters to force the right container (shown here as a diff against the original script):

docker build \
-  --target builder \
+  --file Dockerfile.builder \
  -t builder:$GIT_COMMIT \
  .

docker run --rm \
  -v "$PWD/artefacts/tests:/artefacts/tests" \
  builder:$GIT_COMMIT \
  yarn ci:test

docker run --rm \
  -v "$PWD/artefacts/lint:/artefacts/lint" \
  builder:$GIT_COMMIT \
  yarn ci:lint

docker build \
-  --cache-from builder:$GIT_COMMIT \
-  --target output \
+  --build-arg "builder_image=builder:$GIT_COMMIT" \
+  --file Dockerfile.output \
  -t app:$GIT_COMMIT \
  .

We can now safely modify the working directory to our heart’s content without worrying about invalidating the layer caches.

Better BASHing Through Technology

28 Aug 2020

I write a lot of bash scripts, for both my day job and my personal projects, and while they are functional, they always seem to lack the structure I want, especially compared to writing something in Go or C#. The main problem I have with bash scripts is that when I use functions, I lose the ability to log things.

For example, the get_config_path function prints the path to the configuration file, which is then consumed by the do_work function:

get_config_path() {
  echo "Finding Machine Configurtaion"

  if [ -n "$CONFIG_PATH" ] && [ -e "$CONFIG_PATH" ]; then
    # echo "Using Configuration from environment"
    echo "$CONFIG_PATH"
    return
  fi

  if [ -e "~/.config/demo/config.toml" ]; then
    # echo "Using Configuration directory file"
    echo "~/.config/demo/config.toml"
    return
  fi

  # echo "Unable to find configuration"
  exit 1
}

do_work() {
  local -r config=$(get_config_path)

  # actual work...
}

The problem is that if I include the echo statements which act as log lines (the commented-out ones above), the config variable in the do_work function will contain them too.
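
A minimal (hypothetical) reproduction makes the issue obvious:

greet() {
  echo "about to say hello"   # intended as a log line
  echo "hello"                # intended as the return value
}

result=$(greet)
echo "$result"                # prints both lines, not just "hello"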

In the past, this has caused me to use functions in bash very sparingly, usually only for short things that don’t need much logging. However, I was recently building some AMIs and happened to be browsing the Consul and Vault Terraform module source, which uses several bash scripts written in a different style, one that makes them vastly more maintainable.

So let’s have a look at the new structure I use, which is heavily based off these scripts.

Better Bash Scripts

Before we get to the implementation, make sure you are using ShellCheck: it provides static analysis of your scripts, pointing out things like missing quotes or incorrectly handled arrays. It has plugins for all your favourite editors too.

0. General Styles and Constructs

All variables should be declared readonly, and local where possible, to help prevent surprise values leaking in when other functions forget to do the same.

# top level
readonly SCRIPT_NAME="$(basename "$0")"

some_method() {
  # within a method
  local -r config_path="$1"
}

The first thing a function should do is assign its arguments to named variables, preferably matching the names they were passed as; this helps later when you are searching through a script for usages of “config_file”, as you don’t have to chase other names or aliases for the same value.

read_config() {
  local -r config_file="$1"
  local -r skip_validation="$2"

  # ...
}

invoke() {
  # ...

  read_config "$config_file" "$skip_validation"
}

1. Error Handling

It should go without saying, but you really need to start your scripts with the following:

#!/bin/bash

set -euo pipefail;

There are many better articles on exactly what these do, but suffice it to say:

  • -e causes the script to stop on errors
  • -u causes it to error when undefined variables are used
  • -o pipefail causes a non-zero exit code from any command in a pipeline to fail the script too (rather than just the last command), as the short demonstration below shows
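
A quick way to see the pipefail behaviour for yourself:

# without pipefail: the pipeline's status is that of the last command, so -e does nothing
bash -c 'set -e; false | true; echo "still running"'

# with pipefail: the failing first command fails the whole pipeline, so -e stops the script before the echo
bash -c 'set -eo pipefail; false | true; echo "still running"'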

2. Logging

The real advantage of this structure is that we get to have log statements! This is achieved by writing all logging to stderr instead of stdout. We use a standardised log function across all the scripts, which includes the script’s name so that when one script calls another, you can see which one wrote each log line:

readonly SCRIPT_NAME="$(basename "$0")"

log() {
  local -r level="$1"
  local -r message="$2"
  local -r timestamp=$(date +"%Y-%m-%d %H:%M:%S")

  >&2 echo -e "${timestamp} [${level}] [$SCRIPT_NAME] ${message}"
}

Invoking the function looks like log "INFO" "Some status" or log "WARN" "Something concerning", etc.

3. Error Checking

We have some standard assertion functions which are used by the script when starting up to validate arguments:

assert_not_empty() {
  local -r arg_name="$1"
  local -r arg_value="$2"

  if [[ -z "$arg_value" ]]; then
    log "ERROR" "The value for '$arg_name' cannot be empty"
    exit 1
  fi
}

assert_is_installed() {
  local -r name="$1"

  if [[ ! $(command -v "$name") ]]; then
    log "ERROR" "The binary '$name' is required by this script but is not installed or in the system's PATH."
    exit 1
  fi
}

4. Argument parsing

When scripts need to take parameters, I prefer to use long-flag style, as it is a little more readable for people checking invocations again in the future. The parsing function is always called run; it is the last function defined, and is invoked immediately after its definition, passing in all script arguments (run "$@"):

run() {
  local namespace=""
  local suffix=""
  local dry_run="false"

  while [[ $# -gt 0 ]]; do
    local key="$1"

    case "$key" in
      --namespace)
        namespace="$2"
        shift
        ;;
      --suffix)
        assert_not_empty "$key" "$2"
        suffix="$2"
        shift
        ;;
      --dry-run)
        dry_run="true"
        ;;
      --help)
        print_usage
        exit
        ;;
      *)
        log "ERROR" "Unrecognized argument: $key"
        print_usage
        exit 1
        ;;
    esac

    shift
  done

  # mandatory flag validation
  assert_not_empty "--namespace" "$namespace"

  # make sure tools are installed
  assert_is_installed "vault"
  assert_is_installed "openssl"
  assert_is_installed "jq"

  # do the work!
  local -r cert=$(generate_cert "$suffix")

  store_cert "$namespace" "$cert" "$dry_run"

}

run "[email protected]"

The validation uses the assert_not_empty function defined above in two ways: after the while loop, to check that mandatory values have been filled in, and within the case statement, to check the values given to optional flags.

We also use assert_is_installed to validate that utilities we need are installed, such as vault, openssl, and jq.

The print_usage function is just a set of echo statements listing all the flags, plus an example invocation:

print_usage() {
  echo
  echo "Usage: $SCRIPT_NAME [OPTIONS]"
  echo
  echo "This script creates a new certiticate, and it installs it into the right namespace"
  echo
  echo "Options:"
  echo
  echo -e "  --namespace\tThe namespace to install the certificate in"
  echo -e "  --suffix\tAn optional suffix for the hostname"
  echo -e "  --dry-run\tDon't install the certificate"
  echo
  echo "Example:"
  echo
  echo "  $SCRIPT_NAME --namespace test --dry-run"
}

Usage

I keep a single template file with all of this written into it, and new scripts start off with a copy-paste of the template. Could it be DRYer? Sure, but then I have to deal with dependency management, and it’s just not worth the hassle and overhead.
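
A stripped-down sketch of such a template, assembled from the pieces above, might look something like this (the real one would also include the assert functions and a fuller print_usage):

#!/bin/bash

set -euo pipefail;

readonly SCRIPT_NAME="$(basename "$0")"

log() {
  local -r level="$1"
  local -r message="$2"
  local -r timestamp=$(date +"%Y-%m-%d %H:%M:%S")

  >&2 echo -e "${timestamp} [${level}] [$SCRIPT_NAME] ${message}"
}

print_usage() {
  echo
  echo "Usage: $SCRIPT_NAME [OPTIONS]"
  echo
}

run() {
  while [[ $# -gt 0 ]]; do
    local key="$1"

    case "$key" in
      --help)
        print_usage
        exit
        ;;
      *)
        log "ERROR" "Unrecognized argument: $key"
        print_usage
        exit 1
        ;;
    esac

    shift
  done

  log "INFO" "Nothing to do yet"
}

run "$@"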

Sharing Docker Layers Between Build Agents

14 May 2020

Recently, I noticed that when we pull a new version of our application’s docker container, it fetches all layers, not just the ones that change.

The problem is that we use ephemeral build agents, so each version of the application is built on a different agent and Docker has no existing layers to reuse. While we can pull the published container before we run the build, this only helps with the final stage of the build. We want to cache the other stages too, as the earlier layers don’t change often and can be quite slow to build.
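
For reference, the naive version of the build, which only warms the cache for the final stage, looks something like this:

docker pull app:latest || true

docker build \
    --cache-from=app:latest \
    -t app:$COMMIT_SHORT \
    -t app:latest \
    .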

We can achieve this by tweaking how we build our stages, which will also allow some other interesting optimisations.

The Dockerfile

An example dockerfile is below. There are two stages, builder and prod. In the case we are looking at, both the OS packages and application dependencies rarely change, but can take quite a while to install.

FROM node:14.2.0-alpine3.11 AS builder
WORKDIR /app

RUN apk add --no-cache make gcc g++ python

COPY package.json yarn.lock ./
RUN yarn install --no-progress --frozen-lockfile && \
    yarn cache clean

COPY ./src ./src
RUN yarn build


FROM node:14.2.0-alpine3.11 AS prod
WORKDIR /app

COPY package.json yarn.lock ./

RUN yarn install --production --no-progress --frozen-lockfile && \
    yarn cache clean

COPY --from=builder /app/dist ./dist

CMD ["yarn", "start"]

The first step is to try and pull both :builder and :latest images. We append || true as the images might not exist yet, and we want the build to pass if they don’t!

docker pull app:builder || true
docker pull app:latest || true

Now that we have the application images locally, we can proceed to building the :builder stage. We tag it twice: once with just app:builder and once with the short-commit that built it.

docker build \
    --cache-from=app:builder \
    --target builder \
    -t app:builder-$COMMIT_SHORT \
    -t app:builder \
    .

Now that we have built our builder stage, we can use this to do lots of other things which require both dependencies and devDependencies, such as running tests and linters, and we could even distribute these tasks to multiple other machines if we wanted extra parallelism:

docker run --rm -it app:builder-$COMMIT_SHORT yarn test
docker run --rm -it app:builder-$COMMIT_SHORT yarn test:integration
docker run --rm -it app:builder-$COMMIT_SHORT yarn lint

Once we are happy with our tests, we can build the production container, using the --cache-from directive twice: once with the builder image we just created, and once with the latest version of our application. Note that the order of the --cache-from parameters matters; this won’t work if you specify app:latest before app:builder!

docker build \
    --cache-from=app:builder \
    --cache-from=app:latest \
    -t app:$COMMIT_SHORT \
    -t app:latest \
    .

Now we can publish everything. We always publish the commit tagged version so that separate branch builds can be fetched and tested, and if the branch is master, we publish both the :builder and :latest tags:

docker push app:$COMMIT_SHORT

if [ "$BRANCH" == "master" ]; then
    docker push app:builder
    docker push app:latest
fi

The full build script looks like this:

docker pull app:builder || true
docker pull app:latest || true

docker build \
    --cache-from=app:builder \
    --target builder \
    -t app:builder-$COMMIT_SHORT \
    -t app:builder \
    .

# run these in parallel
docker run --rm -it app:builder-$COMMIT_SHORT yarn test
docker run --rm -it app:builder-$COMMIT_SHORT yarn test:integration
docker run --rm -it app:builder-$COMMIT_SHORT yarn lint

docker build \
    --cache-from=app:builder \
    --cache-from=app:latest \
    -t app:$COMMIT_SHORT \
    -t app:latest \
    .

docker push app:$COMMIT_SHORT

if [ "$BRANCH" == "master" ]; then
    docker push app:builder
    docker push app:latest
fi

Effects

By publishing both our :builder and :latest tags, we can effectively share the layer caches for all build stages across all build agents. As the layers are shared, pulling the images at the beginning of the builds is pretty fast, and the publishes at the end of the build are very, very fast.

The real benefit comes with our monolith: on deployment, only a small layer needs to be pulled rather than all of the layers, which speeds up our deployments by minutes per host.

Service Mesh with Consul Connect (and Nomad)

04 May 2020

When it comes to implementing a new feature in an application’s ecosystem, I don’t like spending my innovation tokens unless I have to, so I try not to add new tools to my infrastructure unless I really need them.

The same approach applies when I want, need, or have been told to implement a Service Mesh. This means I don’t instantly set up Istio. Not because it’s bad - far from it - but because it’s extra complexity I would rather avoid unless I need it.

But what alternatives are there?

In most large systems I have been involved with, Consul has already been deployed, usually for Service Discovery, Feature Toggles, its key-value store, or distributed locking. As Consul has Service Mesh functionality built in, why not use that?

So let’s dive into setting up a Consul Connect based Service Mesh.

Implementing

The demo for this is made up of two parts (taken from HashiCorp’s consul demo repo): a counter and a dashboard. The counter listens for HTTP requests and will return the number of requests it’s handled. The dashboard polls the counter and displays the current count.

All the source code for the demo is in the Consul Connect Example Repository.

Clone the repository, and run the build script to create the apps:

git clone https://github.com/Pondidum/consul-connect-nomad-demo
cd consul-connect-nomad-demo
./apps/build.sh

Local Run

Run the apps locally to prove they work, in two separate terminals:

PORT=9001 ./apps/bin/counter
PORT=9002 ./apps/bin/dashboard

Open http://localhost:9002 to see the counter running.

Start A Cluster

Now that we have established our apps actually start, we can create a small Consul cluster. I am using my Hashibox to do this, so you’ll need libvirt and Vagrant installed.

Running vagrant up will spawn three machines that form a Consul cluster we can experiment in. Once it is up and running, we can manually register the two applications into Consul’s service mesh to check that our in-cluster communication works.
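
There is nothing special about the invocation itself; from the repository root (the --provider flag is only needed if libvirt is not already your default provider):

vagrant up --provider=libvirt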

First, the counter service. The script writes a service definition into consul, which, by specifying the connect stanza, indicates this service is to be included in the service mesh. Once this is done, the counter is started (and sent to the background), and a consul connect proxy is started for this service:

curl --request PUT --url http://localhost:8500/v1/agent/service/register \
  --data '{
    "name": "counter",
    "port": 9001,
    "connect": {
      "sidecar_service": {}
    }
  }'

PORT=9001 /vagrant/apps/bin/counter &

consul connect proxy -sidecar-for counter

We can run this script in a new terminal by running this command:

vagrant ssh one -c '/vagrant/scripts/counter.sh'

Finally, we start the dashboard. The script is very similar, in that we write a service definition into consul, start the service, and run a proxy. The only notable difference is the service registration payload itself:

{
  "name": "dashboard",
  "port": 9002,
  "connect": {
    "sidecar_service": {
      "proxy": {
        "upstreams": [
          { "destination_name": "counter", "local_bind_port": 8080 }
        ]
      }
    }
  }
}

As before, it registers a service and the port it will be listening on, but in the connect stanza, we specify that we want to connect to the counter, and that we want to talk to it on localhost:8080.
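
Putting that together, the dashboard script looks roughly like this (paraphrased from the repository; I am assuming the app reads the counter’s address from the COUNTING_SERVICE_URL environment variable, as the Nomad job later does):

curl --request PUT --url http://localhost:8500/v1/agent/service/register \
  --data '{
    "name": "dashboard",
    "port": 9002,
    "connect": {
      "sidecar_service": {
        "proxy": {
          "upstreams": [
            { "destination_name": "counter", "local_bind_port": 8080 }
          ]
        }
      }
    }
  }'

PORT=9002 COUNTING_SERVICE_URL="http://localhost:8080" /vagrant/apps/bin/dashboard &

consul connect proxy -sidecar-for dashboard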

In a new terminal, you can run this script like so:

vagrant ssh two -c '/vagrant/scripts/dashboard.sh'

Now that both are up and running, you can open a browser to the dashboard and see it working: http://two.karhu.xyz:9002. Once you are satisfied, you can stop the services by hitting ctrl+c in both terminals…or try running a second counter or dashboard on the third vagrant machine (vagrant ssh three -c '/vagrant/scripts/dashboard.sh')

Nomad

Now that we have seen how to run the services manually, let’s see how easy it is to use the service mesh with Nomad.

There are two nomad job definitions in the included project, so let’s look at the counter’s first:

job "counter" {
  datacenters = ["dc1"]

  group "api" {
    count = 3

    network {
      mode = "bridge"
    }

    service {
      name = "count-api"
      port = "9001"

      connect {
        sidecar_service {}
      }
    }

    task "counter" {
      driver = "exec"

      config {
        command = "/vagrant/apps/bin/counter"
      }

      env {
        PORT = 9001
      }
    }
  }
}

The network stanza is set to bridge mode, which creates an isolated network shared only by the tasks in the group. In our case, that is a single counter task and its proxy.

The service stanza replicates the functionality we had when writing a service registration into Consul. By specifying the connect block, Nomad knows that it also needs to start a proxy based on the service stanza’s settings, and will handle starting and stopping that proxy for us.

The task "counter" block uses the exec driver to run the counter app natively on the host, but docker, java, and others are available too.

To run this in our Nomad cluster, we can use the nomad CLI:

export NOMAD_ADDR="http://one.karhu.xyz:4646"

nomad job run apps/counter/counter.nomad

The dashboard’s Nomad job is very similar:

job "dashboard" {
  datacenters = ["dc1"]

  group "dashboard" {
    network {
      mode = "bridge"

      port "http" {
        to     = 9002
      }
    }

    service {
      name = "count-dashboard"
      port = 9002

      connect {
        sidecar_service {
          proxy {
            upstreams {
              destination_name = "count-api"
              local_bind_port  = 8080
            }
          }
        }
      }
    }

    task "dashboard" {
      driver = "exec"

      config {
        command = "/vagrant/apps/bin/dashboard"
      }

      env {
        PORT = "${NOMAD_PORT_http}"
        COUNTING_SERVICE_URL = "http://${NOMAD_UPSTREAM_ADDR_count_api}"
      }
    }
  }
}

The network block this time also specifies that we want to expose our service publicly. As we don’t have static = 9002 in the port definition, Nomad will assign one at random (this is better! It avoids port clashes when multiple tasks run on the same node); we do, however, specify that it maps to 9002. The rest of the file can use the Nomad variable NOMAD_PORT_http to get this port number, so we don’t have to copy-paste the number everywhere. Similarly, the sidecar_service stanza exposes a variable called NOMAD_UPSTREAM_ADDR_<destination_name>, which we can use for the dashboard task’s environment variable values. This means we should only ever need to specify a port in one location in a Nomad file.

As with the counter, we can run the job using the CLI:

nomad job run apps/counter/dashboard.nomad

If we want to get the address and port the dashboard is actually running at, it is easiest to go through the UI, but you can also get the information from the console using the Nomad CLI and jq:

allocation_id=$(nomad alloc status -json | jq -r '.[] | select(.JobID == "dashboard") | .ID')

nomad alloc status -json "$allocation_id" \
  | jq -r '.AllocatedResources.Shared.Networks[0] | ( "http://" + .IP + ":" + (.DynamicPorts[] | select(.Label == "http") | .Value | tostring))'

Wrapping Up

With Consul Connect’s supported APIs, there is great flexibility in how you can implement your service mesh; through definition files, through API requests, or through the container orchestrator directly. Couple this with Consul already being in use in most systems I have been involved with, and hopefully you can see why it makes a great way of having a Service Mesh.

Observability Without Honeycomb

15 Mar 2020

Before I start on this, I want to make it clear that if you can buy Honeycomb, you should. Outlined below is how I started to add observability to an existing codebase which already had the ELK stack available, and was unable to use Honeycomb. My hope, in this case, is that I can demonstrate how much value observability gives, and also show how much more value you would get with an excellent tool, such as Honeycomb.

With that said, what is observability, how is it different to logging (and metrics), and why should you care?

If you already know, or would rather skip to the implementation, jump to Implementing with Elastic Search.

What is it?

The term Observability comes from systems engineering, and refers to how well you can determine the behaviour of the entire system from its outputs. In our case, this means the events we emit while processing requests. Looking at (my) hierarchy of monitoring, you can see that it starts with logging, with the next steps up being structured logs, then centralised structured logs (no more SSHing into random hosts), and finally events:

logs < structured logs < centralised structured logs < events (observability)

The Problem with Logs and Metrics

With logs, you are writing out many lines as your process runs, which has a few problems, the primary being that you are often looking for data which is absent.

How many times have you been looking through pages of logs before realising “oh, the line about cache invalidation is missing, which means…”? It is much harder to notice data which is absent than data which is present but has an unexpected value.

The second problem is the size of the logs saved. Logs, especially structured ones, contain a lot of useful information, such as request ids, session ids, paths, versions, host data, and anything else interesting. The majority of these fields are repeated for every log entry in the system, and that means they need to be stored and queryable at some point. Often, this is solved by deleting historical data, or sampling at write time, both of which cause data loss, and you are back to trying to notice data which isn’t there.

Metrics exhibit the data loss problem by design. Metrics are deliberately aggregated client-side and then shipped to storage. The numbers you get from metrics can be useful, but when you look at where they come from, it becomes evident that they are just a projection of logs themselves. How many codebases have you read where every logger.Info("...", props); line is followed (or preceded) by stats.increment("some_counter")?

So What is an Event?

An Event is a structured piece of data, with as much information about the current request in it as possible. The difference is that you emit one event per request per service, if you are doing microservices. You create an event at the beginning of handling a request and send it somewhere at the end of the request (whether successful or unsuccessful).

For background tasks, again, emit one event per execution; and in a well-structured monolith, one event per request per component.

This doesn’t sound like much of a difference, until you start writing your code to add interesting properties to the event rather than log lines. We want to store as much high-cardinality data as possible (anything unique, or nearly unique); the more of it the better, as it lets us slice and dice our events by anything at a later time (e.g. by request id, user id, endpoint path, url parameters, http method, etc.)

Looking at the caching example mentioned above, before we had this:

func handleRequest(request *Request) {

    now := time.Now()

    if cache[request.UserID] == nil || cache[request.UserID].IsStale(now) {
        logger.Write("Cache miss for user", request.UserID))
        stats.Increment("cache_misses")
        fillCache(cache, request.UserID)
    }

    //...

    stats.set("request_duration", time.Since(now))
}

When the user is in the cache, there is no log line written, which is fine when everything is working. However, something unexpected happens, like a daylight savings change or sudden clock drift, and suddenly no cache entry is ever considered stale. You see a decrease in latency (which looks good), your cache_misses counter goes down (looks good), but your data is older than you realised, and bad things are happening down the line.

If you were to write this function with observability in mind, you might write something like this instead:

func handleRequest(request *Request) {

    ev := libhoney.NewEvent()
    defer ev.Send()

    now := time.Now()
    ev.Timestamp = now
    ev.Add(map[string]interface{}{
        "request_id": request.ID,
        "request_path": request.Path,
        "request_method": request.method,
        "user_id": request.UserID,
        "cache_size": len(cache),
    })

    userData, found := cache[request.UserID]
    ev.AddField("cache_hit", found)

    if !found || userData.IsStale(now) {
        userData = fillCache(ev, cache, request.UserID)
    }

    ev.AddField("cache_expires", userData.CacheUntil)
    ev.AddField("cache_is_stale", userData.IsStale(now))


    //...

    ev.AddField("request_duration_ms", time.Since(now) / time.Millisecond)
}

The resulting event will contain enough information so that in the future when a bug is introduced, you will be able to look at your events and see that yes, while request_duration_ms has gone down and cache_hit has gone up, all the events have cache_is_stale=false with cache_expires times much older than they should be.

So this is the value add of Observability: Answering Unknown Unknowns; the questions you didn’t know you needed to ask.

Implementing with Elastic Search

I won’t cover how to set up and manage the ELK stack (my opinion is that you should pay someone else to run it; don’t waste your engineering effort.) I will assume you have a way to get information from the stdout of a process into ElasticSearch somehow (I usually use piping to Filebeat, which forwards to LogStash, which processes and pushes into ElasticSearch).
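
As a sketch of that shipping step, assuming the service binary is called my-api (matching the dataset name used below) and that filebeat-stdin.yml is a hypothetical Filebeat config using the stdin input with a LogStash output:

./my-api | filebeat -e -c filebeat-stdin.yml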

Besides, the code is the important part. This is all written in Go, but I gather you can do something similar for NodeJS apps etc. We will use Honeycomb’s libhoney-go package to do the heavy lifting, and supply a custom Transmission. The following is the important part of a custom stdout sender (loosely based on libhoney’s WriterSender):

func (w *JsonSender) Add(ev *transmission.Event) {

    ev.Data["@timestamp"] = ev.Timestamp

    content, _ := json.Marshal(ev.Data)
    content = append(content, '\n')

    w.Lock()
    defer w.Unlock()

    w.Writer.Write(content)

    w.SendResponse(transmission.Response{
        Metadata: ev.Metadata,
    })
}

The key difference here is that I am only serialising the .Data property of the Event, and am inserting an extra @timestamp key to make my event structure conform to the standard LogStash pattern.

All that remains to do is configure libhoney to use the custom sender:

libhoney.Init(libhoney.Config{
    Transmission: &JsonSender{Writer: os.Stdout},
    Dataset:      "my-api",
})

Running your service, you would start to see json objects on stdout which look something like this:

{
    "@timestamp": "2020-03-15T14:51:43.041744363+02:00",
    "request_id": "7f46b313-0a37-457c-9727-b6fdc8c87733",
    "request_path": "/api/user/dashboard",
    "request_method": "GET",
    "user_id": "e6baf70f-9812-4cff-94e9-80a308077955",
    "cache_size": 86,
    "cache_hit": true,
    "cache_expires": "2020-03-15T15:02:17.045625680+02:00",
    "cache_is_stale": false,
    "request_duration_ms": 17
}

There are no message fields for you to read, but you can see everything which happened in this method; whether the user was found in the cache, how big the cache was etc.

Now if we push that into ElasticSearch, we can filter by any of the values in the event; in this case, I filtered by user_id and added columns for all the cache properties.

Kibana Screenshot

Now everything is in one place; you can slice and dice your data and figure out what exactly is going on. You can even write some metrics off your event queries if you want!

Improvements & Caveats

The main caveat is that pushing this into ElasticSearch is not as good as what you get from Honeycomb; it is just an improvement on logging messages, and enables you to demonstrate the value of observability easily.

Once you’ve demonstrated how useful observability is, the next step is to migrate to Honeycomb and get even more value.

I have written the word Honeycomb a lot in this post (9 times so far), but I want to stress that it is observability that we are after and that Honeycomb is an implementation detail. It also happens to be the only real observability tooling (although Lightstep, kind of.)

And let’s not get started on the “3 pillars of observability” bullshit being peddled by other vendors.