Observability with Infrastructure as Code

01 Mar 2021

This article was originally published on the Pulumi blog.

When using the Pulumi Automation API to create applications which can provision infrastructure, it is very handy to be able to use observability techniques to check that the application functions correctly and to see where the performance bottlenecks are.

One of the applications I work on creates a VPC and bastion host and then stores the credentials in a Vault instance. The problem is that the “create infrastructure” part is an opaque blob: I can see it takes 129 seconds to run, but I can’t see what it’s doing, or why it takes that amount of time.

honeycomb traces of one pulumi stack resource

So can I do better?

The Initial Application

In this example I use Honeycomb’s Go Beeline to capture all the data I care about (durations, errors, and any context which is “interesting”):

func main() {

	beeline.Init(beeline.Config{
		WriteKey: os.Getenv("HONEYCOMB_API_KEY"),
		Dataset:  "pulumi-demo",
	})
	defer beeline.Close()

	ctx, span := beeline.StartSpan(context.Background(), "basic-vpc")
	defer span.Send()

	name := auto.FullyQualifiedStackName(os.Getenv("PULUMI_USERNAME"), "basic-vpc", "dev")
	stack, err := auto.UpsertStackInlineSource(ctx, name, "basic-vpc", func(pc *pulumi.Context) error {

		azs, err := getAvailabilityZones(ctx)
		if err != nil {
			beeline.AddField(ctx, "err", err)
			return err
		}

		_, err = vpc.NewVpc(ctx, pc, "dev", &vpc.VpcArgs{
			Description:           "dev",
			BaseCidr:              "192.168.0.0/16",
			AvailabilityZoneNames: azs,
			S3Endpoint:            true,
			DynamoEndpoint:        true,
		})
		if err != nil {
			beeline.AddField(ctx, "err", err)
			return err
		}

		return nil
	})

	if err != nil {
		beeline.AddField(ctx, "err", err)
		os.Exit(1)
	}

	if err := stack.SetConfig(ctx, "aws:region", auto.ConfigValue{Value: os.Getenv("PULUMI_REGION")}); err != nil {
		beeline.AddField(ctx, "err", err)
		os.Exit(1)
	}

	ws := stack.Workspace()
	if err := ws.InstallPlugin(ctx, "aws", "v3.23.0"); err != nil {
		beeline.AddField(ctx, "err", err)
		os.Exit(1)
	}

	if _, err := stack.Refresh(ctx); err != nil {
		beeline.AddField(ctx, "err", err)
		os.Exit(1)
	}

	stream := optup.ProgressStreams(os.Stdout)
	if _, err := stack.Up(ctx, stream); err != nil {
		beeline.AddField(ctx, "err", err)
		os.Exit(1)
	}

	//vault code

}

Adding Infrastructure Observability

To get a handle on what is happening when stack.Up() runs, I have implemented a custom io.Writer, which will be passed into the optup.ProgressStreams constructor.

The custom progress stream’s Write method is called once for each line emitted, which allows us to start a new span when a resource starts being constructed, and send it when construction completes. Currently, this is achieved by parsing the console output text, but I gather that in the future it will be possible to get streamed JSON blobs which can be unmarshalled into Go structs.

type pulumiBeeline struct {
	ctx      context.Context
	contexts map[string]func()
}

func NewPulumiBeeline(ctx context.Context) *pulumiBeeline {
	return &pulumiBeeline{
		ctx:      ctx,
		contexts: map[string]func(){},
	}
}

func (cw *pulumiBeeline) Write(p []byte) (n int, err error) {

	// todo: make more robust, support modifications, deletions etc.
	line := strings.TrimSpace(string(p))
	parts := strings.Split(line, " ")
	if len(parts) < 5 {
		return len(p), nil
	}

	//+  aws-vpc dev creating
	//+  <type> <name> <action>
	resourceType := parts[2]
	resourceName := parts[3]
	resourceAction := parts[4]

	if resourceAction == "creating" {
		c, s := beeline.StartSpan(cw.ctx, resourceName)
		beeline.AddField(c, "type", resourceType)
		// add other things here

		cw.contexts[resourceName] = s.Send
	}

	if resourceAction == "created" {
		cw.contexts[resourceName]()
	}

	return len(p), nil
}

Modifying the optup.ProgressStreams call is the only change needed to the original application:

stream := optup.ProgressStreams(os.Stdout, NewPulumiBeeline(ctx))
if _, err := stack.Up(ctx, stream); err != nil {
	beeline.AddField(ctx, "err", err)
	os.Exit(1)
}

Now when I run this program again, I can see a lot more information in my Honeycomb traces, which not only shows me that Pulumi is highly parallelised, but also gives me a better idea of where the time is taken when creating infrastructure; in this example, it’s the NAT Gateways:

honeycomb traces of all infrastructure resources in the pulumi stack

In the future, I want to expand this to cover far more details, such as including the reasons resources were created/modified/destroyed and including as much information as possible about what caused a resource to fail.

Wrapping Up

In the end, this turned out to be much easier to achieve than I had hoped. Being able to use Pulumi programmatically, rather than shelling out to the CLI with os/exec myself, was a huge productivity boost.

I am looking forward to all the new kinds of tooling I can build to solve my users’ problems, while continuing to utilise Honeycomb for my observability and Pulumi for my infrastructure.

Forking Multi Container Docker Builds

03 Nov 2020

Following on from my last post on Isolated Multistage Docker Builds, I thought it would be useful to cover another advantage to splitting your dockerfiles: building different output containers from a common base.

The Problem

I have an application which, when built, needs to have all of its assets in one container, and a subset of those assets in a second container.

For example, when writing a node webapp, you want the compiled/bundled static assets available in the backend container as a fallback, and also stored in an nginx container for serving. One of the reasons to do this is separation of concerns: I don’t want to put my backend code where it doesn’t need to be. There is also, in this case, the fact that the backend code and nginx need different base images, meaning deploying the same container twice won’t work.

So let’s see how we solve this!

Creating Separate Dockerfiles

The first dockerfile to write is the common base, which I name Dockerfile.builder. This is the same as in the previous post: we are assuming that the yarn ci:build step transpiles the TypeScript and generates the static assets for our application.

FROM node:15.0.1-alpine3.12 as builder
WORKDIR /app

COPY . ./
RUN yarn install --frozen-lockfile && yarn cache clean

RUN yarn ci:build

Next up is the server container, which will be in the Dockerfile.backend file, as I try to name the files based on their purpose rather than the technology used. As in the previous post, this installs the production dependencies for the application, and copies in the compiled output from the builder stage:

ARG builder_image
FROM ${builder_image} as builder

FROM node:15.0.1-alpine3.12 as output
WORKDIR /app

COPY package.json yarn.lock /app/
RUN yarn install --frozen-lockfile --production && yarn cache clean

COPY --from=builder /app/dist /app

Now let’s deal with the Dockerfile.frontend. This uses nginx:1.19.3-alpine as a base, and copies in the nginx.conf file from the host, and the static assets directory from the builder container:

ARG builder_image
FROM ${builder_image} as builder

FROM nginx:1.19.3-alpine as output

COPY ./nginx.conf /etc/nginx/nginx.conf
COPY --from=builder /app/dist/static /app

Building Containers

The reason we rely on the builder stage rather than the backend output stage is that we are now decoupled from layout/structural changes in that container, and we gain the ability to run the builds in parallel too (the & at the end of the lines), for a bit of a speed up on our build agents:

version="${GIT_COMMIT:0:7}"
builder_tag="builder:$version"

docker build --file Dockerfile.builder -t "$builder_tag" .

# run the builder container here to do tests, lint, static analysis etc.

docker build --file Dockerfile.backend --build-arg "builder_image=$builder_tag" -t backend:$version . &
docker build --file Dockerfile.frontend --build-arg "builder_image=$builder_tag" -t frontend:$version . &

wait

The result of this is 3 containers, all tagged with the short version of the current git commit:

  • builder:abc123e - contains all packages, compiled output
  • backend:abc123e - node based, contains the node backend and static assets
  • frontend:abc123e - nginx based, contains the static assets

I can now publish the builder internally (so it can be cloned before builds for caching and speed), and deploy the backend and frontend to their different locations.
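
As a rough sketch of that internal publish (the registry host and the :latest tag here are assumptions for illustration, not part of the original build script), the builder image is retagged and pushed, and later builds pull it to seed the layer cache:

internal_builder="registry.internal/app/builder"

# publish the builder so other agents can reuse its layers
docker tag "$builder_tag" "$internal_builder:latest"
docker push "$internal_builder:latest"

# on a later build agent, before building:
docker pull "$internal_builder:latest" || true
docker build \
    --file Dockerfile.builder \
    --cache-from "$internal_builder:latest" \
    -t "$builder_tag" \
    .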

Isolated Docker Multistage Images

01 Nov 2020

Often when building applications, I will use a multistage docker build to keep the output container small and efficient, but will run the build in two halves, to make use of the extra assets in the builder container, something like this:

docker build \
  --target builder \
  -t builder:$GIT_COMMIT \
  .

docker run --rm \
  -v "$PWD/artefacts/tests:/artefacts/tests" \
  builder:$GIT_COMMIT \
  yarn ci:test

docker run --rm \
  -v "$PWD/artefacts/lint:/artefacts/lint" \
  builder:$GIT_COMMIT \
  yarn ci:lint

docker build \
  --cache-from builder:$GIT_COMMIT \
  --target output \
  -t app:$GIT_COMMIT \
  .

This usually works fine, but sometimes the .dockerignore file won’t have everything set correctly, and when docker runs the last build command it will decide that it needs to rebuild the builder stage too, which is pretty irritating.

The first solution is to try and figure out what you need to add to your .dockerignore file, which, depending on your repository structure and container usage, might be more hassle than it’s worth.
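
For illustration only, the kind of entries involved are local build outputs and dependency folders; the exact list depends entirely on your repository (these paths are examples, not taken from a real project):

# .dockerignore (illustrative entries)
.git
node_modules
artefacts/
dist/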

The second solution is to prevent docker invalidating the first layers at all, by splitting the build into separate files.

Splitting the Dockerfile

Let’s start with an example docker file, which is a generic yarn based application with multistage build configured:

FROM node:15.0.1-alpine3.12 as builder
WORKDIR /app

COPY . ./
RUN yarn install --frozen-lockfile && yarn cache clean

RUN yarn ci:build


FROM node:15.0.1-alpine3.12 as output
WORKDIR /app

COPY package.json yarn.lock /app/
RUN yarn install --frozen-lockfile --production && yarn cache clean

COPY --from=builder /app/dist /app

The first file will be our Dockerfile.builder, which is a direct copy-paste:

FROM node:15.0.1-alpine3.12 as builder
WORKDIR /app

COPY . ./
RUN yarn install --frozen-lockfile && yarn cache clean

RUN yarn ci:build

The second file can also be a direct copy-paste, saved as Dockerfile.output, but it has a problem:

FROM node:15.0.1-alpine3.12 as output
WORKDIR /app

COPY package.json yarn.lock /app/
RUN yarn install --frozen-lockfile --production && yarn cache clean

COPY --from=builder /app/dist /app

We want to copy from a different container, not a different stage. While the COPY command does let you specify another container in the --from parameter, we really want to specify which container to use at build time. The first attempt at solving this was using a build arg:

ARG builder_image
COPY --from=${builder_image} /app/dist /app

But alas, this doesn’t work either, as the --from parameter doesn’t support variables. The solution turns out to be that the FROM command does support parameterisation, so we can (ab)use that:

ARG builder_image
FROM ${builder_image} as builder

FROM node:15.0.1-alpine3.12 as output
WORKDIR /app

COPY package.json yarn.lock /app/
RUN yarn install --frozen-lockfile --production && yarn cache clean

COPY --from=builder /app/dist /app

Now our build script can use the --build-arg parameter to force the right container:

docker build \
-  --target builder \
+  --file Dockerfile.builder \
  -t builder:$GIT_COMMIT \
  .

docker run --rm \
  -v "$PWD/artefacts/tests:/artefacts/tests" \
  builder:$GIT_COMMIT \
  yarn ci:test

docker run --rm \
  -v "$PWD/artefacts/lint:/artefacts/lint" \
  builder:$GIT_COMMIT \
  yarn ci:lint

docker build \
-  --cache-from builder:$GIT_COMMIT \
+  --build-arg "builder_image=builder:$GIT_COMMIT" \
+  --file Dockerfile.output \
  -t app:$GIT_COMMIT \
  .

We can now safely modify the working directory to our heart’s content without worrying about invalidating the layer caches.

Better BASHing Through Technology

28 Aug 2020

I write a lot of bash scripts for both my day job and my personal projects, and while they are functional, bash scripts always seem to lack the structure that I want, especially when compared to writing something in Go or C#. The main problem I have with bash is that when I use functions, I lose the ability to log things.

For example, the get_config_path function below prints the path to the configuration file, which gets consumed by the do_work function:

get_config_path() {
  echo "Finding Machine Configurtaion"

  if [ -n "$CONFIG_PATH" ] && [ -e "$CONFIG_PATH" ]; then
    # echo "Using Configuration from environment"
    echo "$CONFIG_PATH"
    return
  fi

  if [ -e "~/.config/demo/config.toml" ]; then
    # echo "Using Configuration directory file"
    echo "~/.config/demo/config.toml"
    return
  fi

  # echo "Unable to find configuration"
  exit 1
}

do_work() {
  local -r config=$(get_config_path)

  # actual work...
}

The problem is that if I include the echo statements which act as log lines, the config variable in the do_work function will contain them too.
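
To make that concrete, here is roughly what the captured value ends up holding (the path shown is hypothetical):

config=$(get_config_path)
echo "$config"
# Finding Machine Configuration
# /home/user/.config/demo/config.toml
# ...anything expecting $config to be just a path now breaks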

In the past, this has caused me to use functions in bash very sparingly, usually for things that are short and don’t need much logging. However, I was recently building some AMIs and happened to be browsing the Consul and Vault Terraform module source, which uses several bash scripts written in a different style, one which makes them vastly more maintainable.

So let’s have a look at the new structure I use, which is heavily based on these scripts.

Better Bash Scripts

Before we get to the implementation, make sure you are using ShellCheck: it provides static analysis of your scripts, pointing out many things like missing quotes or incorrectly handled arrays. It has plugins for all your favourite editors too.

0. General Styles and Constructs

All variables should be declared readonly, and local if possible, to help prevent surprise values from being present if other functions forget to do the same.

#top level
readonly SCRIPT_NAME="$(basename "$0")"

some_method() {
  # within a method
  local -r config_path="$1"
}

Functions should assign their passed arguments to named variables as the first thing they do, preferably matching the variable name they are passed. This helps later when you are searching through a script for usages of “config_file”, as you don’t have to track down other names/aliases for the same value.

read_config() {
  local -r config_file="$1"
  local -r skip_validation="$2"

  # ...
}

invoke() {
  # ...

  read_config "$config_file" "$skip_validation"
}

1. Error Handling

It should go without saying, but you really need to start your scripts with the following:

#!/bin/bash

set -euo pipefail;

There are many better articles on what these specifically do, but suffice to say:

  • e causes the script to stop on errors
  • u causes it to error on undefined variables being used
  • o pipefail causes a non-zero exit code from any command in a pipeline to fail the script too (rather than just the last command).

2. Logging

The real advantage of this structure is that we get to have log statements! This is achieved by writing all logging to stderr instead of stdout. We use a standardised log function across all the scripts, which also includes the script’s name, so that when calling other scripts you can see which one wrote the log line:

readonly SCRIPT_NAME="$(basename "$0")"

log() {
  local -r level="$1"
  local -r message="$2"
  local -r timestamp=$(date +"%Y-%m-%d %H:%M:%S")

  >&2 echo -e "${timestamp} [${level}] [$SCRIPT_NAME] ${message}"
}

The function is invoked as log "INFO" "Some status" or log "WARN" "Something concerning", etc.
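
With the log function above, that produces output along these lines (the timestamp and script name are illustrative):

2020-08-28 09:00:00 [INFO] [create-cert.sh] Some status
2020-08-28 09:00:01 [WARN] [create-cert.sh] Something concerning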

3. Error Checking

We have some standard assertion functions which are used by the script when starting up to validate arguments:

assert_not_empty() {
  local -r arg_name="$1"
  local -r arg_value="$2"

  if [[ -z "$arg_value" ]]; then
    log "ERROR" "The value for '$arg_name' cannot be empty"
    exit 1
  fi
}

assert_is_installed() {
  local -r name="$1"

  if [[ ! $(command -v "$name") ]]; then
    log "ERROR" "The binary '$name' is required by this script but is not installed or in the system's PATH."
    exit 1
  fi
}

4. Argument parsing

When scripts need to take parameters in, I prefer to use long-flag style, as it is a little more readable for people checking invocations in the future. The parsing function is always called run; it is the last function defined, and it is invoked immediately after its definition, passing in all the script arguments (run "$@"):

run() {
  local namespace=""
  local suffix=""
  local dry_run="false"

  while [[ $# -gt 0 ]]; do
    local key="$1"

    case "$key" in
      --namespace)
        namespace="$2"
        shift
        ;;
      --suffix)
        assert_not_empty "$key" "$2"
        suffix="$2"
        shift
        ;;
      --dry-run)
        dry_run="true"
        ;;
      --help)
        print_usage
        exit
        ;;
      *)
        log "ERROR" "Unrecognized argument: $key"
        print_usage
        exit 1
        ;;
    esac

    shift
  done

  # mandatory flag validation
  assert_not_empty "--namespace" "$namespace"

  # make sure tools are installed
  assert_is_installed "vault"
  assert_is_installed "openssl"
  assert_is_installed "jq"

  # do the work!
  local -r cert=$(generate_cert "$suffix")

  store_cert "$namespace" "$cert" "$dry_run"

}

run "[email protected]"

The validation uses the assert_not_empty function defined above in two ways: after the while loop, to check that the mandatory values have been filled in, and within the case statement, for the values of optional flags.

We also use assert_is_installed to validate that the utilities we need are installed, such as vault, openssl and jq.

The print_usage function is just a set of echo statements giving all the flags, and an example invocation:

print_usage() {
  echo
  echo "Usage: $SCRIPT_NAME [OPTIONS]"
  echo
  echo "This script creates a new certiticate, and it installs it into the right namespace"
  echo
  echo "Options:"
  echo
  echo -e "  --namespace\tThe namespace to install the certificate in"
  echo -e "  --suffix\tAn optional suffix for the hostname"
  echo -e "  --dry-run\tDon't install the certificate"
  echo
  echo "Example:"
  echo
  echo "  $SCRIPT_NAME --namespace test --dry-run"
}

Usage

I keep a single template file which has all of this written into it, and new scripts start off with a copy-paste of the template. Could it be DRYer? Sure, but then I have to deal with dependency management, and it’s just not worth the hassle and overhead.

Sharing Docker Layers Between Build Agents

14 May 2020

Recently, I noticed that when we pull a new version of our application’s docker container, it fetches all layers, not just the ones that change.

The problem is that we use ephemeral build agents, which means that each version of the application is built on a different agent, so Docker has no local cache of the layers from previous builds. While we can pull the published container before we run the build, this only helps with the final stage of the build. We want to cache the other stages too, as the earlier layers don’t change often and can be quite slow to build.

We can achieve this by tweaking how we build our stages, which will also allow some other interesting optimisations.

The Dockerfile

An example dockerfile is below. There are two stages, builder and prod. In the case we are looking at, both the OS packages and application dependencies rarely change, but can take quite a while to install.

FROM node:14.2.0-alpine3.11 AS builder
WORKDIR /app

RUN apk add --no-cache make gcc g++ python

COPY package.json yarn.lock ./
RUN yarn install --no-progress --frozen-lockfile && \
    yarn cache clean

COPY ./src ./src
RUN yarn build


FROM node:14.2.0-alpine3.11 AS prod
WORKDIR /app

COPY package.json yarn.lock ./

RUN yarn install --production --no-progress --frozen-lockfile && \
    yarn cache clean

COPY --from=builder /app/dist ./dist

CMD ["yarn", "start"]

The first step is to try and pull both :builder and :latest images. We append || true as the images might not exist yet, and we want the build to pass if they don’t!

docker pull app:builder || true
docker pull app:latest || true

Now that we have the application images locally, we can proceed to building the :builder stage. We tag it twice: once with just app:builder and once with the short hash of the commit being built.

docker build \
    --cache-from=app:builder \
    --target builder \
    -t app:builder-$COMMIT_SHORT \
    -t app:builder \
    .

Now that we have built our builder stage, we can use this to do lots of other things which require both dependencies and devDependencies, such as running tests and linters, and we could even distribute these tasks to multiple other machines if we wanted extra parallelism:

docker run --rm -it app:builder-$COMMIT_SHORT yarn test
docker run --rm -it app:builder-$COMMIT_SHORT yarn test:integration
docker run --rm -it app:builder-$COMMIT_SHORT yarn lint

Once we are happy with our tests, we can build the production container, which we do by using the --cache-from directive twice: once with the builder image we just created, and once with the latest version of our application. Note that the order of the --cache-from parameters matters; this won’t work if you specify app:latest before app:builder!

docker build \
    --cache-from=app:builder \
    --cache-from=app:latest \
    -t app:$COMMIT_SHORT \
    -t app:latest \
    .

Now we can publish everything. We always publish the commit tagged version so that separate branch builds can be fetched and tested, and if the branch is master, we publish both the :builder and :latest tags:

docker push app:$COMMIT_SHORT

if [ "$BRANCH" == "master" ]; then
    docker push app:builder
    docker push app:latest
fi

The full build script looks like this:

docker pull app:builder || true
docker pull app:latest || true

docker build \
    --cache-from=app:builder \
    --target builder \
    -t app:builder-$COMMIT_SHORT \
    -t app:builder \
    .

# run these in parallel
docker run --rm -it app:builder-$COMMIT_SHORT yarn test
docker run --rm -it app:builder-$COMMIT_SHORT yarn test:integration
docker run --rm -it app:builder-$COMMIT_SHORT yarn lint

docker build \
    --cache-from=app:builder \
    --cache-from=app:latest \
    -t app:$COMMIT_SHORT \
    -t app:latest \
    .

docker push app:$COMMIT_SHORT

if [ "$BRANCH" == "master" ]; then
    docker push app:builder
    docker push app:latest
fi

Effects

By publishing both our :builder and :latest tags, we can effectively share the layer caches for all build stages across all build agents. As the layers are shared, pulling the images at the beginning of the builds is pretty fast, and the publishes at the end of the build are very, very fast.

The real benefit comes with building our monolith, which now only needs a small layer to be pulled on deployment, rather than all of the layers, which speeds up our deployments by minutes per host.