Tweaking Processes to Remove Errors

09 Dec 2017

When we are developing (internal) NuGet packages at work, the process we use is the following:

  1. Get latest of master
  2. New branch feature-SomethingDescriptive
  3. Implement feature
  4. Push to GitHub
  5. TeamCity builds
  6. Publish package to the NuGet feed
  7. Pull request
  8. Merge to master

Obviously, steps 3 to 6 can repeat many times if something doesn’t work out quite right.

There are a number of problems with this process:

Pull-request after publishing

Pull requests are a great tool which we use extensively, but in this case they happen too late: by the time another developer has reviewed the changes, possibly requesting amendments, the package has already been published.

Potentially broken packages published

As packages are test-consumed from the main package feed, there is a chance that someone working on another code base decides to update to the NuGet package which you have just published. Now they are pulling in a potentially broken, or unreviewed, package.

Published package is not necessarily what is on master

Assuming the pull request is approved with no changes, the code is going to make it to master. However, there is nothing to stop another developer’s changes getting to master first - now you have a merge, and the published package no longer matches what the source says it contains.

Feature/version conflicts with multiple developers

A few of our packages get updated fairly frequently, and there is a strong likelihood of two developers adding things to the same package at the same time. Both publish their package off their feature branch, and now someone’s changes have been “lost”, as the latest package doesn’t contain both developers’ changes.

Solution: Continuous Delivery / Master Based Development

We can solve all of these issues by changing the process to be more “Trunk Based”:

  1. Get latest of master
  2. New branch feature-SomethingDescriptive
  3. Implement feature
  4. Push to GitHub
  5. Pull request
  6. TeamCity builds branch
  7. Merge to master
  8. TeamCity builds & publishes the package

All we have really changed here is to publish from master, rather than from your feature branch. Now a pull request has to happen (the master branch is Protected in GitHub) before you can publish a package, meaning we have eliminated all of the issues with our previous process.

Except one, kind of.

How do developers test that their new version of the package is correct from a consuming project? There are two solutions to this (and you could implement both):

  • Publish packages to a local NuGet feed
  • Publish packages from feature branches as -pre versions

The local NuGet feed is super simple to implement: just use a directory, e.g. I have /d/dev/local-packages/ defined in my machine’s nuget.config file. We use Gulp for our builds, so modifying our gulp publish task to publish locally when no arguments are specified would be trivial.
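
For reference, a minimal nuget.config with a directory-based feed might look like this (the key name is arbitrary, and the path mirrors the one above):

<configuration>
  <packageSources>
    <!-- any directory can act as a package source -->
    <add key="local" value="D:\dev\local-packages" />
  </packageSources>
</configuration>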

The publishing of pre-release packages can also be implemented through our gulp scripts: we just need to adjust TeamCity to pass the branch name in to the gulp command (gulp ci --mode=Release --branch "%vcsroot.branch%"), and we can modify the script to add a -pre suffix to the version number if the branch parameter is not master.
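
A sketch of what that version logic could look like in the gulp scripts, assuming something like yargs for argument parsing (the function name is illustrative):

const args = require('yargs').argv;

// append -pre to the version for anything built off a feature branch,
// so only master builds produce "release" package versions
function packageVersion(baseVersion) {
  const branch = args.branch || 'master';
  return branch === 'master' ? baseVersion : baseVersion + '-pre';
}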

Personally, I would use local publishing only, and implement feature-branch publishing if the package in question is consumed by multiple teams and you want an external team to be able to verify the changes made before a proper release.

Now our developers can still test that their package works from a consuming application, without cluttering the NuGet feed with potentially broken packages.

design, process

---

Evolutionary Development

17 Nov 2017

Having recently finished reading the Building Evolutionary Architectures: Support Constant Change book, I got to thinking about a system whose architecture was fine for its initial version, but whose usage had outgrown it.

Example System: Document Storage

The system in question was a file store for a multi-user, internal, desktop-based CRM system. The number of users was very small, and the first implementation was just a network file share. This was a fine solution to start with, but as the number of CRM users grew, cracks started to appear in the system.

A few examples of problems seen were:

  • Concurrent writes to the same files
  • Finding files for a specific record in the CRM
  • Response time
  • Files “going missing”
  • Storage size
  • Data retention rules

Most of this was caused by the number of files stored, which was well past the 5 million mark. For example, queries for “all files for record x” got slower and slower over time.

Samba shares can’t be listed in date-modified order (you actually get all the file names back, and then sorting is applied), which means you can’t easily auto-delete old files, or auto-index updated files (e.g. exporting their text to Elasticsearch).

The key to dealing with this problem is to take small steps - if you have a large throughput to support, the last thing you want to do is break it for everyone at once, by doing a “big bang” release.

Not only can we take small steps in deploying our software, but we can also utilise Feature Toggles to make things safer. We can switch on a small part of the new system for a small percentage of users, and slowly ramp up usage while monitoring for errors.
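
The toggle itself doesn’t need to be complicated - here is a minimal sketch of a percentage-based toggle (the class and method names are illustrative):

public class PercentageToggle
{
    private readonly int _percentage; // 0 to 100

    public PercentageToggle(int percentage)
    {
        _percentage = percentage;
    }

    // hash the user id so the same user always gets the same answer,
    // and increasing the percentage only ever adds users
    public bool IsEnabledFor(Guid userId) =>
        Math.Abs(userId.GetHashCode() % 100) < _percentage;
}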

Incremental Replacement

To replace this in an incremental manner, we are going to do the following 4 actions for every feature, until all features are done:

  1. Implement new feature in API and client
  2. Deploy client (toggle: off)
  3. Deploy API
  4. Start toggle roll out

Now that we know how each feature is going to be delivered, we can write out our list of features, in a rough implementation order:

  • Create API, build scripts, CI and deployment pipeline
  • Implement authentication on the API
  • Implement fetching a list of files for a record
  • Implement fetching a single file’s content for a record
  • Implement storing a single file for a record
  • Implement deletion of a single file for a record

The development and deployment of our features can be overlapped too: we can be deploying the next version of the client with the next feature off while we are still rolling out the previous feature(s). This all assumes that your features are nice and isolated however!

Once this list of features is done, and all the toggles are on, from the client perspective we are feature complete.

We are free to change how the backend of the API works. As long as we don’t change the API’s contract, the client doesn’t need any more changes.
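
For illustration, the client-facing contract might be as small as this (all names here are hypothetical):

public interface IDocumentStore
{
    Task<IReadOnlyList<FileSummary>> ListFiles(Guid recordId);
    Task<Stream> FetchFile(Guid recordId, Guid fileId);
    Task<Guid> StoreFile(Guid recordId, Stream contents);
    Task DeleteFile(Guid recordId, Guid fileId);
}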

Our next set of features could be:

  • Implement audit log of API actions
  • Publish store and delete events to a queue
  • Change our indexing process to consume the store and delete events
  • Make the Samba share hidden (except to the API)
  • Implement background delete of old documents
  • Move storage backend (to S3, for example)

This list of features doesn’t impact the front end (client) system, but the backend systems can now have a more efficient usage of the file store. As with the client and initial API development, we would do this with a quick, iterative process.

But we can’t do iterative because…

This is a common reaction when an iterative approach is suggested, and thankfully can be countered in a number of ways.

First off, if this is an absolute requirement, we can do our iterations and feature-toggle rollouts in another environment, such as Pre-Production or QA. While this reduces some of the benefits (we lose out on the live-data ramp up), it does at least keep the work in small chunks.

Another workaround is to use feature toggles anyway, but only have a couple of “trusted” users use the new functionality. Depending on what you are releasing, this could mean a couple of users you know, or giving a few users a non-visible change (i.e. they’re not aware they’ve been selected!) You could also use NDAs (Non-Disclosure Agreements) if you need to keep them quiet, although this is quite an extreme measure.

A final option is to use experiments, via an experimentation library (such as GitHub’s Scientist), which continues to use the existing feature, but in parallel runs the replacement feature and records the results. This obviously has to be done with care, as you don’t want to cause side effects.
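
As a rough sketch using the .NET port of Scientist (the two lookup methods here are illustrative):

// Use() is what actually gets returned; Try() runs alongside it,
// and any mismatch between the two is recorded for later analysis
var files = Scientist.Science<List<string>>("list-files", experiment =>
{
    experiment.Use(() => ListFilesFromShare(recordId)); // existing behaviour
    experiment.Try(() => ListFilesFromApi(recordId));   // candidate behaviour
    experiment.Compare((current, candidate) => current.SequenceEqual(candidate));
});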

How do you replace old software? Big bang, iterative, experimentation, or some other process?

design, architecture, process

---

Strong Configuration Composition

09 Nov 2017

It’s no secret I am a fan of strong typing - not only do I talk and blog about it a lot, but I also have a library called Stronk which provides strongly typed configuration for non-dotnet-core projects.

A problem I come across often is large configurations. For example, given the following project structure (3 applications, all of which reference the Domain project):

DemoService
`-- src
    |-- Domain
    |   |-- Domain.csproj
    |   `-- IConfiguration.cs
    |-- QueueConsumer
    |   |-- app.config
    |   |-- QueueConsumerConfiguration.cs
    |   `-- QueueConsumer.csproj
    |-- RestApi
    |   |-- RestConfiguration.cs
    |   |-- RestApi.csproj
    |   `-- web.config
    `-- Worker
        |-- app.config
        |-- WorkerConfiguration.cs
        `-- Worker.csproj

The configuration defined in the domain will look something like this:

public interface IConfiguration
{
    string ApplicationName { get; }
    string LogPath { get; }
    Uri MetricsEndpoint { get; }

    Uri DocumentsEndpoint { get; }
    Uri ArchivalEndpoint { get; }

    string RabbitMqUsername { get; }
    string RabbitMqPassword { get; }
    string RabbitMqVHost { get; }

    string BulkQueue { get; }
    string DirectQueue { get; }
    string NotificationsQueue { get; }

    Uri RabbitMqConnection { get; }
    string DatabaseConnection { get; }
    string CacheConnection { get; }
}

There are a number of problems with this configuration:

First off, it lives in the Domain project, which kinda makes sense, as things in there need access to some of the properties - but none of them need to know the name of the Queue being listened to, or where the metrics are being written to.

Next, and also somewhat related to the first point, is that all the entry projects (RestApi, QueueConsumer and Worker) need to supply all the configuration values, and you can’t tell at a glance which projects actually need which values.

Finally, classes which use this configuration are less externally discoverable. For example, which properties does this need: new DocumentDeduplicator(new Configuration())? Probably the cache? Maybe the database? Or possibly the DocumentsEndpoint? Who knows without opening the class.

The Solution

The key to solving this is the Interface Segregation Principle - the I in SOLID. First we need to split the interface into logical parts, which will allow our consuming classes to only take in the configuration they require, rather than the whole thing:

public interface IRabbitConfiguration
{
    Uri RabbitMqConnection { get; }

    string RabbitMqUsername { get; }
    string RabbitMqPassword { get; }
    string RabbitMqVHost { get; }

    string BulkQueue { get; }
    string DirectQueue { get; }
    string NotificationsQueue { get; }
}

public interface IDeduplicationConfiguration
{
    Uri DocumentsEndpoint { get; }
    string CacheConnection { get; }
}

public interface IStorageConfiguration
{
    Uri ArchivalEndpoint { get; }
    string DatabaseConnection { get; }
}
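
The DocumentDeduplicator from earlier can now declare exactly what it needs, making its dependencies visible from the constructor alone:

public class DocumentDeduplicator
{
    private readonly IDeduplicationConfiguration _config;

    public DocumentDeduplicator(IDeduplicationConfiguration config)
    {
        _config = config;
    }

    // ...only uses _config.DocumentsEndpoint and _config.CacheConnection
}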

We can also move the IRabbitConfiguration and IDeduplicationConfiguration out of the domain project, and into the QueueConsumer and Worker projects respectively, as they are only used by types in these projects:

DemoService
`-- src
    |-- Domain
    |   |-- Domain.csproj
    |   `-- IStorageConfiguration.cs
    |-- QueueConsumer
    |   |-- app.config
    |   |-- IRabbitConfiguration.cs
    |   |-- QueueConsumerConfiguration.cs
    |   `-- QueueConsumer.csproj
    |-- RestApi
    |   |-- RestConfiguration.cs
    |   |-- RestApi.csproj
    |   `-- web.config
    `-- Worker
        |-- app.config
        |-- IDeduplicationConfiguration.cs
        |-- WorkerConfiguration.cs
        `-- Worker.csproj

Next we can create some top-level configuration interfaces, which compose the relevant configuration interfaces for a project (e.g. the RestApi doesn’t need IDeduplicationConfiguration or IRabbitConfiguration):

public interface IWorkerConfiguration : IStorageConfiguration, IDeduplicationConfiguration
{
    string ApplicationName { get; }
    string LogPath { get; }
    Uri MetricsEndpoint { get; }
}

public interface IRestConfiguration : IStorageConfiguration
{
    string ApplicationName { get; }
    string LogPath { get; }
    Uri MetricsEndpoint { get; }
}

public interface IQueueConsumerConfiguration : IStorageConfiguration, IRabbitConfiguration
{
    string ApplicationName { get; }
    string LogPath { get; }
    Uri MetricsEndpoint { get; }
}

Note how we have also not created a central interface for the application configuration - this is because the application configuration is specific to each entry project, and has no need to be passed on to the domain.

Finally, an actual configuration class can be implemented (in this case using Stronk, but if you are on dotnet core, the inbuilt configuration builder is fine):

public class QueueConsumerConfiguration : IQueueConsumerConfiguration
{
    public string ApplicationName { get; private set; }
    public string LogPath { get; private set; }
    public Uri MetricsEndpoint { get; private set; }

    public Uri ArchivalEndpoint { get; private set; }
    public string DatabaseConnection { get; private set; }
    public Uri RabbitMqConnection { get; private set; }

    public string RabbitMqUsername { get; private set; }
    public string RabbitMqPassword { get; private set; }
    public string RabbitMqVHost { get; private set; }

    public string BulkQueue { get; private set; }
    public string DirectQueue { get; private set; }
    public string NotificationsQueue { get; private set; }

    public QueueConsumerConfiguration()
    {
        this.FromAppConfig();
    }
}
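
For comparison, a rough dotnet core equivalent using the inbuilt builder might be (assuming an appsettings.json with matching keys, and public setters so the binder can populate the properties):

var config = new ConfigurationBuilder()
    .AddJsonFile("appsettings.json")
    .Build()
    .Get<QueueConsumerConfiguration>();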

And our startup class might look something like this (using StructureMap):

public class Startup : IDisposable
{
    private readonly Container _container;
    private readonly IConsumer _consumer;

    public Startup(IQueueConsumerConfiguration config)
    {
        ConfigureLogging(config);
        ConfigureMetrics(config);

        _container = new Container(_ =>
        {
            _.Scan(a => {
                a.TheCallingAssembly();
                a.LookForRegistries();
        });

            _.For<IQueueConsumerConfiguration>().Use(config);
            _.For<IStorageConfiguration>().Use(config);
            _.For<IRabbitConfiguration>().Use(config);
        });

        _consumer = _container.GetInstance<IConsumer>();
    }

    public async Task Start() => await _consumer.Start();
    public async Task Stop() => await _consumer.Stop();

    private void ConfigureLogging(IQueueConsumerConfiguration config) { /* ... */ }
    private void ConfigureMetrics(IQueueConsumerConfiguration config) { /* ... */ }

    public void Dispose() => _container.Dispose();
}

As our Startup takes in the top-level configuration interface, if we want to write a test which tests our entire system, it can be done with a single mocked configuration object:

[Fact]
public async Task When_the_entire_system_is_run()
{
    var config = Substitute.For<IQueueConsumerConfiguration>();
    config.RabbitMqConnection.Returns(new Uri("amqp://localhost:5672"));
    // etc.

    var startup = new Startup(config);
    await startup.Start();
    await startup.Stop();
}

One Final Thing

Even if you have a microservice-type project with only the one csproj, I would still recommend splitting your configuration into small interfaces, just for the discoverability it provides.

How do you do configuration?

code, configuration, design, architecture, net, strongtyping, stronk

---

Alarm Fatigue

30 Oct 2017

I’ve been on-call for work over the last week for the first time, and while it wasn’t as alarming (heh) as I thought it might be, I have had a few thoughts on it.

Non-action Alarms

We periodically get an alarm about an MVC View not being passed the right kind of model. The resolution is to mark the bug as completed/ignored in YouTrack. Reading the stack trace, I could see that the page was expecting a particular model, but was being given a HandleErrorInfo model, which is a built-in type. It turns out the controller was missing an attribute which would allow custom error handling; after some investigation and a quick pull request, we no longer get that error message.

Un-aggregated Alarms

If there is one timeout in a system…I don’t care that much. If there are multiple in a short space of time, then I want an alarm. Otherwise, it should be logged and reviewed when I am next at the office, so I can look for a pattern, such as it happening hourly on the hour, or every 28 hours.

Bad Error Messages

“No or more than one usable entry found” - this exception makes sense if you have the context it is thrown from. However, when reading the stack trace in YouTrack, it’s not ideal. Some investigation shows that all of the data required to write some good exception messages is available.

This gets harder when the exception is thrown from a library, especially when most of the code required to generate the exception is marked as internal, and the library only throws one kind of error, differing only by message. The way I will solve this one is to catch the error, and if the message is the one I care about, throw a new version with a better message. Unfortunately, to build that message, I will have to regex the first exception’s message. Sad times.
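
A sketch of that catch-and-rethrow, where the library exception type, the replacement exception type, and the regex are all illustrative:

try
{
    library.FindEntry(query);
}
catch (LibraryException ex) when (ex.Message.StartsWith("No or more than one usable entry"))
{
    // the only context available is buried in the message itself,
    // so regex out the interesting part before re-throwing
    var detail = Regex.Match(ex.Message, "entry found (.*)").Groups[1].Value;

    throw new UsableEntryLookupException(
        $"Lookup for '{query}' failed: {detail}",
        ex);
}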

Run Books

We have some! Not as many as I would like, but hopefully I can expand on them as I learn things. One thing I noticed about them is they are named after the business purpose, which is great from a business perspective…but is not so great when I am looking to see if there is a run book for an exception in YouTrack. A few things can be done to fix this.

First off, we can rename the run books to resemble the exception messages to make them easier to find. The business description can be included as a sub-title, or in the body of the run book.

Next is to update the exceptions thrown to include a link to the relevant run book, so that when someone opens a ticket in YouTrack, they can see a link to how to solve it.

Third, and my favourite, is to get rid of the run book entirely, by automating it. If the run book contains a fix like “run x query, and if it has rows, change this column to mark the job as processed”, then we can just make the code itself handle this case, or modify the code to prevent this case from happening at all.

Overall

Overall, I enjoyed being on call this week. Nothing caught fire too much, and I have learnt quite a bit about various parts of our system which I don’t often interact with. But on-call (and thus on-boarding) can be improved - so that’s what I am going to do. Hopefully when a new person enters the on-call schedule, they will have an even easier time getting up to speed.

support, alarms, oncall

---

Vagrant in the world of Docker

22 Oct 2017

I gave a little talk at work recently on my use of Vagrant, what it is, and why it is still useful in a world full of Docker containers.

So, What is Vagrant?

Vagrant is a product by HashiCorp for scripting the creation of (temporary) virtual machines. It’s pretty fast to create a virtual machine with too, as it creates them from a base image (known as a “box”).

It also supports multiple virtualisation tools, such as VirtualBox and Hyper-V. If you are already using Packer to create AMIs for your Amazon infrastructure, you can modify your Packer template to also output a Vagrant box.

As an example, this is a really basic Vagrantfile for creating an Ubuntu box:

Vagrant.configure("2") do |config|
  config.vm.box = "hashicorp/precise64"

  config.vm.provider "hyperv" do |h|
    h.vmname = "UbuntuPrecise"
    h.cpus = 2
    h.memory = 2048
  end
end

To create the VM (and then connect to it, or destroy it), on your command line just run:

vagrant up # creates the virtual machine
vagrant ssh # ssh into the virtual machine
vagrant destroy -f # destroy the virtual machine

What can I use it for?

Personally, I have three main uses for Vagrant boxes: Performance/Environment Testing, Cluster Testing, and Complete Environment Setup.

Performance and Environment Testing

When I am developing a service which will be deployed to AWS, we tend to know roughly what kind of instance it will be deployed to, for example t2.small. The code we develop locally performs well…but that is on a development machine with anywhere from 4 to 16 CPU cores, 8 to 32 GB of memory, and SSD storage. How do you know what performance will be like when running on a 2-core, 2048 MB machine in AWS?

While you can’t emulate AWS exactly, it has certainly helped us tune applications - for example, modifying how many messages to handle in parallel when receiving from RabbitMQ (you can see how to configure this in my previous post, Concurrency in RabbitMQ).

Cluster Testing

When you want to test a service which will operate in a cluster, Vagrant comes to the rescue again - you can use the define block to set up multiple copies of the machine, and provide common provisioning:

Vagrant.configure("2") do |config|
  config.vm.box = "hashicorp/precise64"
  config.vm.provision "shell", inline: <<SCRIPT
  # a bash script to setup your service might go here
SCRIPT

  config.vm.provider "hyperv" do |h|
    h.vmname = "UbuntuPrecise"
    h.cpus = 1
    h.memory = 1024
  end

  config.vm.define "first"
  config.vm.define "second"
  config.vm.define "third"
end

If you want to do more configuration of your separate instances, you can provide a block to do so:

  config.vm.define "third" do |third|
    third.vm.provision "shell", inline: "./vagrant/boot-cluster.sh"
  end

Complete Environment

If you’re developing a microservice in an environment with many other microservices which it needs to interact with, it can be a pain to set up all the hosts and supporting infrastructure.

Instead, we can create a single base box which contains all of the setup and services; each microservice can then have a Vagrantfile based off the base box, which also stops the installed copy of the service under development, and starts the version located in the /vagrant share instead:

Vagrant.configure("2") do |config|
  config.vm.box = "mycorp/complete-environment"

  config.vm.provider "hyperv" do |h|
    h.vmname = "WhateverServiceEnvironment"
    h.cpus = 4
    h.memory = 4096
  end

  # replace-service is a script which stops/removes an existing service,
  # and installs/starts a replacement. it uses a convention which expects
  # a service to have a script at `/vagrant/<name>/bin/<name>.sh`
  config.vm.provision "shell", inline: "./replace-service.sh WhateverService"

end

In this case, the mycorp/complete-environment box would have all the services installed and started, and also a script in the machine root which does all the work to replace a service with the one under development.

This base box could also be used to provide a complete testing environment - just create a Vagrantfile with no additional provisioning, and call vagrant up:
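
That Vagrantfile ends up being as minimal as it sounds:

Vagrant.configure("2") do |config|
  config.vm.box = "mycorp/complete-environment"
end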

Couldn’t you use Docker for this instead?

Well yes, you can use Docker…but for some tasks, this is just easier. We can also utilise Docker as both input and output here: the base image could run Docker internally to host all the services, or we could use a Packer script to generate both a Docker container and a Vagrant box from the same setup.

Just because Docker is the cool thing to be using these days, doesn’t mean Vagrant doesn’t have any uses any more. Far from it!

code, docker, vagrant, testing

---