Alarm Fatigue

30 Oct 2017

I’ve been on-call at work for the first time over the last week, and while it wasn’t as alarming (heh) as I thought it might be, it did leave me with a few thoughts.

Non-action Alarms

We have a periodic alarm about an MVC view not being passed the right kind of model. The resolution is to mark the bug as completed/ignored in YouTrack. Reading the stack trace, I could see that the page was expecting a particular model, but was being given a HandleErrorInfo model, which is a built-in type. After some investigation and a quick pull request, we no longer get that error message: it turns out the controller was missing an attribute which would allow custom error handling.
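
For reference, the fix was along these lines - a minimal sketch assuming MVC 5’s HandleErrorAttribute, with a hypothetical controller name:

[HandleError] // opt this controller in to MVC's custom error handling
public class ReportsController : Controller
{
    public ActionResult Index()
    {
        // exceptions thrown by actions are now wrapped in a HandleErrorInfo
        // and rendered by the shared Error view, rather than surfacing as
        // the model-mismatch alarm we were seeing
        return View();
    }
}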

Un-aggregated Alarms

If there is one timeout in a system…I don’t care that much. If there are multiple in a short space of time, then I want an alarm. Otherwise, it should be logged and reviewed when I am next at the office, so I can look for a pattern, such as it happening hourly on the hour, or every 28 hours.
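
As a sketch of the kind of aggregation I mean (the class, threshold, and window are all invented for illustration):

public class TimeoutAlarmAggregator
{
    private readonly Queue<DateTime> _timeouts = new Queue<DateTime>();
    private static readonly TimeSpan Window = TimeSpan.FromMinutes(5);
    private const int Threshold = 3;

    public void OnTimeout(DateTime now)
    {
        _timeouts.Enqueue(now);

        // forget any timeouts which fell outside the window
        while (_timeouts.Count > 0 && now - _timeouts.Peek() > Window)
            _timeouts.Dequeue();

        if (_timeouts.Count >= Threshold)
            RaiseAlarm();        // several timeouts in a short space of time
        else
            LogForLaterReview(); // a one-off: keep it for pattern spotting
    }

    private void RaiseAlarm() { /* page the on-call person */ }
    private void LogForLaterReview() { /* write to the log store */ }
}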

Bad Error Messages

No or more than one usable entry found - this exception makes sense if you have the context it is thrown from. However, when reading the stack trace in YouTrack, it’s not ideal. Some investigation shows that all of the data required to write a good exception message is available.

This gets harder when the exception is thrown from a library, especially when most of the code required to generate the exception is marked as internal, and the library only throws one kind of error, differing only by message. The way I will solve this one is to catch the error, and if the message is the one I care about, throw a new version with a better message. Unfortunately, to build that message, I will have to regex the first exception’s message. Sad times.
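
The shape of the fix will be roughly this - a sketch only, as the exception types and regex pattern are invented to stand in for the library’s real ones:

try
{
    _library.FindEntry(entryName);
}
catch (LibraryException ex) when (ex.Message.StartsWith("No or more than one usable entry found"))
{
    // the library's errors differ only by message, so the useful details
    // have to be pulled out of the message text (pattern is hypothetical)
    var match = Regex.Match(ex.Message, "entry found for '(?<key>[^']+)'");
    var key = match.Success ? match.Groups["key"].Value : "<unknown>";

    throw new UsableEntryNotFoundException(
        $"No single usable entry found for '{key}' while looking up {entryName} - check for duplicate or missing entries.",
        ex);
}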

Run Books

We have some! Not as many as I would like, but hopefully I can expand on them as I learn things. One thing I noticed is that they are named after their business purpose, which is great from a business perspective…but not so great when I am looking to see if there is a run book for an exception in YouTrack. A few things can be done to fix this.

First off, we can rename the run books to resemble the exception messages to make them easier to find. The business description can be included as a sub-title, or in the body of the run book.

Next is to update the exceptions thrown to include a link to the relevant run book, so that when someone opens a ticket in YouTrack, they can see the link to how to solve it.
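
A sketch of what that could look like, with an invented exception and wiki URL:

public class MessageProcessingException : Exception
{
    public MessageProcessingException(string message)
        : base(message + Environment.NewLine +
               "Run book: https://wiki.internal.example/run-books/message-processing")
    {
    }
}

As the link is part of the exception message, it shows up in the YouTrack ticket without any other changes being needed.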

Third, and my favourite, is to get rid of the run book entirely, by automating it. If the run book contains a fix like “run x query, and if it has rows, change this column to mark the job as processed”, then we can just make the code itself handle this case, or modify the code to prevent this case from happening at all.
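
For that example, the automated version might be as small as the job marking itself as processed when it completes (table and column names invented):

public void MarkJobProcessed(int jobId)
{
    // instead of a human running the run book's query, the job closes
    // itself off at the end of a successful run
    using (var connection = new SqlConnection(_connectionString))
    using (var command = new SqlCommand(
        "update Jobs set Processed = 1 where Id = @id and Processed = 0",
        connection))
    {
        command.Parameters.AddWithValue("@id", jobId);
        connection.Open();
        command.ExecuteNonQuery();
    }
}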

Overall

Overall, I enjoyed being on call this week. Nothing caught fire too much, and I have learnt quite a bit about various parts of our system which I don’t often interact with. But on-call (and thus on-boarding) can be improved - so that’s what I am going to do. Hopefully when a new person enters the on-call schedule, they will have an even easier time getting up to speed.

support, alarms, oncall

---

Vagrant in the world of Docker

22 Oct 2017

I gave a little talk at work recently on my use of Vagrant, what it is, and why it is still useful in a world full of Docker containers.

So, What is Vagrant?

Vagrant is a product by HashiCorp for scripting the creation of (temporary) virtual machines. It’s pretty fast at creating a virtual machine too, as it builds them from a base image (known as a “box”).

It also supports multiple virtualisation tools, such as VirtualBox and Hyper-V. If you are already using Packer to create AMIs for your Amazon infrastructure, you can modify your packerfile to also output a Vagrant box.

As an example, this is a really basic Vagrantfile for creating an Ubuntu box:

Vagrant.configure("2") do |config|
  config.vm.box = "hashicorp/precise64"

  config.vm.provider "hyperv" do |h|
    h.vmname = "UbuntuPrecise"
    h.cpus = 2
    h.memory = 2048
  end
end

The virtual machine can then be managed from the command line:

vagrant up # creates the virtual machine
vagrant ssh # ssh into the virtual machine
vagrant destroy -f # destroy the virtual machine

What can I use it for?

Personally, I have three main uses for Vagrant boxes: performance and environment testing, cluster testing, and complete environment setup.

Performance and Environment Testing

When I am developing a service which will be deployed to AWS, we tend to know roughly what kind of instance it will be deployed to, for example a T2.Small. The code we develop locally performs well…but that is on a development machine with anywhere from 4 to 16 CPU cores, 8 to 32 GB of memory, and SSD storage. How do you know what performance will be like when running on a 2 core, 2048 MB machine in AWS?

While you can’t emulate AWS exactly, it has certainly helped us tune applications - for example, modifying how many parallel messages to handle when receiving from RabbitMQ (you can see how to configure this in my previous post, Concurrency in RabbitMQ).

Cluster Testing

When you want to test a service which will operate in a cluster, Vagrant comes to the rescue again - you can use the define block to setup multiple copies of the machine, and provide common provisioning:

Vagrant.configure("2") do |config|
  config.vm.box = "hashicorp/precise64"
  config.vm.provision "shell", inline: <<SCRIPT
  # a bash script to setup your service might go here
SCRIPT

  config.vm.provider "hyperv" do |h|
    h.vmname = "UbuntuPrecise"
    h.cpus = 1
    h.memory = 1024
  end

  config.vm.define "first"
  config.vm.define "second"
  config.vm.define "third"
end

If you want to do more configuration of your separate instances, you can provide a block to do so:

  config.vm.define "third" do |third|
    third.vm.provision "shell", inline: "./vagrant/boot-cluster.sh"
  end

Complete Environment

If you’re developing a microservice in an environment with many other microservices which it needs to interact with, it can be a pain to set up all the hosts and supporting infrastructure.

Instead, we can create a single base box which contains all of the setup and services; each microservice can then have a Vagrantfile based off the base box, which also stops the service you are developing, and starts the version located in the /vagrant share instead:

Vagrant.configure("2") do |config|
  config.vm.box = "mycorp/complete-environment"

  config.vm.provider "hyperv" do |h|
    h.vmname = "WhateverServiceEnvironment"
    h.cpus = 4
    h.memory = 4096
  end

  # replace-service is a script which stops/removes an existing service,
  # and installs/starts a replacement. it uses a convention which expects
  # a service to have a script at `/vagrant/<name>/bin/<name>.sh`
  config.vm.provision "shell", inline: "./replace-service.sh WhateverService"

end

In this case, the mycorp/complete-environment box would have all the services installed and started, and also a script in the machine root which does all the work to replace a service with the one under development.

This base box could also be used to provide a complete testing environment - just create a Vagrantfile with no additional provisioning, and call vagrant up.

Couldn’t you use Docker for this instead?

Well yes, you can use Docker…but for some tasks, this is just easier. We can also utilise Docker as both input and output here: the base image could run Docker internally to run all the services, or we could use a Packer script to generate both a Docker container of this setup and a Vagrant box.

Just because Docker is the cool thing to be using these days, doesn’t mean Vagrant doesn’t have any uses any more. Far from it!

code, docker, vagrant, testing

---

Testing RabbitMQ Concurrency in MassTransit

11 Oct 2017

We have a service which consumes messages from a RabbitMQ queue - for each message, it makes a few HTTP calls, collates the results, does a little processing, and then pushes the results to a third-party API. One of the main benefits of having this behind a queue is our usage pattern - the queue usually only has a few messages in it per second, but periodically it will get a million or so messages within 30 minutes (so from ~5 messages/second to ~560 messages/second.)

Processing this spike of messages takes ages, and while this service is only on a T2.Medium machine (2 CPUs, 4GB Memory), it only uses 5-10% CPU while processing the messages, which is clearly pretty inefficient.

We use MassTransit when interacting with RabbitMQ as it provides us with a lot of useful features, but by default it sets the number of messages to be processed in parallel to Environment.ProcessorCount * 2. For this project that means 4 messages, and as the process is IO bound, it stands to reason that we could increase that concurrency a bit. Or a lot.

The existing MassTransit setup looks pretty similar to this:

_bus = Bus.Factory.CreateUsingRabbitMq(rabbit =>
{
    var host = rabbit.Host(new Uri("rabbitmq://localhost"), h =>
    {
        h.Username("guest");
        h.Password("guest");
    });

    rabbit.ReceiveEndpoint(host, "SpikyQueue", endpoint =>
    {
        endpoint.Consumer(() => new TestConsumer());
    });
});

The Test (Driven Development)

As we like testing things, I wrote a test to validate the degree of concurrency we have. We use a real instance of RabbitMQ (started with Docker as part of the build), and have a test message and consumer. Due to the speed of RabbitMQ delivery, we make the consumer take a little bit of time before returning:

class TestMessage
{
    public int Value { get; set; }
}

class TestConsumer : IConsumer<TestMessage>
{
    public async Task Consume(ConsumeContext<TestMessage> context)
    {
        await Task.Delay(600);
    }
}

The final piece of our puzzle is an IConsumeObserver, which will count the number of messages processed in parallel, as well as the total number of messages processed. We will use the total number of messages to know when our test can stop running, and the parallel number to prove whether our concurrency changes worked.

What this observer is doing is the following, but as we are in a multithreaded environment, we need to use the Interlocked class and do a bit more work to make sure we don’t lose values:

PreConsume:
    currentPendingDeliveryCount++
    maxPendingDeliveryCount = Math.Max(maxPendingDeliveryCount, currentPendingDeliveryCount)
PostConsume:
    currentPendingDeliveryCount--

The actual ConsumeCountObserver code is as follows:

class ConsumeCountObserver : IConsumeObserver
{
    int _deliveryCount;
    int _currentPendingDeliveryCount;
    int _maxPendingDeliveryCount;

    readonly int _messageCount;
    readonly TaskCompletionSource<bool> _complete;

    public ConsumeCountObserver(int messageCount)
    {
        _messageCount = messageCount;
        _complete = new TaskCompletionSource<bool>();
    }

    public int MaxDeliveryCount => _maxPendingDeliveryCount;
    public async Task Wait() => await _complete.Task;

    Task IConsumeObserver.ConsumeFault<T>(ConsumeContext<T> context, Exception exception) => Task.CompletedTask;

    Task IConsumeObserver.PreConsume<T>(ConsumeContext<T> context)
    {
        Interlocked.Increment(ref _deliveryCount);

        var current = Interlocked.Increment(ref _currentPendingDeliveryCount);
        while (current > _maxPendingDeliveryCount)
            Interlocked.CompareExchange(ref _maxPendingDeliveryCount, current, _maxPendingDeliveryCount);

        return Task.CompletedTask;
    }

    Task IConsumeObserver.PostConsume<T>(ConsumeContext<T> context)
    {
        Interlocked.Decrement(ref _currentPendingDeliveryCount);

        if (_deliveryCount == _messageCount)
            _complete.TrySetResult(true);

        return Task.CompletedTask;
    }
}

Finally, we can put the actual test together: we publish some messages, connect the observer, and start processing. When the observer indicates we have finished, we assert that the MaxDeliveryCount was the same as the ConcurrencyLimit:

[Test]
public async Task WhenTestingSomething()
{
    for (var i = 0; i < MessageCount; i++)
        await _bus.Publish(new TestMessage { Value = i });

    var observer = new ConsumeCountObserver(MessageCount);
    _bus.ConnectConsumeObserver(observer);

    await _bus.StartAsync();
    await observer.Wait();
    await _bus.StopAsync();

    observer.MaxDeliveryCount.ShouldBe(ConcurrencyLimit);
}

The Problem

The problem we had was actually increasing the concurrency: There are two things you can change, .UseConcurrencyLimit(32) and .PrefetchCount = 32, but doing this doesn’t work:

_bus = Bus.Factory.CreateUsingRabbitMq(rabbit =>
{
    var host = rabbit.Host(new Uri("rabbitmq://localhost"), h =>
    {
        h.Username("guest");
        h.Password("guest");
    });

    rabbit.ReceiveEndpoint(host, "SpikeyQueue", endpoint =>
    {
        endpoint.UseConcurrencyLimit(ConcurrencyLimit);
        endpoint.PrefetchCount = (ushort) ConcurrencyLimit;

        endpoint.Consumer(() => new TestConsumer());
    });
});

Or well…it does work, if the ConcurrencyLimit is less than the default. After a lot of trial and error, it turns out there are not two things you can change, but four:

  • rabbit.UseConcurrencyLimit(val)
  • rabbit.PrefetchCount = val
  • endpoint.UseConcurrencyLimit(val)
  • endpoint.PrefetchCount = val

This makes sense (kind of): you can set limits on the factory, and then the endpoints can use any value less than or equal to the factory limits. My process of trial and error to work out which settings needed to be set was:

  1. Set them all to 32
  2. Run test
    • if it passes, remove one setting, go to 2.
    • if it fails, add last setting back, remove a different setting, go to 2.

After iterating these steps for a while, it turns out that for my use case I need to set rabbit.UseConcurrencyLimit(val) and endpoint.PrefetchCount = val:

_bus = Bus.Factory.CreateUsingRabbitMq(rabbit =>
{
    var host = rabbit.Host(new Uri("rabbitmq://localhost"), h =>
    {
        h.Username("guest");
        h.Password("guest");
    });

    rabbit.UseConcurrencyLimit(ConcurrencyLimit);
    rabbit.ReceiveEndpoint(host, "SpikeyQueue", endpoint =>
    {
        endpoint.PrefetchCount = (ushort) ConcurrencyLimit;
        endpoint.Consumer(() => new TestConsumer());
    });
});

Interestingly, no matter where you set the PrefetchCount value, it doesn’t show up in the RabbitMQ web dashboard.

Hopefully this helps someone else struggling to get higher concurrency out of MassTransit.

code, masstransit, rabbitmq, testing

---

Composite Decorators with StructureMap

04 Oct 2017

While I was developing my Crispin project, I ended up needing to create a bunch of implementations of a single interface, and then use all those implementations at once (for metrics logging).

The interface looks like so:

public interface IStatisticsWriter
{
    Task WriteCount(string format, params object[] parameters);
}

And we have a few implementations already:

  • LoggingStatisticsWriter - writes to an ILogger instance
  • StatsdStatisticsWriter - pushes metrics to StatsD
  • InternalStatisticsWriter - aggregates metrics for exposing via Crispin’s api
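
Each writer is small; as a sketch, the logging version looks something like this (reconstructed from memory, rather than copied from the Crispin source):

public class LoggingStatisticsWriter : IStatisticsWriter
{
    private readonly ILogger _logger;

    public LoggingStatisticsWriter(ILogger logger)
    {
        _logger = logger;
    }

    public Task WriteCount(string format, params object[] parameters)
    {
        _logger.LogInformation(format, parameters);
        return Task.CompletedTask;
    }
}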

To use all of these together, I created a fourth implementation called CompositeStatisticsWriter (a name I made up, but which apparently matches the Gang of Four definition of a composite!)

public class CompositeStatisticsWriter : IStatisticsWriter
{
    private readonly IStatisticsWriter[] _writers;

    public CompositeStatisticsWriter(IEnumerable<IStatisticsWriter> writers)
    {
        _writers = writers.ToArray();
    }

    public async Task WriteCount(string format, params object[] parameters)
    {
        await Task.WhenAll(_writers
            .Select(writer => writer.WriteCount(format, parameters))
            .ToArray());
    }
}

The problem with doing this is that StructureMap throws an error about a bi-directional dependency:

StructureMap.Building.StructureMapBuildException : Bi-directional dependency relationship detected!
Check the StructureMap stacktrace below:
1.) Instance of Crispin.Infrastructure.Statistics.IStatisticsWriter (Crispin.Infrastructure.Statistics.CompositeStatisticsWriter)
2.) All registered children for IEnumerable<IStatisticsWriter>
3.) Instance of IEnumerable<IStatisticsWriter>
4.) new CompositeStatisticsWriter(*Default of IEnumerable<IStatisticsWriter>*)
5.) Crispin.Infrastructure.Statistics.CompositeStatisticsWriter
6.) Instance of Crispin.Infrastructure.Statistics.IStatisticsWriter (Crispin.Infrastructure.Statistics.CompositeStatisticsWriter)
7.) Container.GetInstance<Crispin.Infrastructure.Statistics.IStatisticsWriter>()

After attempting to solve this myself in a few different ways (you can even watch the stream of my attempts), I asked in the StructureMap Gitter chat room, and received this answer:

This has come up a couple times, and yeah, you’ll either need a custom convention or a policy that adds the other ITest’s to the instance for CompositeTest as inline dependencies so it doesn’t try to make Composite a dependency of itself – Jeremy D. Miller

Finally, Babu Annamalai provided a simple implementation when I got stuck (again).

The result is the creation of a custom convention for registering the composite, which provides all the implementations I want it to wrap:

public class CompositeDecorator<TComposite, TDependents> : IRegistrationConvention
    where TComposite : TDependents
{
    public void ScanTypes(TypeSet types, Registry registry)
    {
        var dependents = types
            .FindTypes(TypeClassification.Concretes)
            .Where(t => t.CanBeCastTo<TDependents>() && t.HasConstructors())
            .Where(t => t != typeof(TComposite))
            .ToList();

        registry
            .For<TDependents>()
            .Use<TComposite>()
            .EnumerableOf<TDependents>()
            .Contains(x => dependents.ForEach(t => x.Type(t)));
    }
}

To use this, the StructureMap configuration changes from this:

public CrispinRestRegistry()
{
    Scan(a =>
    {
        a.AssemblyContainingType<Toggle>();
        a.WithDefaultConventions();
        a.AddAllTypesOf<IStatisticsWriter>();
    });

    var store = BuildStorage();

    For<IStorage>().Use(store);
    For<IStatisticsWriter>().Use<CompositeStatisticsWriter>();
}

To this version:

public CrispinRestRegistry()
{
    Scan(a =>
    {
        a.AssemblyContainingType<Toggle>();
        a.WithDefaultConventions();
        a.Convention<CompositeDecorator<CompositeStatisticsWriter, IStatisticsWriter>>();
    });

    var store = BuildStorage();
    For<IStorage>().Use(store);
}

And now everything works successfully, and I have a Pull Request open on StructureMap’s repo with an update to the documentation about this.

Hopefully this helps someone else too!

code, structuremap, di, ioc

---

Integration Testing with Dotnet Core, Docker and RabbitMQ

02 Oct 2017

When building libraries, it is a good idea to have not only a large suite of Unit Tests, but also a suite of Integration Tests.

For one of my libraries (RabbitHarness) I have a set of tests which check it behaves as expected against a real instance of RabbitMQ. Ideally these tests will always be run, but sometimes RabbitMQ just isn’t available, such as when running on AppVeyor builds, or when I haven’t started my local RabbitMQ Docker container.

Skipping tests if RabbitMQ is not available

First off, I prevent the tests from running if RabbitMQ is not available by using a custom XUnit FactAttribute:

public class RequiresRabbitFactAttribute : FactAttribute
{
	private static readonly Lazy<bool> IsAvailable = new Lazy<bool>(() =>
	{
		var factory = new ConnectionFactory { HostName = "localhost", RequestedConnectionTimeout = 1000 };

		try
		{
			using (var connection = factory.CreateConnection())
				return connection.IsOpen;
		}
		catch (Exception)
		{
			return false;
		}
	});

	public override string Skip
	{
		get { return IsAvailable.Value ? "" : "RabbitMQ is not available";  }
		set { /* nothing */ }
	}
}

This attribute will try connecting to a RabbitMQ instance on localhost once per test run, and will cause any test with the attribute to be skipped if RabbitMQ is not available.
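
Using it is the same as using a normal [Fact] (the test here is just a placeholder):

public class RabbitHarnessTests
{
	[RequiresRabbitFact]
	public void When_connecting_to_a_queue()
	{
		// test against the real RabbitMQ instance here
	}
}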

Build Script & Docker

I decided the build script should start a RabbitMQ container and use that for the tests; I didn’t want to re-use my standard RabbitMQ instance, which I use for all kinds of things, and which may well be broken at any given time.

As my build script is just a bash script, I can check if the docker command is available, and then start a container if it is (relying on the assumption that if docker is available, I can start a container).

if [ -x "$(command -v docker)" ]; then
  CONTAINER=$(docker run -d --rm -p 5672:5672 rabbitmq:3.6.11-alpine)
  echo "Started RabbitMQ container: $CONTAINER"
fi

If docker is available, we start a new container. I use rabbitmq:3.6.11-alpine as it is a tiny image with no frills, and start it with the -d and --rm flags, which start the container in detached mode (i.e. the docker run command returns immediately) and delete the container when it is stopped, taking care of cleanup for us! I only bother binding the main data connection port (5672), as that is all we are going to be using. Finally, the container’s ID, which is returned by the docker run command, is stored in the CONTAINER variable.

I recommend putting this step as the very first part of your build script, as it gives the container time to start RabbitMQ and be ready for connections while the rest of your build is running. Without this, I found I needed a sleep 5 command afterwards to pause the script for a short time.

The script then continues on with the normal build process:

dotnet restore "$NAME.sln"
dotnet build "$NAME.sln" --configuration $MODE

find . -iname "*.Tests.csproj" -type f -exec dotnet test "{}" --configuration $MODE \;
dotnet pack ./src/$NAME --configuration $MODE --output ../../.build

Once this is all done, I have another check that docker exists, and stop the container we started earlier, using the container ID stored in CONTAINER:

if [ -x "$(command -v docker)" ]; then
  docker stop $CONTAINER
fi

And that’s it! You can see the full build script for RabbitHarness here.

The only problem with this script is that if you try to start a RabbitMQ container while you already have one running, the docker run command will fail. The build should still succeed, as the already-running instance of RabbitMQ will work for the tests, and the docker stop command will just complain that it can’t find a container with a blank ID.

I think I will be using this technique more to help provide isolation for builds - the Microsoft/mssql-server-linux containers might be very useful for some of our work codebases (which do work against the Linux instances of MSSQL, even if they weren’t designed to!)

code, dotnetcore, rabbitmq, docker, testing

---