SketchNotes: Finding Your Service Boundaries

10 Sep 2018

At NDC Oslo this year, I attended Adam Ralph’s talk on Finding Your Service Boundaries. I enjoyed it a lot, and once the video came out, I rewatched it and decided to have a go at doing a “sketchnotes”, which I shared on Twitter. People liked it!

I’ve never done one before, but it was pretty fun. I made it in OneNote, zoomed out a lot, and took a screenshot.

Click for huuuuge!

(image: sketchnotes - finding your service boundaries)

sketchnotes, conference

---

Semantic Configuration Validation: Earlier

08 Sep 2018

After my previous post on Validating Your Configuration, one of my colleagues made an interesting point, paraphrasing:

I want to know if the configuration is valid earlier than that. At build time preferably. I don’t want my service to not start if part of it is invalid.

There are two points here, namely when to validate, and what to do with the results of validation.

Handling Validation Results

If your configuration is invalid, you’d think the service should fail to start, as it might be configured in a dangerous manner. While this makes sense for some services, others might need to work differently.

Say you have an API which supports both writing and reading of a certain type of resource. The read side will return you a resource of some form, and the write side will trigger processing of a resource (and return you a 202 Accepted, obviously).

What happens if your configuration problem only affects the write side of the API? Should you prevent people from reading too? Probably not, but again, it depends on your domain as to what makes sense.

Validating at Build Time

This is the far more interesting point (to me). How can we modify our build to validate that the environment’s configuration is valid? We have the code to do the validation: we have automated tests, and we have a configuration validator class (in this example, implemented using FluentValidation).

Depending on where your master configuration is stored, the next step can get much harder.

Local Configuration

If your configuration is in the current repository (as it should be) then it will be no problem to read.

public class ConfigurationTests
{
    // one test case per value in the Environments enum
    public static IEnumerable<object[]> AvailableEnvironments => Enum
        .GetValues(typeof(Environments))
        .Cast<Environments>()
        .Select(e => new object[] { e });

    [Theory]
    [MemberData(nameof(AvailableEnvironments))]
    public void Environment_specific_configuration_is_valid(Environments environment)
    {
        // build the configuration in the same way the application does on startup
        var config = new ConfigurationBuilder()
            .AddJsonFile("config.json")
            .AddJsonFile($"config.{environment}.json", optional: true)
            .Build()
            .Get<AppConfiguration>();

        var validator = new AppConfigurationValidator();
        validator.ValidateAndThrow(config);
    }
}

Given the following two configuration files, we can make the test pass or fail:

config.json:

{
  "Callback": "https://localhost",
  "Timeout": "00:00:30",
  "MaxRetries": 100
}

config.local.json:

{
  "MaxRetries": 0
}
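
For reference, the AppConfigurationValidator could look something like the sketch below. The exact rules are assumptions, with the MaxRetries rule chosen so that the config.local.json override above fails validation:

public class AppConfigurationValidator : AbstractValidator<AppConfiguration>
{
    public AppConfigurationValidator()
    {
        RuleFor(x => x.Callback)
            .Must(url => url.Scheme == Uri.UriSchemeHttps);

        RuleFor(x => x.Timeout)
            .GreaterThan(TimeSpan.Zero);

        // a MaxRetries of 0 fails this rule, making config.local.json invalid
        RuleFor(x => x.MaxRetries)
            .GreaterThan(0);
    }
}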

Remote Configuration

But what if your configuration is not in the local repository, or at least, not completely there? For example, you might have a lot of configuration in Octopus Deploy, and would like to validate that at build time too.

Luckily, Octopus has a REST API (and an accompanying client library) which you can use to query the values. All we need to do is replace the AddJsonFile calls with an AddInMemoryCollection() call and populate a dictionary from somewhere:

[Theory]
[MemberData(nameof(AvailableEnvironments))]
public async Task Octopus_environment_configuration_is_valid(Environments environment)
{
    var variables = await FetchVariablesFromOctopus(
        "MyDeploymentProjectName",
        environment);

    var config = new ConfigurationBuilder()
        .AddInMemoryCollection(variables)
        .Build()
        .Get<AppConfiguration>();

    var validator = new AppConfigurationValidator();
    validator.ValidateAndThrow(config);
}

Reading the variables from Octopus’ API requires a bit of work, as you don’t appear to be able to ask for all the variables which would apply if you deployed to a specific environment, which forces you to build that logic yourself. However, if you are only using Environment scoping, it shouldn’t be too hard.
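
As a rough sketch, FetchVariablesFromOctopus could be implemented with the Octopus.Client package along these lines. The server URL and API key are placeholders, and the scoping logic assumes variables are either unscoped or scoped by Environment only:

private static async Task<Dictionary<string, string>> FetchVariablesFromOctopus(
    string projectName,
    Environments environment)
{
    var client = await OctopusAsyncClient.Create(
        new OctopusServerEndpoint("https://octopus.internal", "API-YOUR-KEY-HERE"));

    var project = await client.Repository.Projects.FindByName(projectName);
    var octopusEnvironment = await client.Repository.Environments.FindByName(environment.ToString());
    var variableSet = await client.Repository.VariableSets.Get(project.VariableSetId);

    // a variable applies if it is unscoped, or is scoped to this environment
    return variableSet.Variables
        .Where(v => !v.Scope.ContainsKey(ScopeField.Environment)
                 || v.Scope[ScopeField.Environment].Contains(octopusEnvironment.Id))
        .ToDictionary(v => v.Name, v => v.Value);
}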

Time Delays

Verifying the configuration at build time when your state is fetched from a remote store is not going to solve all your problems, as this little diagram illustrates:

(diagram: tests pass, a user changes a value, deployment happens, startup fails)

You need to validate in both places: early on in your process, and on startup. How you handle the configuration being invalid doesn’t have to be the same in both places:

  • In the build/test phase, fail the build
  • On startup, raise an alarm, but start if reasonable

Again, how you handle the configuration errors when your application is starting is down to your domain, and what your application does.

configuration, c#, strongtyping, stronk, validation

---

Feature Toggles with Consul

06 Sep 2018

Feature Toggles are a great way of helping to deliver working software, although there are a few things which could go wrong. See my talk Feature Toggles: The Good, The Bad and The Ugly for some interesting stories and insights on it!

I was talking with a colleague the other day about how you could go about implementing Feature Toggles in a centralised manner in an existing system, preferably with as little overhead as possible. The most obvious answer is to use a SaaS solution such as LaunchDarkly, but what if you either don’t want to, or can’t, use a SaaS solution?

What if we are already using Consul for things such as service discovery? Could we use its key-value store as a basic Feature Toggle service? It has a few advantages:

  • Consul is already in place, so there is no extra infrastructure required and no additional costs
  • Low cost of stopping - if we decide we don’t want to use Consul for this, or not to use Toggles at all, we can just stop
  • Low learning curve - we know how to use Consul already
  • Security - we can make use of Consul’s ACL to allow services to only read, and operators to write Feature Toggles.

There are also some downsides to consider too:

  • We’d effectively be reinventing the wheel
  • There won’t be any “value protection” on the settings (nothing stopping us putting an int into a field which will be parsed as a Guid, for example)
  • No statistics - we won’t be able to tell whether a value is still being used
  • No fine-grained control - unless we build some extra hierarchies, everyone gets the same value for a given key

So what would our system look like?

(diagram: values written to one Consul node’s KV store are distributed to the other Consul instances)

It’s pretty straightforward. We already have a Consul cluster, there are several machines with Consul clients running on them, and there is a Container Host with Consul too.

Any configuration written to one Consul node is replicated to all the other nodes, so our user can write values to any node and they will reach the rest of the cluster.
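
To give an idea of what consuming a toggle looks like, here is a minimal sketch using the Consul NuGet package. The toggles/ key prefix and treating a missing key as “off” are assumptions for the example:

public static class Toggles
{
    public static async Task<bool> IsEnabled(string toggleName)
    {
        using (var client = new ConsulClient())
        {
            var result = await client.KV.Get($"toggles/{toggleName}");

            // a missing key is treated as "toggle off"
            if (result.Response == null)
                return false;

            var value = Encoding.UTF8.GetString(result.Response.Value);
            return bool.TryParse(value, out var enabled) && enabled;
        }
    }
}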

As mentioned earlier, we can use the ACL system to lock things down: our services will have a read-only role, and our updating user will have a write role.
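
For example, the services’ read-only access could be granted with something like the following, written in Consul’s ACL rule syntax (again assuming the toggles/ key prefix):

key "toggles/" {
  policy = "read"
}

The updating user’s role would use the same rule, but with policy = "write".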

What Next?

Assuming this system covers enough of what we want to do, the next steps might be to make some incremental improvements in functionality, although again I would suggest looking into not reinventing the wheel…

Statistics

While we can’t use Consul itself to collect statistics on which keys are being read, we could provide this functionality by writing a small client library which logs the queries and sends them somewhere for aggregation.

Most microservice environments have centralised logging or monitoring (and if they don’t, they really should), so we can use this to record toggle usage.
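
A minimal sketch of such a wrapper, reusing the Toggles.IsEnabled example from earlier and assuming a Microsoft.Extensions.Logging style logger which feeds the centralised system:

public class ToggleClient
{
    private readonly ILogger _logger;

    public ToggleClient(ILogger logger) => _logger = logger;

    public async Task<bool> IsEnabled(string toggleName)
    {
        var enabled = await Toggles.IsEnabled(toggleName);

        // record every read, so usage can be aggregated centrally
        _logger.LogInformation("Toggle {ToggleName} read, value was {Enabled}", toggleName, enabled);

        return enabled;
    }
}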

This information would be useful to have in the same place you set the feature toggles, which brings us nicely onto the next enhancement.

User Interface

A simple static website could be used to read all the Toggles, their current states, and their statistics, and to provide a way of setting them. The UI could be further expanded to give some type safety, such as extra data indicating what type a given key’s value should be.

Fine-Grained Values

Currently, everyone has the same value for a given key, but the system could be expanded to be more fine-grained. Rather than storing a feature toggle in the current form:

/kv/toggles/fast-rendering => true

We could add another level which would indicate a grouping:

/kv/toggles/fast-rendering/early-access => true
/kv/toggles/fast-rendering/others => false
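
A sketch of how a client could resolve the grouped form, reusing the earlier Toggles example (the group names, and falling back to the others key, are assumptions):

public static async Task<bool> IsEnabled(string toggleName, string group)
{
    using (var client = new ConsulClient())
    {
        // prefer the group-specific value, falling back to the default group
        var result = await client.KV.Get($"toggles/{toggleName}/{group}");

        if (result.Response == null)
            result = await client.KV.Get($"toggles/{toggleName}/others");

        if (result.Response == null)
            return false;

        var value = Encoding.UTF8.GetString(result.Response.Value);
        return bool.TryParse(value, out var enabled) && enabled;
    }
}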

At this point though, you are starting to add a lot of complexity. Think about whether you are solving the right problem! Choose where you are spending your Innovation Tokens.

Wrapping Up

Should you do this? Maybe. Probably not. I don’t know your system and what infrastructure you have available, so I don’t want to give any blanket recommendations.

I will, however, suggest that if you are starting out with Feature Toggles, go for something simple first. My current team’s first use of a Feature Toggle was a setting in the web.config, and we changed its value when we wanted the new functionality to come on.

See what works for you, and if you start needing something more complicated than just simple key-value toggles, have a look into an existing system.

microservices, consul, featuretoggles, architecture

---

Validate Your Configuration

26 Aug 2018

As I have written many times before, your application’s configuration should be strongly typed, and should be validated when it is loaded at startup.

This means checking not only that the source values (typically all represented as strings) can be converted to the target types (int, Uri, TimeSpan, etc.), but also that the values are semantically valid.

For example, if you have a web.config file with the following AppSetting, and a configuration class to go with it:

<configuration>
  <appSettings>
    <add key="Timeout" value="20" />
  </appSettings>
</configuration>

public class Configuration
{
    public TimeSpan Timeout { get; private set; }
}

We can now load the configuration using Stronk (or Microsoft.Extensions.Configuration if you’re on dotnet core), and inspect the contents of the Timeout property:

var config = new StronkConfig().Build<Configuration>();

Console.WriteLine(config.Timeout); // 20 days, 0 hours, 0 minutes, 0 seconds

Oops. A timeout of 20 days is probably a little on the high side! This happened because the string value is parsed with TimeSpan.Parse(value), which interprets a bare number as days if no other units are specified.

How to validate?

There are several ways we could go about fixing this. We could switch to TimeSpan.ParseExact, but then we need to supply a format string from somewhere, or force people to use Stronk’s own choice of format strings.

Instead, we can just write some validation logic ourselves. If it is a simple configuration, then writing a few statements inline is probably fine:

var config = new StronkConfig()
    .Validate.Using<Configuration>(c =>
    {
        // throw when the timeout is outside the valid range
        if (c.Timeout <= TimeSpan.Zero || c.Timeout >= TimeSpan.FromSeconds(60))
            throw new ArgumentOutOfRangeException(nameof(c.Timeout), "Must be greater than 0, and less than 1 minute");
    })
    .Build<Configuration>();

But we can make it much clearer by using a validation library, such as FluentValidation:

var config = new StronkConfig()
    .Validate.Using<Configuration>(c => new ConfigurationValidator().ValidateAndThrow(c))
    .Build<Configuration>();

public class ConfigurationValidator : AbstractValidator<Configuration>
{
    private static readonly HashSet<string> ValidHosts = new HashSet<string>(
        new[] { "localhost", "internal" },
        StringComparer.OrdinalIgnoreCase);

    public ConfigurationValidator()
    {
        RuleFor(x => x.Timeout)
            .GreaterThan(TimeSpan.Zero)
            .LessThan(TimeSpan.FromMinutes(2));

        RuleFor(x => x.Callback)
            .Must(url => url.Scheme == Uri.UriSchemeHttps)
            .Must(url => ValidHosts.Contains(url.Host));
    }
}

Here, not only are we checking that the Timeout is in a valid range, but also that our Callback is HTTPS and going to a domain on an allow-list.

What should I validate?

Everything? If you have a property controlling the number of threads the application uses, checking that it’s a positive number and less than x * Environment.ProcessorCount (for some value of x) is probably a good idea.

If you are specifying callback URLs in the config file, checking they are in the right domain/scheme would be a good idea (e.g. must be https, must be in a domain allow-list).

How do you check your configuration isn’t going to bite you when an assumption turns out to be wrong?

configuration, c#, strongtyping, stronk, validation

---

Branching and Red Builds

10 Aug 2018

So this is a bit of a rant… but hopefully with some solutions and workarounds too. Let’s kick things off with a nice statement:

I hate broken builds.

Everyone basically agrees on this point, I think. The problem is that I mean all builds, including ones on shared feature branches.

Currently, I work on a number of projects which use small(ish) feature branches. The way this works is that the team agrees on a new feature to work on and creates a branch for it; each developer then works on tasks, committing to their own task branches, and Pull-Requesting into the feature branch. Once the feature branch is complete, it’s deployed and merged to master. We’ll ignore the fact that Trunk Based Development is just better, for now.

(diagram: developers working on small task branches which are merged into a feature branch)

The problem occurs when one of the first tasks to be completed is writing the behaviour (or acceptance) tests. These are written in something like SpecFlow, and call out to stubbed methods which throw NotImplementedException. When this gets merged, the feature branch build goes red, and stays red until all the other tasks are done - and probably for a little while afterwards too. Nothing like “red-green-refactor” when your light can’t change away from red!
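
As an illustration, a freshly stubbed SpecFlow binding might look something like this (the step text and class name are made up for the example):

[Binding]
public class UserSignupSteps
{
    [When("a user signs up with a valid email")]
    public void WhenAUserSignsUpWithAValidEmail()
    {
        // not implemented yet: every build fails here until the task is done
        throw new NotImplementedException();
    }
}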

The Problems

  • Local tests are failing, no matter how much you implement
  • Pull Requests to the feature branch don’t have passing build checks
  • The failing build is failing because:
    • Not everything is implemented yet
    • A developer has introduced an error, and no one has noticed yet
    • The build machine is playing up

(diagram: the same branching flow, but with every task and feature branch build failing)

Bad Solutions

The first thing we could do is not run the acceptance tests on a Task branch’s build, and only run them when a feature branch build happens. This is a bad idea: someone will forget to check whether their task’s acceptance tests pass, and it will take extra effort later to fix the broken ones.

We could also write the acceptance criteria file without calling any stubbed methods, making it a plain text file rather than an executable test. This is also a pretty bad idea - how much would you like to bet that it stays non-executable?

The Solution

Don’t have the acceptance tests as a separate task. Instead, split the acceptance criteria among the implementation tasks. This does mean that your other tasks should be Vertical Slices rather than Horizontal ones, which can be difficult to do depending on the application’s architecture.

An Example

So let’s dream up a super simple Acceptance Criteria:

  • When a user signs up with a valid email which has not been used, they receive a welcome email with an activation link
  • When a user signs up with an invalid email, they get a validation error
  • When a user signs up with an in-use email, they get an error

Note how this is already pretty close to being the tasks for the feature? Our tasks are pretty much:

  • implement the happy path
  • implement other scenarios

Of course, this means that not everything can be done in parallel - I imagine you’d want the happy path task to be done first, and then the other scenarios are probably parallelisable.

So our trade-off here is that we lose some parallelisation but gain feedback. While this may seem a small change, it has a significant impact on the overall delivery rate: everyone knows whether their tasks are complete, and when the build goes red, you can be sure of what introduced the problem.

Not to mention that features are rarely this small - you probably have various separate acceptance criteria, such as being able to view an account page.

Oh, and once you can split your tasks correctly, there is only a small step to getting to do Trunk Based Development. Which would make me happy.

And developer happiness is important.

rant, git, ci, process, productivity, testing

---