I recently encountered a behaviour in Nginx that I didn’t expect, and it caused a production outage in the process. While I would love to blame DNS, the usual culprit in network-related issues, in this case the fault lies with Nginx.

I was running a very simple Nginx proxy, relaying an internal service to the outside world. The internal service is behind an AWS ALB, and the Nginx configuration was proxying to the ALB’s FQDN:

http {
  server {
    listen              8000;
    server_name         server.example.com;

    location ~* ^/some/path {
      proxy_pass              https://some.internal.alb.address.amazonaws.com;
      proxy_set_header        Host $host;
      proxy_read_timeout      120;
      proxy_ignore_headers    Cache-Control;
      proxy_ignore_headers    Expires;
      proxy_ignore_headers    Set-Cookie;
    }
  }
}

The proxy worked fine for several weeks, until suddenly it didn’t. Stranger still, when we checked the internal site directly, it was up and responding. No services had been deployed, and no infrastructure had been changed. We restarted the Nginx service, and everything started working again.

It took two details to explain this. The first is that AWS can, and does, change the IP addresses associated with its load balancers. This can happen for reasons we cannot observe, as the underlying implementation of AWS load balancers is a black box; one known reason is the load balancer scaling to handle more or less traffic. There is no API that we are aware of that reports these changes; the only way we know of to detect them is to run dig in a loop and send the results to our observability tool when they change.
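For the curious, a minimal sketch of that loop might look like the following. The hostname and the 60-second interval are placeholders, and in practice we ship the changed line to our observability tool rather than echoing it:

#!/usr/bin/env bash
# Poll the ALB's A records and report whenever the set of IPs changes.
# The hostname and interval below are illustrative placeholders.
alb_host="some.internal.alb.address.amazonaws.com"
prev=""
while true; do
  # dig +short prints only the addresses; sort gives a stable, comparable set
  current=$(dig +short "$alb_host" | sort | tr '\n' ' ')
  if [ -n "$prev" ] && [ "$current" != "$prev" ]; then
    # In practice, send this to your observability tool instead of stdout
    echo "$(date -u +%FT%TZ) ALB IPs changed: $prev -> $current"
  fi
  prev="$current"
  sleep 60
done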

The second detail is how Nginx resolves DNS. My initial expectation was that it worked like most DNS clients: query an address on the first request, and query again once the TTL has elapsed. It turns out my assumption was wrong; by default, Nginx resolves the hostnames in proxy_pass once, when the configuration is loaded (at startup or reload), and never again. That is also why restarting the service fixed things: the restart forced a fresh lookup.
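You can see exactly what Nginx is ignoring by querying the record yourself; ALB names typically come back with a short TTL (the 60-second value and the addresses in this transcript are illustrative):

$ dig +noall +answer some.internal.alb.address.amazonaws.com
some.internal.alb.address.amazonaws.com. 60 IN A 10.0.12.34
some.internal.alb.address.amazonaws.com. 60 IN A 10.0.56.78

A well-behaved client would discard its cached answer after those 60 seconds; Nginx, by default, holds on to it for the life of the process.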

So with these two facts, we can see why the proxy stopped working: at some point the ALB dropped whichever IP address(es) Nginx had received from DNS at startup, and Nginx carried on sending traffic to them. There are two ways to fix this, and both rest on the same mechanism: when proxy_pass references a variable instead of a literal hostname, Nginx resolves the name at runtime using the resolver directive, rather than once at startup.

The first way is to force Nginx to cache every resolved IP for a fixed time window, regardless of the record’s TTL, using the valid parameter of the resolver directive:

http {
+  # 169.254.169.253 is the VPC-provided DNS endpoint on AWS; substitute your resolver
+  resolver 169.254.169.253 valid=30s;

  server {
    listen              8000;
    server_name         server.example.com;

    location ~* ^/some/path {
      # unchanged, except that proxy_pass must reference a variable
      # (see the second fix below) for the resolver to take effect
    }
  }
}

The second fix is to have Nginx re-resolve the upstream whenever its DNS record expires, i.e. based on the record’s TTL. It uses the same resolver directive, this time without the valid override, plus a variable in proxy_pass so that the name is resolved at request time rather than at startup:

http {
+  # no valid= parameter, so Nginx honours the TTL of each answer
+  resolver 169.254.169.253;

  server {
    listen              8000;
    server_name         server.example.com;
+    set $upstream some.internal.alb.address.amazonaws.com;

    location ~* ^/some/path {
-     proxy_pass              https://some.internal.alb.address.amazonaws.com;
+     proxy_pass              https://$upstream;
      proxy_set_header        Host $host;
      proxy_read_timeout      120;
      proxy_ignore_headers    Cache-Control;
      proxy_ignore_headers    Expires;
      proxy_ignore_headers    Set-Cookie;
    }
  }
}
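Whichever fix you pick, it can be applied without the full restart we resorted to during the outage; assuming the nginx binary is on your PATH and you can signal the master process, something like:

# Check the new configuration parses cleanly
nginx -t
# Gracefully reload workers so in-flight requests are not dropped
nginx -s reload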
While I am glad there are two easy ways to solve this issue, I still find the default “only resolve once at startup” behaviour odd, as it goes against the principle of least surprise; I would expect Nginx to re-query based on the TTL of the DNS record. I suspect this behaviour exists for performance reasons, but I don’t know for sure.