Maximizing Resilience with Graceful Shutdown in Cloud-Native Golang Applications

Benjamin Cane
Published in Level Up Coding

Graceful shutdown is vital to building highly resilient systems in a cloud-native environment. When we build cloud-native applications that follow the “cattle and not pets” philosophy, creating and destroying instances of applications is standard operating procedure.

This article will discuss why it’s important to implement graceful shutdown mechanisms and how to do so properly in Go. While we’ll use Go to implement these concepts, they can be applied to any cloud-native application regardless of language.

Why Graceful Shutdown is important

There are two main reasons it is essential to shut down applications gracefully. The first is to free up resources used by the applications, and the second is to help maintain transactional integrity.

Resource clean up

Let’s explore the first scenario, freeing up resources.

The above diagram shows a typical web application architecture; we have a client, a load balancer, several instances of our application, and a database.

Like most applications that connect to a database, our example above opens a pool of connections to the database. Typically these connections are opened either on boot or as needed and are re-used while processing requests to the application.

A well-designed application will cleanly close these connections on shutdown. When this doesn’t happen, the connections are left lingering from the database's perspective. Most databases can enforce per-user connection limits, and it’s not atypical to see these limits set in production systems.

When an application doesn’t close its database connections correctly, these lingering connections count against those limits. This typically isn’t an issue if only one or two instances shut down uncleanly; but when many instances do, the lingering connections can prevent newly started instances from connecting.
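As a minimal sketch of what "cleanly closing" looks like in Go, the example below uses the standard database/sql package; the lib/pq driver import and the connection string are placeholders for illustration only. Releasing the pool is a single call; the important part is making sure that call actually runs as part of the shutdown steps described later in this article.

package main

import (
	"database/sql"
	"log"

	// Placeholder driver import; any database/sql-compatible driver works.
	_ "github.com/lib/pq"
)

func main() {
	// sql.Open creates a connection pool; connections are established lazily.
	db, err := sql.Open("postgres", "postgres://user:pass@db.example.com/app")
	if err != nil {
		log.Fatalf("unable to create connection pool: %s", err)
	}

	// Closing the pool returns every connection to the database so they no
	// longer count against its per-user connection limits. In a real service,
	// this should run during graceful shutdown, not rely on defer alone.
	defer db.Close()

	// ... serve traffic ...
}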

Transactional integrity

The second scenario can be even more important than avoiding resource contention.

Our new diagram shows an example of a financial application where an available balance is updated for each request, and an entry is made into a ledger.

Proper implementations of Graceful Shutdown will ensure that applications wait for all outstanding requests to finish before shutting down. If an application does not wait for outstanding requests, a situation might occur where requests are only half processed. In our example, that would mean the available balance is updated, but the ledger is not.

Depending on the application, this could be very problematic.

Critics will say that applications should be resilient enough to deal with this scenario anyway, because the application could crash due to circumstances outside its control. That is true, and we should be able to recover from half-processed transactions; but we should also limit how often we get into situations where half-processed transactions occur.

Just because we can recover doesn’t mean we want to exercise that option with every instance shut down.
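Graceful shutdown reduces how often half-processed work happens, but the data layer can help as well. Below is a hedged sketch (using database/sql; the table names, columns, and Postgres-style placeholders are assumptions, not from this article) of wrapping the balance update and ledger entry in a single database transaction, so that a mid-request crash rolls back both writes rather than leaving one applied.

package ledger

import (
	"context"
	"database/sql"
)

// debit updates the available balance and records a ledger entry atomically.
// The schema and placeholder syntax here are illustrative assumptions.
func debit(ctx context.Context, db *sql.DB, account string, amount int64) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	// Rollback is a no-op once the transaction has been committed.
	defer tx.Rollback()

	if _, err := tx.ExecContext(ctx,
		"UPDATE balances SET available = available - $1 WHERE account = $2",
		amount, account); err != nil {
		return err
	}

	if _, err := tx.ExecContext(ctx,
		"INSERT INTO ledger (account, amount) VALUES ($1, $2)",
		account, amount); err != nil {
		return err
	}

	// Either both writes are committed or neither is.
	return tx.Commit()
}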

Steps for Implementing Graceful Shutdown

A poor implementation of a Graceful Shutdown can cause just as many, if not more, problems than no implementation. Let’s break down what a proper implementation should do step by step.

  1. Trap a Signal.
  2. Start returning an error for Readiness probes.
  3. Wait for traffic redirection (based on the Readiness probe return).
  4. Stop the Listener.
  5. Wait for outstanding requests to complete.
  6. Cleanly close resources (i.e., close file handlers, close database connections, etc.).
  7. Stop the application.

Within this list, there are two concepts that may be new to some readers: Signals and Readiness Probes.

Signals

In Unix, Linux, and other POSIX-compliant operating systems, signals are a process-control mechanism delivered to running processes. These signals tell applications when to stop, terminate, or continue, and they can be “trapped” or handled by the application.

Most users have some experience with signals via the kill command, which is run to stop running processes. The command works by telling the Kernel to send a specific signal designated by the user (-15 in the diagram above) to the running process.

The running process is then forwarded the signal by the Kernel. The receipt of this signal is where the “trapping” becomes essential. When trapped, the application can perform tasks before shutting down.

There are some signals, such as SIGKILL and SIGSTOP, which cannot be trapped by the application. These signals are handled at the Kernel level, where the Kernel itself will stop the process.

In some application runtime environments, such as Docker and Kubernetes, a process is sent a signal when stopped, but if that process is not stopped within a specified time, a SIGKILL will be used to stop the process forcefully.

Many different signals are used in various scenarios; however, some common ones can be found in the list below.

  • SIGHUP [1] — Generally used to signal a configuration refresh or, in multi-process services, to indicate that the control process should restart without restarting workers (a short example of this convention follows this list).
  • SIGINT [2] — Sent to a process from a CTRL+C press and used to stop the process.
  • SIGQUIT [3] — Sent to a process from a CTRL+\ press and used to stop the process and create a core dump.
  • SIGKILL [9] — This signal cannot be trapped by the process; instead, it triggers the Kernel to force-stop the process.
  • SIGTERM [15] — The default signal sent when executing the kill command with no options. This signal is the primary signal used to stop processes.
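To make the SIGHUP convention above concrete, here is a small, hedged sketch (the reloadConfig function is a stand-in, not part of this article's application) that treats SIGHUP as "reload" while treating SIGTERM and SIGINT as "shut down".

package main

import (
	"log"
	"os"
	"os/signal"
	"syscall"
)

// reloadConfig is a placeholder for whatever configuration refresh the
// application performs.
func reloadConfig() {
	log.Print("reloading configuration")
}

func main() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGHUP, syscall.SIGTERM, syscall.SIGINT)

	for sig := range sigs {
		switch sig {
		case syscall.SIGHUP:
			// SIGHUP: refresh configuration and keep running.
			reloadConfig()
		default:
			// SIGTERM/SIGINT: begin a graceful shutdown.
			log.Printf("received %s, shutting down", sig)
			return
		}
	}
}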

Readiness Probe

Kubernetes has popularized Readiness Probes, but the concept has been around for a long time. Within Kubernetes, the Readiness Probe is a health check (typically an HTTP call to a specific end-point) that determines an application's “readiness” to serve traffic.

These are different from Liveness Probes, which serve the purpose of identifying whether the application is running or needs to be restarted. A Readiness Probe should reflect the health of an application, but its primary purpose is as a traffic control mechanism.

In the context of application shutdown, the Readiness Probe can be used to move traffic away from the instance. This prevents new requests from being sent to the instance while it performs a shutdown.

It is essential to fail Readiness Probes and wait for traffic to redirect before stopping any listeners. If a listener is stopped before traffic is diverted away, some requests to this service will fail until the Readiness Probe detects the failure, which could take up to 30 seconds (with Kubernetes defaults of a 10-second probe interval and a three-failure threshold).

Implementing Graceful Shutdown in Go

In this section, we will create a simplified application (small enough to fit in a blog post) that implements a Graceful Shutdown as described above.

The purpose of this application is not to serve as an example of how to structure applications, but purely to show how one could perform a controlled shutdown.

package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	// Create an HTTP server with default settings
	server := &http.Server{Addr: ":8080"}

	// Create a context to be used for the runtime of the application
	runCtx, runCancel := context.WithCancel(context.Background())

	// Create a context to be used for the readiness of the application
	readyCtx, readyCancel := context.WithCancel(context.Background())

	// Create a handler for the readiness probe
	http.HandleFunc("/ready", func(w http.ResponseWriter, r *http.Request) {
		// If the ready context has been cancelled, return unavailable status
		if readyCtx.Err() != nil {
			w.WriteHeader(http.StatusServiceUnavailable)
			return
		}

		// Otherwise, set the status code to OK
		w.WriteHeader(http.StatusOK)
	})

	// Create a handler for the liveness probe
	http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		// Set the status code to OK
		w.WriteHeader(http.StatusOK)
	})

	// Start a goroutine to listen for signal traps and perform shutdown steps
	go func() {
		// Create a channel to receive signal traps
		trap := make(chan os.Signal, 1)

		// Register for the following signals: SIGTERM, SIGINT, SIGQUIT
		signal.Notify(trap, syscall.SIGTERM, syscall.SIGINT, syscall.SIGQUIT)

		// Wait for a signal to be received
		<-trap

		// Cancel the readiness context
		readyCancel()

		// Wait for the readiness probe to detect the failure
		<-time.After(30 * time.Second)

		// Shutdown the HTTP listener
		err := server.Shutdown(context.Background())
		if err != nil {
			log.Printf("Error encountered while stopping HTTP listener: %s", err)
		}

		// Cancel the runtime context
		runCancel()
	}()

	// Start the HTTP listener
	err := server.ListenAndServe()
	if err != nil && err != http.ErrServerClosed {
		log.Printf("HTTP server returned error: %s", err)
	}
	log.Print("HTTP server shutdown")

	// Wait for the shutdown steps to complete
	<-runCtx.Done()
}

In addition to utilizing signal traps, the above code leverages the context package to coordinate shutdown across goroutines. Let’s break down this code and explore how it all works together.

Create an HTTP server

 // Create an HTTP server with default settings
server := &http.Server{Addr: ":8080"}

Before doing anything else, we create an HTTP Server using the standard net/http package. This example mainly uses default settings, except for the Addr setting, which is set to listen on all network addresses on port 8080.

At this point, the server is not listening. Only an instance has been created.

Create a runtime context

 // Create a context to be used for the runtime of the application
runCtx, runCancel := context.WithCancel(context.Background())

Next, we create a Context type using the context package. Contexts are used to coordinate work across multiple goroutines. They can be cancellable, carry a deadline, or carry a timeout.

We will use a cancellable Context (runCtx) to coordinate when the application should stop executing. As we break down the application example, more examples of using Contexts will be shown.
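For readers new to the context package, the short sketch below (separate from the example application) illustrates the three flavors just mentioned: cancellable, deadline, and timeout.

package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	// Cancellable: done once cancel() is called.
	ctx1, cancel := context.WithCancel(context.Background())
	cancel()
	fmt.Println(ctx1.Err()) // context.Canceled

	// Deadline: done at a specific point in time.
	ctx2, cancel2 := context.WithDeadline(context.Background(), time.Now().Add(time.Second))
	defer cancel2()

	// Timeout: done after a duration; equivalent to a deadline of now+duration.
	ctx3, cancel3 := context.WithTimeout(context.Background(), time.Second)
	defer cancel3()

	fmt.Println(ctx2.Err(), ctx3.Err()) // both nil until they expire
}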

Create a readiness context and handler

 // Create a context to be used for the readiness of the application
readyCtx, readyCancel := context.WithCancel(context.Background())

// Create a handler for the readiness probe
http.HandleFunc("/ready", func(w http.ResponseWriter, r *http.Request) {
	// If the ready context has been cancelled, return unavailable status
	if readyCtx.Err() != nil {
		w.WriteHeader(http.StatusServiceUnavailable)
		return
	}

	// Otherwise, set the status code to OK
	w.WriteHeader(http.StatusOK)
})

The snippet above first creates another Context named readyCtx, which is again cancellable (via context.WithCancel()).

It also creates a simple HTTP handler using the http.HandleFunc() method for the /ready path. The /ready path is where I typically point my Readiness Probes, but this could be set to any path.

What the Readiness Probe's HTTP handler does is what matters most. The first thing this handler does is call the readyCtx.Err() method.

This method will return an error (context.Canceled) when the Context has been canceled. If the Context has not been canceled, the return value will be nil.

The handler above checks whether the call to readyCtx.Err() returns a non-nil value. If it does, the handler returns an HTTP Service Unavailable (503) status to the client. Otherwise, if readyCtx.Err() returns nil, an OK (200) HTTP status code is returned.

When our external Readiness Probe performs an HTTP call against the /ready end-point, our handler will be executed, and as long as our readyCtx Context is not canceled, our service will return a healthy response.
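One way to verify that behavior is a small test using the standard net/http/httptest package. The sketch below rebuilds the handler inline rather than importing it from the example application, so it is an illustration of the pattern, not a test of this article's exact code.

package main

import (
	"context"
	"net/http"
	"net/http/httptest"
	"testing"
)

// TestReadyHandler checks the readiness handler before and after cancellation.
func TestReadyHandler(t *testing.T) {
	readyCtx, readyCancel := context.WithCancel(context.Background())

	handler := func(w http.ResponseWriter, r *http.Request) {
		if readyCtx.Err() != nil {
			w.WriteHeader(http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	}

	// Before cancellation, the probe should report healthy.
	rec := httptest.NewRecorder()
	handler(rec, httptest.NewRequest(http.MethodGet, "/ready", nil))
	if rec.Code != http.StatusOK {
		t.Fatalf("expected 200, got %d", rec.Code)
	}

	// After cancellation, it should report unavailable.
	readyCancel()
	rec = httptest.NewRecorder()
	handler(rec, httptest.NewRequest(http.MethodGet, "/ready", nil))
	if rec.Code != http.StatusServiceUnavailable {
		t.Fatalf("expected 503, got %d", rec.Code)
	}
}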

Creating a shutdown goroutine

The next important piece of our example is where we kick off our shutdown goroutine.

 // Start a goroutine to listen for signal traps and perform shutdown steps
go func() {
	// Create a channel to receive signal traps
	trap := make(chan os.Signal, 1)

	// Register for the following signals: SIGTERM, SIGINT, SIGQUIT
	signal.Notify(trap, syscall.SIGTERM, syscall.SIGINT, syscall.SIGQUIT)

	// Wait for a signal to be received
	<-trap

	// Cancel the readiness context
	readyCancel()

	// Wait for the readiness probe to detect the failure
	<-time.After(30 * time.Second)

	// Shutdown the HTTP listener
	err := server.Shutdown(context.Background())
	if err != nil {
		log.Printf("Error encountered while stopping HTTP listener: %s", err)
	}

	// Cancel the runtime context
	runCancel()
}()

Since listening for signals is a blocking operation, and starting an HTTP Listener is a blocking operation, one of these two things will need to sit in a new goroutine. Typically, I like to create a new goroutine that handles all signal trapping and initiates shutdown processes.

In the snippet above, the first three lines of our closure are all we need to trap signals. Let’s break these down for a second.

First, we create a channel of type os.Signal.

 // Create a channel to receive signal traps
trap := make(chan os.Signal, 1)

Second, we register that channel and any signals we wish to trap using the signal.Notify() function from package os/signal.

 // Register for the following signals: SIGTERM, SIGINT, SIGQUIT
signal.Notify(trap, syscall.SIGTERM, syscall.SIGINT, syscall.SIGQUIT)

The Notify() function will relay any signals sent to the application to the trap channel. From here, we can wait for a signal to be posted on the channel.

 // Wait for a signal to be received
<-trap

Once we’ve received a signal on the channel, we can start our graceful shutdown tasks, beginning with canceling our readiness Context.

 // Cancel the readiness context
readyCancel()

// Wait for the readiness probe to detect the failure
<-time.After(30 * time.Second)

When we created our Context earlier, we used the context.WithCancel() function to produce a cancellable Context. That call also returned the readyCancel() function. When this function is called, it “cancels” our Context, marking it as done. That means any call to readyCtx.Err(), like the one within our readiness HTTP handler, will return the error context.Canceled.

After calling readyCancel(), we should wait for traffic to be redirected before moving on to the next shutdown task, where we stop the HTTP listener.
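As an aside, the example uses a fixed 30-second pause to keep things simple. In practice, you may want this drain window to be configurable; the hedged sketch below reads it from a hypothetical SHUTDOWN_DRAIN_SECONDS environment variable (not something the example application defines) and falls back to 30 seconds.

// drainDelay returns how long to wait for traffic to drain after failing
// readiness. SHUTDOWN_DRAIN_SECONDS is a hypothetical variable name; the
// os, strconv, and time packages must be imported.
func drainDelay() time.Duration {
	if v := os.Getenv("SHUTDOWN_DRAIN_SECONDS"); v != "" {
		if secs, err := strconv.Atoi(v); err == nil && secs >= 0 {
			return time.Duration(secs) * time.Second
		}
	}
	return 30 * time.Second
}

With a helper like this in place, the fixed wait in the shutdown goroutine could become <-time.After(drainDelay()).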

 // Shutdown the HTTP listener
err := server.Shutdown(context.Background())
if err != nil {
	log.Printf("Error encountered while stopping HTTP listener: %s", err)
}

// Cancel the runtime context
runCancel()

Once we are sure that traffic is no longer pointed to this instance, we can safely shut down the HTTP listener without disrupting new requests (as there are none). But what about existing requests?

The http.Server type has two methods for stopping the listener: http.Server.Close() and http.Server.Shutdown(). The Close() method will immediately close all active listeners and any open connections (except those that have been hijacked, e.g., WebSockets). The Shutdown() method, by contrast, will stop all active listeners and close any idle connections; active connections are left alone until they become idle.

What's nice about the Shutdown() method is that it blocks until all active connections are marked as idle and closed. This means we can wait for this method to return and then trigger the application to stop, because the method doesn’t return until all outstanding requests have been processed.

To trigger our application to stop, we will need to cancel the runtime Context we created earlier, runCtx. This Context can be canceled by calling the runCancel() function.
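One refinement worth noting: Shutdown() accepts a Context, and the example passes context.Background(), which waits indefinitely for slow requests. The hedged variant below (the 60-second bound is an arbitrary choice, not from this article) would replace the Shutdown call inside the shutdown goroutine so the process still exits if a request never completes.

// Give outstanding requests up to 60 seconds to finish. If the deadline is
// reached first, Shutdown returns the context's error and we fall back to
// forcefully closing the listener and connections.
shutdownCtx, shutdownCancel := context.WithTimeout(context.Background(), 60*time.Second)
defer shutdownCancel()

if err := server.Shutdown(shutdownCtx); err != nil {
	log.Printf("Graceful shutdown did not complete: %s", err)
	if err := server.Close(); err != nil {
		log.Printf("Error force-closing HTTP listener: %s", err)
	}
}

// Cancel the runtime context
runCancel()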

Starting the HTTP Listener

Now with our shutdown goroutine defined, we can start our HTTP Listener.

 // Start the HTTP listener
err := server.ListenAndServe()
if err != nil && err != http.ErrServerClosed {
	log.Printf("HTTP server returned error: %s", err)
}
log.Print("HTTP server shutdown")

// Wait for the shutdown steps to complete
<-runCtx.Done()

The code above doesn’t just start the listener; it does a little more. When the server.ListenAndServe() method is executed, that method will create the HTTP Listener and block until the listener is stopped.

This means that while our application is running, the primary goroutine is waiting for the server.ListenAndServe() method to return. As we discussed earlier, this will happen when a signal is caught and the server.Shutdown() method is called.

While the server.Shutdown() method will not return until all connections are closed and all outstanding requests are processed, the server.ListenAndServe() method will not wait. It returns as soon as the listener is stopped, which can happen before all requests are processed. This is why we are using the Done() method from our runtime context.

The Done() method returns a channel that is closed once the Context is marked as done. This means that when our shutdown goroutine from earlier runs the runCancel() function, the channel returned by runCtx.Done() is closed, the receive unblocks, and our application completes execution by returning from the main() function.
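As a tiny, self-contained illustration of that behavior (separate from the example application), receiving from Done() blocks until some other goroutine cancels the Context.

package main

import (
	"context"
	"fmt"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())

	// Another goroutine decides the work is done.
	go cancel()

	// Blocks until the Done channel is closed by cancel().
	<-ctx.Done()
	fmt.Println(ctx.Err()) // context.Canceled
}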

Thus, our graceful shutdown is complete.

Summary

In this article, we covered why graceful shutdown is important for cloud-native applications and how to implement it in Go. We also discussed signals, readiness probes, and Go contexts. For further information on how signals work within containers, check out my previous article on creating entry-point scripts for Docker containers.