
Profile Guided Optimization (PGO) in Go

Elif Seray Dönmez Çelik

With Go 1.20, the Go compiler started to support Profile Guided Optimization (referred to as PGO from here on) as a mechanism to optimize builds. In this article, I will share my experience with PGO and show how you can use it in your own projects.

Introduction

Before explaining how to enable PGO in our projects, I want to briefly introduce what PGO is.

There are many optimizations that the compiler can't or won't perform because it doesn't have the necessary context about what happens at runtime. Some values simply aren't known at compile time; they only become known at runtime. For example, an expensive division instruction can only be replaced by a much cheaper bit-shift operation when the divisor is known at compile time.
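As a small illustration of this point (the function names below are made up for this example), a divisor that is a compile-time constant power of two can be turned into a cheap shift, while a divisor that only arrives at runtime forces the compiler to emit a real division:

func scaleByConstant(x uint64) uint64 {
	// The divisor is known at compile time, so the compiler can emit a shift (x >> 3).
	return x / 8
}

func scaleByParam(x, d uint64) uint64 {
	// The divisor is only known at runtime, so a full division instruction is needed.
	return x / d
}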

PGO, also known as feedback-directed optimization (FDO), is used to optimize the performance of an application, without changing a single line of code, by using profiles collected at runtime. In Go's case, the compiler currently uses these profiles mainly to decide which hot functions to inline more aggressively.

BuyBox Calculation Function

To see the real effects of PGO on performance, I used a function that performs a lot of mathematical calculations; in other words, it is CPU-bound.

There are certain metrics used in the calculation of the buybox* system, and each metric has a certain rate. In this function, we calculate the buybox score of a product by computing each multiplier according to the given rates.
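The real metric names and rates are internal, so I can't share the actual implementation; the sketch below is purely hypothetical and only illustrates the kind of CPU-heavy work being profiled. The signature loosely mirrors how the function is called later in this article (it takes a request and returns the scores together with an error).

import "math"

// Hypothetical sketch: the request fields, rates, and formula are illustrative only.
type ScoreRequest struct {
	Prices        []float64
	DeliveryDays  []float64
	SellerRatings []float64
}

func CalculateBuyBoxScores(request ScoreRequest) ([]float64, error) {
	scores := make([]float64, len(request.Prices))
	for i := range request.Prices {
		// Each metric contributes a multiplier weighted by a hypothetical rate.
		priceScore := 0.5 / math.Log1p(request.Prices[i])
		deliveryScore := 0.3 / (1 + request.DeliveryDays[i])
		ratingScore := 0.2 * math.Sqrt(request.SellerRatings[i]/5)
		scores[i] = priceScore + deliveryScore + ratingScore
	}
	return scores, nil
}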

How to Use

  • The compiler uses CPU profiles as input for optimization, so first of all we need to enable pprof in our application. You can use any library for collecting profiles, such as runtime/pprof or net/http/pprof. In my trials, I used gin-contrib/pprof.
pprof.Register(router)
  • We should collect our first profiles from an initial binary built without PGO. You can collect these profiles from production, from testing environments, or from a representative benchmark. Because microbenchmarks represent only a small portion of your application, the resulting optimizations will also be small. The most important thing is that the profiles should represent the real behavior of your application. To collect accurate profile data, we should collect profiles from different instances of the production application at different times and merge them into a single profile.
// to collect profiles for 30 seconds
http://url_of_your_application.com/debug/pprof/profile?seconds=30

// to merge multiple pprof files
go tool pprof -proto profile1 profile2 > merged
  • Next, we will use the collected profiles in our application's build step. Go expects a single pgo file per main package, so first we need to take the merged pprof file and name it default.pgo. After preparing the pgo file, we need to put it in the source directory of the main package. We should commit this file to the repo because it will be an input of our build (see the combined sketch after this list).
  • To enable pgo, we should add the -pgo flag to the go build command. In Go 1.20, the pgo flag is set to off by default, but with Go 1.21, the default setting will be -pgo=auto. If you wonder what else comes with Go 1.21, you can read my friend's article. If you have a more complex scenario, for example if you need to use different profiles for different scenarios, you can pass the path of a specific pgo file instead of using "auto".
go build -pgo=off // disables pgo. default in go 1.20
go build -pgo=auto // enables pgo and uses default.pgo file under source directory. default in go 1.21
go build -pgo=path_of_pgo_file/name_of_pgo_file.pgo // enables pgo and uses the given pgo file at the specified path
  • After completing the steps above, we can build our application and release it. At this point, we can finally compare the performance of the current and previous builds.
  • For continuous optimization, we should keep collecting profiles from production after enabling pgo. When we need to release a new binary, we can use these newer profiles as input.
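Putting the steps above together, a minimal end-to-end sketch looks roughly like the commands below. The instance URL is the placeholder used earlier, and ./cmd/app is an assumed location of the main package, not the real project layout.

# collect a 30-second CPU profile from two different instances (repeat as needed)
curl -o profile1 "http://url_of_your_application.com/debug/pprof/profile?seconds=30"
curl -o profile2 "http://url_of_your_application.com/debug/pprof/profile?seconds=30"

# merge them into a single profile
go tool pprof -proto profile1 profile2 > merged

# name the merged profile default.pgo, put it next to the main package, and commit it
cp merged ./cmd/app/default.pgo
git add ./cmd/app/default.pgo

# build with pgo enabled; auto picks up default.pgo in the main package directory
go build -pgo=auto ./cmd/app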

Go calls this workflow an “Iterative Lifecycle” and we can summarize it in four items:

  1. Build and release a binary without PGO
  2. Collect profiles from production (preferably)
  3. When you need to release a different binary, build the latest source with the prod profile
  4. Continue with step 2

Results

I did my experiments in our testing environment, so the CPU and memory usage figures don't represent our real load; the test load is only a small portion of the production application's usage.

Without PGO

My initial build of the application was without pgo. I had 1 pod in the testing environment. Its average CPU usage was 0.0529 and memory usage was 39.6 MB.

without pgo

I also wanted to compare the execution speed of the function when given the same parameters. I used the code below to create load in my local environment and measured the elapsed time of 1000 function calls. The average elapsed time was 475.81 ms.

Elapsed Times:
average = 475.81522404 ms
max = 498.559668 ms
min = 468.904482 ms
for i := 0; i < 25; i++ {
	start := time.Now()
	for j := 0; j < 1000; j++ {
		_, _ = service.CalculateBuyBoxScores(request)
	}
	elapsed := time.Since(start)
	fmt.Println("Elapsed time: ", elapsed)
}
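As a side note, the same loop can be written as a standard Go benchmark in the service's package (importing testing); run with go test and the -cpuprofile flag, it can also produce a representative CPU profile locally. The service and request values are the same assumptions as in the loop above.

// hypothetical benchmark, e.g. run with:
//   go test -bench=CalculateBuyBoxScores -cpuprofile=cpu.pprof
func BenchmarkCalculateBuyBoxScores(b *testing.B) {
	for i := 0; i < b.N; i++ {
		_, _ = service.CalculateBuyBoxScores(request)
	}
}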

With PGO

I collected some profiles during the first step and used them for the second step. Using the profiles from the build without pgo, I built a new binary with pgo enabled and released it. The average CPU usage of the new binary was 0.0597 and memory usage was 35.5 MB. I measured the elapsed time of 1000 function calls using the code block from the first step, and the average elapsed time was 474.04 ms.

Even though the memory usage and average elapsed time decreased, the CPU usage increased. The elapsed time was also higher than I expected, so I decided to try to optimize the performance even more.

Elapsed Times:
average = 474.03999344 ms
max = 488.723976 ms
min = 464.657134 ms
pgo applied — first iteration

With PGO — 2nd Iteration

To optimize the performance further, I collected profiles from the binary built with pgo, so that the compiler could learn from the already optimized version. After collecting the profiles, I built the application with pgo again. The average CPU usage of the new binary was 0.0489 and memory usage was 35.1 MB. The average elapsed time of 1000 function calls was 471.66 ms.

In this second iteration, I finally saw the improvement I was expecting. The execution time of the function improved, along with the memory and CPU usage.

Elapsed Times:
average = 471.65862048 ms
max = 491.186735 ms
min = 467.712773 ms
pgo applied — second iteration

Conclusion

The Go documentation states that the expected performance improvement is around 2–7%. In my experiments, the operation time of the function decreased from 475.81 ms to 471.66 ms (a decrease of about 4.16 ms, or roughly 0.9%), and the average CPU usage dropped from 0.0529 to 0.0489 (about 7.6%).

We need to commit the pgo file to the repository (mine was around 12 KB). As the Go documentation mentions, binary sizes may increase due to additional function inlining.

The documentation also says build times may increase when pgo is enabled, but I didn't see a significant change there.

In summary, I saw some performance improvement even in the testing environment, and I would like to try PGO in production to see its effects there as well.

(*) Buybox: when more than one seller sells the same product, the buybox is the system that uses a set of algorithmic metrics to select the seller that will provide the maximum benefit to customers and promotes that seller's offer.