- read

A Deep Look Into Golang Profile-Guided Optimization (PGO)

Yahya SayadArbabi 71

Photo by James Harrison on Unsplash

Intro

Profile-Guided Optimization (PGO) is an optimization method which improves final compiled binary by using profile data hints and compiling the code based on those profiles.

There are several optimization mechanisms that compilers usually get into account when compiling your code to binary. Like dead code elimination, Register allocation, Constant folding or function inlining; you may split your code into smaller and smaller functions and different level of abstractions to ease up future changes and modifications, but from compiler point of view multiple calls to different functions may not be very optimize and sometimes compiler decides to inline your functions!

Although there’s a limit to these optimizations. Compiler can’t inline everything! this will result in bloating binary files, L-Cache overhead and performance degradation. But with providing information about how the program is being run and which functions are being constantly called, the compiler can inline those hot functions and reduce function call and return overhead. It’s the same with register allocation and other methods.

How does it work?

From go version v1.21, you only need to provide:

  1. A pprof CPU profile, which can be collect with runtime/pprof or net/http/pprof
  2. Renaming it to default.pgo and move it to same directory as main.go or pass it to pgo flag in go build command

Let’s take a look at the following code and see how it works:

By running go build -o before_pgo ./main.go we compile the code and create the before_pgo file in the same directory and by running that binary with ./before_pgo we can create the default.pgo that we need to optimize our binary.

Before going any further, let’s see how the compiler uses our fancy default.pgo file to improve things. By running go build -pgo=off -gcflags -m main.go we get the following results: -gcflags -m flag is to see compiler optimizations and since we already have default.pgo we need to use -pgo=off to not use it for now.

# command-line-arguments
./main.go:26:6: can inline Func2
./main.go:27:13: inlining call to fmt.Println
./main.go:22:13: inlining call to fmt.Println
./main.go:23:7: inlining call to Func2
./main.go:23:7: inlining call to fmt.Println
./main.go:11:19: inlining call to os.Create
./main.go:22:13: ... argument does not escape
./main.go:22:14: "print in func #1" escapes to heap
./main.go:23:7: ... argument does not escape
./main.go:23:7: "print in func #2" escapes to heap
./main.go:27:13: ... argument does not escape
./main.go:27:14: "print in func #2" escapes to heap

As you can see, the compiler is inlining a lot of things, such as our Func2, Very good without any PGO, right?

Now Let’s try it out with fancy -pgo option, Running go build -pgo=auto -gcflags -m main.go will give us:

# command-line-arguments
./main.go:26:6: can inline Func2
./main.go:27:13: inlining call to fmt.Println
./main.go:21:6: can inline Func1
./main.go:22:13: inlining call to fmt.Println
./main.go:23:7: inlining call to Func2
./main.go:11:19: inlining call to os.Create
./main.go:17:8: inlining call to Func1
./main.go:27:13: ... argument does not escape
./main.go:27:14: "print in func #2" escapes to heap
./main.go:17:8: ... argument does not escape
./main.go:17:8: "print in func #1" escapes to heap
./main.go:22:13: ... argument does not escape
./main.go:22:14: "print in func #1" escapes to heap
./main.go:23:7: ... argument does not escape
./main.go:23:7: "print in func #2" escapes to heap

Now we can see that our Func1 is also inlined, And why is that? It’s because by default, go compiler doesn’t inline any non-leaf function. But by providing some profile data and telling it how our program is actually working! It understands that we’re calling that function 1,000,000 times and it’s better to be inlined.

Dig deeper inside the binaries!

Now let’s compile our code, this time with profile data that we have and compare the binaries. Now that we have default.pgo in the same directory as main.go, let’s run go build -o after_pgo ./main.go and first of all, let’s compare the size of our binaries by running ls -atrl we get:

total 4424
drwxr-xr-x 4 yahya yahya 4096 Aug 18 18:25 ..
-rw-r--r-- 1 yahya yahya 65 Aug 18 18:25 go.mod
-rw-r--r-- 1 yahya yahya 407 Aug 18 18:26 main.go
-rwxr-xr-x 1 yahya yahya 2250107 Aug 18 18:41 before_pgo
-rw-r--r-- 1 yahya yahya 2639 Aug 18 18:44 default.pgo
-rwxr-xr-x 1 yahya yahya 2252917 Aug 18 19:02 after_pgo
drwxr-xr-x 2 yahya yahya 4096 Aug 18 19:06 .

after_pgo file is slightly bigger, and we know that’s because of some inlining. Let’s checkout inside these binaries and see what we can find. I’m going to use objdump that exists in the go tool chain, so we can run go tool objdump before_pgo and go tool objdump after_pgo and compare the result. The following is only the part of inlining the Func1 version.

The following is before and after pgo objdump binaries:

// before_pgo
TEXT main.main(SB) /home/yahya/go/src/github.com/theyahya/go-pgo-playground/example1/main.go
main.go:9 0x4a32e0 493b6610 CMPQ SP, 0x10(R14)
main.go:9 0x4a32e4 0f86c1000000 JBE 0x4a33ab
main.go:9 0x4a32ea 55 PUSHQ BP
main.go:9 0x4a32eb 4889e5 MOVQ SP, BP
main.go:9 0x4a32ee 4883ec50 SUBQ $0x50, SP
main.go:9 0x4a32f2 440f117c2440 MOVUPS X15, 0x40(SP)
main.go:9 0x4a32f8 c644242700 MOVB $0x0, 0x27(SP)
main.go:11 0x4a32fd 90 NOPL
file.go:323 0x4a32fe 488d05590b0200 LEAQ 0x20b59(IP), AX
file.go:323 0x4a3305 bb0b000000 MOVL $0xb, BX
file.go:323 0x4a330a b942020000 MOVL $0x242, CX
file.go:323 0x4a330f bfb6010000 MOVL $0x1b6, DI
file.go:323 0x4a3314 e88709feff CALL os.OpenFile(SB)
main.go:12 0x4a3319 440f117c2430 MOVUPS X15, 0x30(SP)
main.go:12 0x4a331f 488d159a000000 LEAQ main.main.func1(SB), DX
main.go:12 0x4a3326 4889542430 MOVQ DX, 0x30(SP)
main.go:12 0x4a332b 4889442438 MOVQ AX, 0x38(SP)
main.go:12 0x4a3330 488d542430 LEAQ 0x30(SP), DX
main.go:12 0x4a3335 4889542448 MOVQ DX, 0x48(SP)
main.go:12 0x4a333a c644242701 MOVB $0x1, 0x27(SP)
main.go:13 0x4a333f 4889c3 MOVQ AX, BX
main.go:13 0x4a3342 488d05bf2d0400 LEAQ go:itab.*os.File,io.Writer(SB), AX
main.go:13 0x4a3349 e89259ffff CALL runtime/pprof.StartCPUProfile(SB)
main.go:14 0x4a334e 488d1503880200 LEAQ 0x28803(IP), DX
main.go:14 0x4a3355 4889542440 MOVQ DX, 0x40(SP)
main.go:14 0x4a335a c644242703 MOVB $0x3, 0x27(SP)
main.go:14 0x4a335f 31c0 XORL AX, AX
main.go:16 0x4a3361 eb12 JMP 0x4a3375
main.go:16 0x4a3363 4889442428 MOVQ AX, 0x28(SP)
main.go:17 0x4a3368 e8b3000000 CALL main.Func1(SB)
main.go:16 0x4a336d 488b442428 MOVQ 0x28(SP), AX
main.go:16 0x4a3372 48ffc0 INCQ AX
main.go:16 0x4a3375 483d40420f00 CMPQ AX, $0xf4240
main.go:16 0x4a337b 7ee6 JLE 0x4a3363
main.go:19 0x4a337d c644242701 MOVB $0x1, 0x27(SP)
main.go:19 0x4a3382 e8395dffff CALL runtime/pprof.StopCPUProfile(SB)
main.go:19 0x4a3387 c644242700 MOVB $0x0, 0x27(SP)
main.go:19 0x4a338c 488b542448 MOVQ 0x48(SP), DX
main.go:19 0x4a3391 488b02 MOVQ 0(DX), AX
main.go:19 0x4a3394 ffd0 CALL AX
main.go:19 0x4a3396 4883c450 ADDQ $0x50, SP
main.go:19 0x4a339a 5d POPQ BP
main.go:19 0x4a339b c3 RET
main.go:19 0x4a339c 0f1f4000 NOPL 0(AX)
main.go:19 0x4a33a0 e8fbd5f8ff CALL runtime.deferreturn(SB)
main.go:19 0x4a33a5 4883c450 ADDQ $0x50, SP
main.go:19 0x4a33a9 5d POPQ BP
main.go:19 0x4a33aa c3 RET
main.go:9 0x4a33ab e830a1fbff CALL runtime.morestack_noctxt.abi0(SB)
main.go:9 0x4a33b0 e92bffffff JMP main.main(SB)
// after_pgo
TEXT main.main(SB) /home/yahya/go/src/github.com/theyahya/go-pgo-playground/example1/main.go
main.go:9 0x4a3b80 493b6610 CMPQ SP, 0x10(R14)
main.go:9 0x4a3b84 0f86f2000000 JBE 0x4a3c7c
main.go:9 0x4a3b8a 55 PUSHQ BP
main.go:9 0x4a3b8b 4889e5 MOVQ SP, BP
main.go:9 0x4a3b8e 4883ec60 SUBQ $0x60, SP
main.go:9 0x4a3b92 440f117c2450 MOVUPS X15, 0x50(SP)
main.go:9 0x4a3b98 c644242700 MOVB $0x0, 0x27(SP)
main.go:11 0x4a3b9d 90 NOPL
file.go:323 0x4a3b9e 488d05b9020200 LEAQ 0x202b9(IP), AX
file.go:323 0x4a3ba5 bb0b000000 MOVL $0xb, BX
file.go:323 0x4a3baa b942020000 MOVL $0x242, CX
file.go:323 0x4a3baf bfb6010000 MOVL $0x1b6, DI
file.go:323 0x4a3bb4 e84704feff CALL os.OpenFile(SB)
main.go:12 0x4a3bb9 440f117c2430 MOVUPS X15, 0x30(SP)
main.go:12 0x4a3bbf 488d15da000000 LEAQ main.main.func1(SB), DX
main.go:12 0x4a3bc6 4889542430 MOVQ DX, 0x30(SP)
main.go:12 0x4a3bcb 4889442438 MOVQ AX, 0x38(SP)
main.go:12 0x4a3bd0 488d542430 LEAQ 0x30(SP), DX
main.go:12 0x4a3bd5 4889542458 MOVQ DX, 0x58(SP)
main.go:12 0x4a3bda c644242701 MOVB $0x1, 0x27(SP)
main.go:13 0x4a3bdf 4889c3 MOVQ AX, BX
main.go:13 0x4a3be2 488d052f260400 LEAQ go:itab.*os.File,io.Writer(SB), AX
main.go:13 0x4a3be9 e89259ffff CALL runtime/pprof.StartCPUProfile(SB)
main.go:14 0x4a3bee 488d15637f0200 LEAQ 0x27f63(IP), DX
main.go:14 0x4a3bf5 4889542450 MOVQ DX, 0x50(SP)
main.go:14 0x4a3bfa c644242703 MOVB $0x3, 0x27(SP)
main.go:14 0x4a3bff 31c0 XORL AX, AX
main.go:16 0x4a3c01 eb47 JMP 0x4a3c4a
main.go:16 0x4a3c03 488d5001 LEAQ 0x1(AX), DX
main.go:16 0x4a3c07 4889542428 MOVQ DX, 0x28(SP)
main.go:17 0x4a3c0c 90 NOPL
main.go:22 0x4a3c0d 440f117c2440 MOVUPS X15, 0x40(SP)
main.go:22 0x4a3c13 488d35868a0000 LEAQ 0x8a86(IP), SI
main.go:22 0x4a3c1a 4889742440 MOVQ SI, 0x40(SP)
main.go:22 0x4a3c1f 488d3dea200400 LEAQ 0x420ea(IP), DI
main.go:22 0x4a3c26 48897c2448 MOVQ DI, 0x48(SP)
main.go:22 0x4a3c2b 488d442440 LEAQ 0x40(SP), AX
main.go:22 0x4a3c30 bb01000000 MOVL $0x1, BX
main.go:22 0x4a3c35 4889d9 MOVQ BX, CX
main.go:22 0x4a3c38 e8234dfeff CALL fmt.Println(SB)
main.go:22 0x4a3c3d 0f1f00 NOPL 0(AX)
main.go:23 0x4a3c40 e8bb000000 CALL main.Func2(SB)
main.go:16 0x4a3c45 488b442428 MOVQ 0x28(SP), AX
main.go:16 0x4a3c4a 483d40420f00 CMPQ AX, $0xf4240
main.go:16 0x4a3c50 7eb1 JLE 0x4a3c03
main.go:19 0x4a3c52 c644242701 MOVB $0x1, 0x27(SP)
main.go:19 0x4a3c57 e8045dffff CALL runtime/pprof.StopCPUProfile(SB)
main.go:19 0x4a3c5c c644242700 MOVB $0x0, 0x27(SP)
main.go:19 0x4a3c61 488b542458 MOVQ 0x58(SP), DX
main.go:19 0x4a3c66 488b02 MOVQ 0(DX), AX
main.go:19 0x4a3c69 ffd0 CALL AX
main.go:19 0x4a3c6b 4883c460 ADDQ $0x60, SP
main.go:19 0x4a3c6f 5d POPQ BP
main.go:19 0x4a3c70 c3 RET
main.go:19 0x4a3c71 e80acdf8ff CALL runtime.deferreturn(SB)
main.go:19 0x4a3c76 4883c460 ADDQ $0x60, SP
main.go:19 0x4a3c7a 5d POPQ BP
main.go:19 0x4a3c7b c3 RET
main.go:9 0x4a3c7c 0f1f4000 NOPL 0(AX)
main.go:9 0x4a3c80 e83b9afbff CALL runtime.morestack_noctxt.abi0(SB)
main.go:9 0x4a3c85 e9f6feffff JMP main.main(SB)

In before_pgo it’s calling Func1 but in after_pgo it’s being inlined.

A more realistic example (API)

Now that we’ve got a grasp of what’s going on, let’s do a more realistic example; an http endpoint which returns a simple json. I’m going to use github.com/json-iterator/go which I found out about a couple days ago in one of my endless github explorations!

It’s a simple endpoint that reads a .json file and unmarshal and marshal (we do this to create some fake overhead to this endpoint) again and then write it to response. I also Added net/http/pprof to be able to collect profile data.

Now we can go build -o before_pgo ./main.go and then run it by ./before_pgo. While it’s up we create traffic with go-wrk -d 20 http://localhost:8080 (Check out github.com/tsliwowicz/go-wrk). And while the traffic is going through, we can collect the profile data for 10 seconds with curl -o default.pgo "http://localhost:8080/debug/pprof/profile?seconds=10"

Result of go-wrk command should be something like this:

Running 20s test @ http://localhost:8080
10 goroutine(s) running concurrently
747453 requests in 19.582312899s, 322.20MB read
Requests/sec: 38169.80
Transfer/sec: 16.45MB
Avg Req Time: 261.987µs
Fastest Request: 33.156µs
Slowest Request: 15.862181ms
Number of Errors: 0

Now let’s recompile our code with the profile data that we have and then run the benchmark again and compare the results. With go build -o after_pgo ./main.go and checking the files that we have, we can see that the size of the binary is already 200KB bigger and this means that probably lots of inlining should’ve happened there!

-rwxr-xr-x 1 yahya yahya 8.3M Aug 22 17:35 before_pgo
-rw-r--r-- 1 yahya yahya 60K Aug 22 17:38 default.pgo
-rw-r--r-- 1 yahya yahya 1017 Aug 22 17:47 main.go
drwxr-xr-x 2 yahya yahya 4.0K Aug 22 18:21 .
-rwxr-xr-x 1 yahya yahya 8.5M Aug 22 18:21 after_pgo

And this is the go-wrk result.

Running 20s test @ http://localhost:8080
10 goroutine(s) running concurrently
821349 requests in 19.574756327s, 354.05MB read
Requests/sec: 41959.60
Transfer/sec: 18.09MB
Avg Req Time: 238.324µs
Fastest Request: 28.559µs
Slowest Request: 12.996215ms
Number of Errors: 0

Not bad! Increased number of requests that we can handle by almost 4,000.

If you’re interested about how compiler does these magics, I recommend you to check out these first commits in go project: https://github.com/golang/go/commit/99862cd57dd9406180f238e8fa2fcc1d435fb163.

It’s the first commit of PGO implementation in the Go compiler and only by scrolling you can get a grasp of how it’s working and how it’s checking for hot nodes in AST.

Automation & Workflow

For profile data to be completely accurate and useful, it needs to come from production with real traffic. But it’s hard to keep an updated profile and constantly replace it. It also could easily become stale or even if you have a big refactor with big changes, your previous profile data from production isn’t useful to compile your code with.

With less than 5% overhead, It’s safe to use profiler in your production. Even major cloud providers like GCP offer solutions to collect your profile data from their native cloud products with a nice dashboard to see the results. It’s also very helpful in cases that you’ve goroutine/memory leak or trying to debug any performance related issue that is only happening in production and hard to reproduce locally!

With that out of the way (having profiler as a default in your code), Automating your builds to use profile data and, as the design proposal offers: having it in your git history as a file (default.pgo) could be an overkill. But if you really care about performance or/and cost, and you’ve exhausted all the other improvements, this could be your chance to make it perfect!

Code

You can find all the codes that we wrote in this post and the related files and benchmarks in this github repo: github.com/theyahya/go-pgo-playground

If you like these kind of topics, join my 📮 Newsletter to receive similar articles every once in a while, there won’t be any spams! promise!

Further reading

Still interested and curious about the PGO and how it works and you want to follow the rabbit hole? I recommend the following links:

  1. PGO related commits in go compiler: https://github.com/golang/go/commits/master/src/cmd/compile/internal/pgo
  2. AutoFDO: Automatic Feedback-Directed Optimization for Warehouse-Scale Applications
  3. Proposal: profile-guided optimization
  4. Profile-guided optimizations