What is profiling?
How to create a CPU profile of your program.
How to create a Heap profile of your program.
How to use the pprof tool to analyze a profile.
Some common CPU usage optimization techniques.
Profile
pprof
Protocol Buffers
protoc
CPU
Heap profile
Profiling is a program optimization technique. “To profile a program” means to collect detailed statistics about how a program runs. Those statistics can be CPU usage, memory allocation, time spent in program routines, number of function calls, etc.

But how does it differ from a benchmark? A benchmark collects runtime information about a specific function, whereas profiling collects statistics about the whole program.

Profiling is often used when a performance drop is observed. The tool is used to understand why a program underperforms.

Static analysis of the codebase can be insufficient to detect why the program behaves badly. Benchmarks test an isolated function’s performance; they are insufficient to understand the whole picture.

Profiling can also be a tool to improve program engineering. Go is a relatively performant language, but badly designed programs can suffer from performance issues. Those issues can be easily understood and corrected with profiling’s gracious help.

To profile a program, we can use the runtime/pprof package, which exposes the necessary API to start and stop profiling.

In this section, we will profile a program that sums integers. The program consists only of a main package with one function named doSum. This function will sum the integers from 0 to 787766776:
package main

import (
    "fmt"
)

func main() {
    result := doSum()
    fmt.Println(result)
}

func doSum() int {
    sum := 0
    for i := 0; i < 787766777; i++ {
        sum += i
    }
    return sum
}
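To make the contrast with benchmarking concrete: a benchmark would measure doSum in isolation. Here is a minimal sketch, assuming a hypothetical main_test.go file next to main.go (run it with go test -bench=.):

package main

import "testing"

// BenchmarkDoSum measures the doSum function in isolation;
// b.N is adjusted automatically by the testing framework.
func BenchmarkDoSum(b *testing.B) {
    for i := 0; i < b.N; i++ {
        doSum()
    }
}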
The next step is to add the call to the pprof API inside our main function:

// profiling/getting-started/main.go

f, err := os.Create("profile.pb.gz")
if err != nil {
    log.Fatal(err)
}
err = pprof.StartCPUProfile(f)
if err != nil {
    log.Fatal(err)
}
defer pprof.StopCPUProfile()

This block of code has to be placed inside the main function. It will create a file named "profile.pb.gz". Then it will start CPU profiling and write the profile result to this file (pprof.StartCPUProfile(f)). At the end of the main function, pprof.StopCPUProfile() is called; to improve readability, we have used a deferred statement.
We then have to build our program (we will call the binary "gettingstarted"):
$ go build -o gettingstarted main.go
It will create the binary file in the current directory. To collect the data, we then have to launch our program:

$ ./gettingstarted

You can see that a profile.pb.gz file has been created. You can try to open the file to visualize the result, but that’s not a very good idea: first, the file is compressed, and second, it’s in a binary format. We will have to use a tool to visualize our profiling results!

This program is pprof. Google developed it to enable “visualization and analysis of profiling data”. Pprof can read profiles and generate reports about them in an easily readable way. For this getting started, we will just use the pprof command-line interface to visualize our profiling statistics (we will see other visualization techniques later in this chapter):

$ go tool pprof gettingstarted profile.pb.gz

We simply invoke go tool pprof with, as first argument, the path to the binary of our program (gettingstarted) and then the profile file (profile.pb.gz). This command will launch the interactive mode. You will have to type commands to display the statistics:
$ go tool pprof gettingstarted profile.pb.gz
File: gettingstarted
Type: cpu
Time: Jan 1, 2021 at 9:27pm (CET)
Duration: 413.54ms, Total samples = 220ms (53.20%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof)

In the standard output, you can see that pprof displays:

- the name of the binary file (here: gettingstarted)
- the type of profile (here: cpu)
- the date on which we generated the profile
- the total duration of program execution (413.54ms)
- the total samples (we will get back to what a sample is later because it can be confusing)
To be able to read profile files, we need to install a program provided by Google: protoc. Why? Because profile files have a specific format: they are protobuf files.

Protocol Buffers were developed internally by Google and then open-sourced. They can transform structured data into a lightweight format that can be stored and transmitted over a network. The process of transforming structured data into a specific format is called serialization. Data serialized (or encoded) with this method is very light. The format returned is called “binary wire”.

Unlike XML or JSON, the fields’ names are not included in the data’s serialized version; therefore, the size of the message is smaller. You need some sort of specification to read a serialized message. The serialization specification is a “proto file”. Proto files have the .proto extension.
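For illustration, here is a minimal, hypothetical proto file (person.proto is not part of pprof; it just shows the idea). On the wire, each field is identified by its number instead of its name:

syntax = "proto3";

package example;

message Person {
  string name = 1;  // field number 1 replaces the name "name" on the wire
  int32 age = 2;    // field number 2
}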
The Google team has developed tools in many languages (C++, C#, Dart, Go, Java, Python, Ruby, Objective-C...) to serialize and deserialize data easily. In the next section, we will use one of these tools to deserialize a profile file.

You first have to download the latest release of the compiler. At the time of writing, the latest version is 3.15.2. I will give you the command to download this specific release in the following lines. Please always download the latest version! To get the latest version number, check the GitHub page https://github.com/protocolbuffers/protobuf/releases.

We will download the already compiled version of the software. Of course, you can build it yourself if you have a C++ compiler installed on your computer, but that’s not the easiest solution.

You can directly download the zipped files from https://github.com/protocolbuffers/protobuf/releases, or you can use those terminal commands:
$ curl -OL https://github.com/protocolbuffers/protobuf/releases/download/v3.15.2/protoc-3.15.2-linux-x86_64.zip

$ curl -OL https://github.com/protocolbuffers/protobuf/releases/download/v3.15.2/protoc-3.15.2-osx-x86_64.zip

For Mac and Linux, we used cURL, a CLI tool to make HTTP requests. We are passing two flags to the command:

- -O will write the downloaded content to a file (named the same way as the file on the server)
- -L will follow redirections, because GitHub stores the releases on an Amazon AWS S3 bucket (a cloud storage service)

For Windows users, the URL to download the release is:

https://github.com/protocolbuffers/protobuf/releases/download/v3.15.2/protoc-3.15.2-win32.zip
For Linux and Mac users, you can use the command line to unzip the files thanks to the unzip utility:

$ cd where/you/downloaded/the/zip/file
$ unzip protoc-3.15.2-osx-x86_64.zip -d protoc-3.15.2

The -d flag will put the inflated files into the specified directory (it will be created if it does not exist).
For Windows users, I advise you to use the graphical interface to unzip the files.

For Linux and Mac users, there is one convenient place where you can put your binaries: /usr/local/bin. The directory /usr/local is used to install software locally. The bin folder holds local binaries. If you are curious about the UNIX filesystem hierarchy, take a look at the specification: http://refspecs.linuxfoundation.org/FHS_2.3/fhs-2.3.html

$ sudo mv protoc-3.15.2/bin/protoc /usr/local/bin

You will need sudo to move the executable protoc-3.15.2/bin/protoc into the /usr/local/bin directory.
For Windows users, you will need to add the binary to the PATH environment variable.
To get a decoded version of the profile file, the first step is to get the .proto file. Encoded protocol buffers do not contain field names, to reduce their size.

We can find the .proto file on the pprof GitHub repository at the following address: https://github.com/google/pprof/blob/master/proto/profile.proto. Go is shipped with a vendored version of pprof that you can find inside the vendor directory of the Go source folders (in src/cmd/vendor/github.com/google/pprof/). However, profile.proto seems not to be included with the vendored version at the time of writing.

One solution is to clone pprof into your src folder:

$ cd /path/to/your/dev/directory
$ git clone https://github.com/google/pprof.git

Then we can use the downloaded proto file. The profile returned is gzipped (it’s compressed). We must first unzip it. To do so, we will use the gunzip command (available for Linux and Mac users):

$ gunzip profile.pb.gz

This will delete the file profile.pb.gz and create a new file named profile.pb (which is the unzipped version of profile.pb.gz). For Windows users, you can use a GUI tool to achieve this.
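If you prefer a cross-platform approach, a small Go program can do the same job. Here is a minimal sketch using the standard library’s compress/gzip package (it assumes the profile is named profile.pb.gz in the current directory; unlike gunzip, it keeps the original file):

package main

import (
    "compress/gzip"
    "io"
    "log"
    "os"
)

func main() {
    // open the compressed profile
    in, err := os.Open("profile.pb.gz")
    if err != nil {
        log.Fatal(err)
    }
    defer in.Close()
    // wrap the file in a gzip reader that decompresses on the fly
    gz, err := gzip.NewReader(in)
    if err != nil {
        log.Fatal(err)
    }
    defer gz.Close()
    // copy the decompressed bytes into profile.pb
    out, err := os.Create("profile.pb")
    if err != nil {
        log.Fatal(err)
    }
    defer out.Close()
    if _, err := io.Copy(out, gz); err != nil {
        log.Fatal(err)
    }
}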
Then we can decode the protocol buffer file:

$ protoc --decode perftools.profiles.Profile /path/to/your/dev/directory/pprof/proto/profile.proto --proto_path /path/to/your/dev/directory/pprof/proto < profile.pb

On my computer:

$ protoc --decode perftools.profiles.Profile /Users/maximilienandile/Documents/DEV/pprof/proto/profile.proto --proto_path /Users/maximilienandile/Documents/DEV/pprof/proto < profile.pb

Let’s detail the elements of that command to make it clear:

--decode: this flag is waiting for a message type. Our profile is a message in the protocol buffer terminology. Each message has a specific type. In our case, our profile message has the type perftools.profiles.Profile. This string does not come from nowhere: if you display the first lines of profile.proto, you will see that it makes sense:

// github.com/google/pprof/blob/master/proto/profile.proto
//...
syntax = "proto3";

package perftools.profiles;

option java_package = "com.google.perftools.profiles";
option java_outer_classname = "ProfileProto";

message Profile {
//...

This is the name of the package followed by the name of the message: perftools.profiles.Profile.
The next argument to the protoc command is the path of the proto file.

--proto_path: this flag indicates to protoc where it can find .proto files. This path has to contain our .proto file; otherwise, protoc will not be able to do its job.

Then we pass the encoded message to protoc with < profile.pb. The data in the profile.pb file will be passed as standard input to the protoc program.
Here is the output of this command:

sample_type {
  type: 1
  unit: 2
}
sample_type {
  type: 3
  unit: 4
}
sample {
  location_id: 1
  location_id: 2
  location_id: 3
  value: 14
  value: 140000000
}
sample {
  location_id: 4
  location_id: 2
  location_id: 3
  value: 7
  value: 70000000
}
sample {
  location_id: 5
  location_id: 2
  location_id: 3
  value: 1
  value: 10000000
}
mapping {
  id: 1
  has_functions: true
}
location {
  id: 1
  mapping_id: 1
  address: 17482189
  line {
    function_id: 1
    line: 24
  }
}
location {
  id: 2
  mapping_id: 1
  address: 17482037
  line {
    function_id: 2
    line: 18
  }
}
location {
  id: 3
  mapping_id: 1
  address: 16946166
  line {
    function_id: 3
    line: 201
  }
}
location {
  id: 4
  mapping_id: 1
  address: 17482182
  line {
    function_id: 1
    line: 24
  }
}
location {
  id: 5
  mapping_id: 1
  address: 17482186
  line {
    function_id: 1
    line: 25
  }
}
function {
  id: 1
  name: 5
  system_name: 5
  filename: 6
}
function {
  id: 2
  name: 7
  system_name: 7
  filename: 6
}
function {
  id: 3
  name: 8
  system_name: 8
  filename: 9
}
string_table: ""
string_table: "samples"
string_table: "count"
string_table: "cpu"
string_table: "nanoseconds"
string_table: "main.doSum"
string_table: "/Users/maximilienandile/go/src/go_book/profiling/gettingstarted/main.go"
string_table: "main.main"
string_table: "runtime.main"
string_table: "/usr/local/go/src/runtime/proc.go"
time_nanos: 1546864261276935000
duration_nanos: 417027943
period_type {
  type: 3
  unit: 4
}
period: 10000000
We can see that this file defines:

- sample types
- samples
- mappings
- locations
- functions
- a "string table"
- the property time_nanos, a period type, and the property period

In the next sections, you will understand the usage of those properties.
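Note that you do not have to go through protoc to inspect a profile programmatically: the pprof project also ships a Go package that parses profiles directly (and handles the gzip decompression for you). A minimal sketch, assuming the github.com/google/pprof/profile package is available in your environment:

package main

import (
    "fmt"
    "log"
    "os"

    "github.com/google/pprof/profile"
)

func main() {
    f, err := os.Open("profile.pb.gz")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()
    // profile.Parse decompresses and decodes the protobuf in one step
    p, err := profile.Parse(f)
    if err != nil {
        log.Fatal(err)
    }
    // print the sample types (e.g., samples/count and cpu/nanoseconds)
    for _, st := range p.SampleType {
        fmt.Printf("sample type: %s/%s\n", st.Type, st.Unit)
    }
    fmt.Printf("number of samples: %d\n", len(p.Sample))
    fmt.Printf("profile duration: %d ns\n", p.DurationNanos)
}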
A stack is a pile of objects. In real life, you can make a stack with wood, with glasses of champagne, or with anything that can be stacked.

In computer science, we stack function calls. When a program executes, it starts with a function. The main function is the first function executed, and then it calls other functions, which call other functions...

When your program runs, the call stack grows. You can get the call stack by using the debug package.
The call stack of a program is an ordered list of currently running functions.

Let’s take an example. In the next listing, you can see a sample application that defines a main function and two other functions, firstFunctionToBeCalled and secondFunctionToBeCalled. The main function calls firstFunctionToBeCalled, which calls secondFunctionToBeCalled. In this last function, we will print the stack.
// profiling/stack/main.go
package main

import "runtime/debug"

func main() {
    firstFunctionToBeCalled()
}

func firstFunctionToBeCalled() {
    secondFunctionToBeCalled()
}

func secondFunctionToBeCalled() {
    debug.PrintStack()
}

The previous program outputs:

$ go run main.go
goroutine 1 [running]:
runtime/debug.Stack(0xc00000e1b0, 0x1, 0x1)
    /usr/local/go/src/runtime/debug/stack.go:24 +0xa7
runtime/debug.PrintStack()
    /usr/local/go/src/runtime/debug/stack.go:16 +0x22
main.secondFunctionToBeCalled()
    /Users/maximilienandile/go/src/go_book/profiling/stack/main.go:14 +0x20
main.firstFunctionToBeCalled()
    /Users/maximilienandile/go/src/go_book/profiling/stack/main.go:10 +0x20
main.main()
    /Users/maximilienandile/go/src/go_book/profiling/stack/main.go:6 +0x20
You can read the stack from end to start (from main.main to runtime/debug.Stack). Everything starts with the main function; then the stack grows.
The call stack is then parsed and used to give sense to the measures made by the profiler.
In this section, I will try to explain what CPU time is. Without a clear understanding of this notion, the next sections will be hard to understand.

The CPU time is the time taken by the Central Processing Unit (CPU) to execute the set of instructions defined in your program. The microprocessor handles those instructions. The more complex your program is and the more intensive its calculations, the more CPU time it needs.

We can split CPU time into two subcategories:

- CPU user time
- CPU system time

Let’s take examples to understand those concepts better.
// profiling/understanding-CPU-profile/main.go
package main

import (
    "fmt"
    "log"
    "os"
    "runtime/pprof"
)

func main() {
    // set CPU profiling (1)
    f, err := os.Create("profile.pb.gz")
    if err != nil {
        log.Fatal(err)
    }
    err = pprof.StartCPUProfile(f)
    if err != nil {
        log.Fatal(err)
    }
    defer pprof.StopCPUProfile()
    // CPU intensive operation (2)
    test := 0
    for i := 0; i < 1000000000; i++ {
        test = i
    }
    fmt.Println(test)
}

In the previous listing, we begin with the standard syntax to start CPU profiling on our program. Then we have a for loop that iterates over the numbers from 0 to 1000000000 (excluded). Inside the for loop, we set the value of the variable test to the value of i (the counter variable).
Let’s build our program and then launch it:

$ go build main.go
$ ./main
999999999

How long does the program take to run? We can relaunch it with the time utility (for Mac and Linux users). The command will execute the program and display statistics:

$ time ./main
real    0m0.629s
user    0m0.518s
sys     0m0.005s

How to interpret this output?

- 0m0.629s: the total time, which is the elapsed time between invocation and program termination
- 0m0.518s corresponds to the user CPU time. This is the time the CPU (processor) was busy executing instructions outside the kernel (in user space).
- 0m0.005s corresponds to the system CPU time. This is the time taken by the CPU to execute commands in kernel space, for instance, system calls (opening a file, for instance).

In figure 1 you can see the repartition between user and system CPU time.

You can see that system calls take only 1 percent of the CPU time; 99% of the CPU time is spent in user space.
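You can also read those two counters from inside a Go program. Here is a minimal sketch using syscall.Getrusage from the standard library (Linux and macOS only; the loop bound is arbitrary, just to burn some user CPU time):

package main

import (
    "fmt"
    "log"
    "syscall"
)

func main() {
    // burn some user CPU time
    sum := 0
    for i := 0; i < 100000000; i++ {
        sum += i
    }
    _ = sum
    // ask the kernel for resource usage statistics of this process
    var ru syscall.Rusage
    if err := syscall.Getrusage(syscall.RUSAGE_SELF, &ru); err != nil {
        log.Fatal(err)
    }
    fmt.Printf("user CPU time: %ds %dµs\n", ru.Utime.Sec, ru.Utime.Usec)
    fmt.Printf("system CPU time: %ds %dµs\n", ru.Stime.Sec, ru.Stime.Usec)
}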
To see how we can change those percentages, we can this time write a program that performs a lot of system calls. By chance, we have at our disposal the syscall package:

// profiling/system-call/main.go
package main

import (
    "fmt"
    "io/ioutil"
    "log"
    "syscall"
    "time"
)

func main() {
    for i := 0; i < 100; i++ {
        fileName := fmt.Sprintf("/tmp/file%d", time.Now().UnixNano())
        err := ioutil.WriteFile(fileName, []byte("test"), 0644)
        if err != nil {
            log.Fatal(err)
        }
        err = syscall.Chmod(fileName, 0444)
        if err != nil {
            log.Fatal(err)
        }
    }
}
Here we have a for loop that will create 100 files. We build the path of each file with fmt.Sprintf. Each path is /tmp/fileXXX, where XXX is the number of nanoseconds elapsed since the UNIX epoch (00:00:00, Thursday, 1 January 1970).
Then we create the file with the help of ioutil.WriteFile. This util will perform two syscalls (Open and Write). Then, once the file is created, we change its mode to 0444 with the help of the system call Chmod.
This is a lot of system calls! Let’s see what happens to our time statistics:

$ go build main.go
$ time ./main

real    0m0.031s
user    0m0.004s
sys     0m0.023s

The user line corresponds to only 0.004 seconds, whereas the system line corresponds to 0.023 seconds. In figure 2 you can see the repartition visually. CPU time for system calls now represents 85.2% of total CPU time!

In this section, we used the term “kernel”. The kernel is the central component of an operating system. It manages the system resources and the different hardware components of the computer. When we make a system call, we use the kernel’s facilities. For instance, opening a file in Go will trigger a system call that the kernel will handle.
You have seen in the decoded profile that a profile consists of numerous samples. The notion of sample can be confusing, so I want to make it clear.

A sample is a measurement made at a certain time during the profiling process. When we profile a program, we collect measurements, and those measurements are materialized in the profile by samples. In figure 3 you can see that a profile is composed of samples that contain:

- a measure,
- a location, and
- optional additional information.

The measure is not the same in a CPU profile and in a memory profile.

When you activate CPU profiling, your program is stopped roughly 100 times per second, i.e., every 10 milliseconds (figure 4). This interval is the period of 10000000 nanoseconds we saw in the decoded profile.
Each time the profiler stops the program:

- it collects data,
- the data is parsed and the measure is extracted,
- a sample is created.

The data collected consists of a call stack. As you have seen in the last section, the trace gives information about the functions called. The CPU time is also measured for each sample.
The pprof tool allows us to visualize the samples as they appear in the profile. To get this data, simply enter the following command in your terminal (you will enter the pprof interactive mode):

$ go tool pprof main profile.pb.gz
File: main
Type: cpu
Time: Jan 13, 2019 at 7:07pm (CET)
Duration: 720.43ms, Total samples = 470ms (65.24%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof)

Then just type “traces”:

(pprof) traces
File: main
Type: cpu
Time: Jan 13, 2019 at 7:07pm (CET)
Duration: 720.43ms, Total samples = 470ms (65.24%)
-----------+-------------------------------------------------------
     240ms   main.main
             runtime.main
-----------+-------------------------------------------------------
     190ms   main.main
             runtime.main
-----------+-------------------------------------------------------
      30ms   runtime.nanotime
             runtime.sysmon
             runtime.mstart1
             runtime.mstart
-----------+-------------------------------------------------------
      10ms   main.main
             runtime.main
-----------+-------------------------------------------------------
The profile samples are written in text mode. You can see that in our case, we have four samples. The first column contains the samples’ values, the CPU time. The second column contains the trace.

We can see that our profile took 720.43ms to build and that the total samples account for 470ms (representing 65.24% of 720.43ms). The term duration can be confusing: this is not the program duration but the profile duration. In our case, the two figures are close but not equal.

The total execution time of the program is smaller than the profile duration. That’s because profiling starts after the program begins, and it’s stopped when the main function exits.

The first sample in the table has a value of 240ms of CPU time. The call stack when the data was collected has a length of 2: the first function called is runtime.main, and the second one is main.main. The values are not cumulated in this view.
In this section, we will focus on the pprof command line. As you can see in figure 5, the pprof command line is composed of 4 elements:

- the output format for the visualization of the results (ex: pdf, SVG, web...)
- the options that can refine the visualization (for instance, the -show flag will display only nodes matching a regular expression)
- the path to the program binary
- the path to the profile source (which is the protocol buffer file)

At the time of writing, there are 24 output formats available. We will not cover every single format. The most common are -web, which generates an SVG graph and displays it in the web browser, and -svg and -pdf, which allow you to share your profiles with your colleagues easily. In the next section, we will see some common use cases.
You will need to install Graphviz (https://gitlab.com/graphviz/graphviz) on your machine to run the following command.

Graphviz is available on Homebrew for macOS users: brew install graphviz
For Linux users: apt-get install graphviz
For Windows users, installation instructions are on the official website.
For Mac users, you might first have to define the default application for opening SVG files. On my computer, for instance, the default app was Latex. If you are in this kind of situation, you can change the default application for SVG files to your favorite browser. Right-click on an SVG file and then click on “Get Info”. The view opened lets you choose the default application. Select the one you want and then validate the modification.

To generate an SVG file from your profile and open it in your web browser, you can use the -web flag:

go tool pprof -web gettingstarted profile.pb

This will create an SVG file and open your browser to display it. The graph in figure 6 will appear.
Let’s detail what is on this graph (take a look at figure 7).

You can see in figure 7 that the SVG produced is composed of two main parts.

The first part (a rectangular box) holds the details about the profile:

- the name of the binary profiled
- the type of profile
- the time at which the profile was generated
- the time taken by the profile (duration)
- an indication of the total samples (which is here reported as a duration, which can be confusing)
- an indication of the representativity of the current view, labeled "Showing nodes accounting for X, Y% of Z", where:
  - X is the quantity that the displayed nodes account for,
  - Z is the total samples available in the profile,
  - Y is the percentage X/Z.

Then on this image, you can see blocks called nodes, and arrows. If we use the correct terminology, we have a directed graph (which is also acyclic, meaning that it does not contain cycles). Each node in this graph represents a function call. In the figure, we detail what each element of a node means.

On each node, you get the package’s information, the function called, the sample’s value, and its share of the profile’s total samples. This last statistic is very important: with it, you can compare each node’s weight and see where the performance drop is happening.
The profile information generated can be inspected in a nice web interface. To launch it, type:

$ go tool pprof -http localhost:9898 yourBinaryName profile.pb

It will launch a web server on http://localhost:9898, where you can see the statistics.

The weblist view (figure 8) is an interesting one. It allows you to see your application’s source code alongside the profiling statistics. In addition to that, you get the disassembled version of the source code profiled!
When you compile your Go program, you get what is called an “executable”: a file that your computer can execute. An executable is written in machine language. Machine language is not impossible to read, but it’s really difficult. A disassembler is a program that can transform machine language into assembly language. Assembly language is not Go code; it’s a set of instructions that are very close to the architecture of the computer (processor) and to its operating system.

By visualizing the assembly code produced, you can better understand what is going on under the hood. Assembly is not very popular among the developer community: in the Stack Overflow developer survey of 2018, only 7.8% of our colleagues say that they use it, and for “professional” developers the figure is even lower: 6.8%.
Code optimization is the process of “modifying a software system to make some aspect of it work more efficiently or use fewer resources.”
The main objective of code optimization is to reduce the execution time of a program.
To optimize code, you need to modify it, and sometimes you will need to make important modifications to gain performance. Chang, Mahlke, and Hwu, in a paper published in 1991, made an interesting distinction between two types of code optimization:

- The first one is the modification of a portion of code that reduces this particular portion’s execution time without modifying the execution time of any other instructions in the program. For instance, deleting some dead code will reduce the execution time of the portion you have optimized; this modification will not impact the other instructions.
- The second one is the optimization of a portion of code that will increase the execution time of other portions of code. For example, imagine that you have a for loop, and inside this loop, you make a computation that does not need to be repeated. You can extract this computation outside the loop. If you do so, you will reduce the loop’s execution time, but you will increase the execution time outside the loop.

Code optimization is mostly acquired with experience, but there are some common optimization techniques that you should know by heart!
Introducing dead code is not easy in Go. The Go compiler will refuse to compile if you declare a variable without using it. That’s a very good thing. But it does not mean that you cannot introduce some dead code fragments.

Here is an example:

// profiling/classic-opti/deadCode/main.go
package main

import "fmt"

func main() {
    fmt.Println(multiply(1, 9))
}

func multiply(x, y int) int {
    a := x + 1
    b := a + 2*y
    b = b * b
    return x * y
}

Here we have the function multiply that takes two integers (x and y). This function multiplies both integers and returns the result.

In the body of this function, we define a and b. Those two variables are useless; they are maybe the heritage of an old computation that was done before. Do not be sentimental: delete those useless lines.

// profiling/classic-opti/deadCodeV2/main.go
//...
func multiply(x, y int) int {
    return x * y
}
This optimization technique is very easy to put into practice. If you are confronted with new code, examine the loops and their exit conditions closely.

Loops are often used to iterate over an array or slice to find a specific element when you don’t know its index. Search for useless iterations: when you have found what you are looking for, there is no need to continue the loop!

Let’s take an example. Imagine that your job is to find an element in a slice and report its index. You might produce this program:

// profiling/classic-opti/loopExit/main.go
//...

// s is a slice of uint32
var search uint32 = 4285020338
var foundIndex int
var found bool
for j := 0; j < len(s); j++ {
    if s[j] == search {
        found = true
        foundIndex = j
    }
}
if found {
    fmt.Printf("found index is %d", foundIndex)
} else {
    fmt.Println("index not found")
}

In this listing, we begin with the definition of 3 variables: search (which holds the number we are trying to locate in the slice s), foundIndex (which will hold the index of the retrieved element), and found (a boolean that is true if the number has been found).
Now take a close look at the for loop: we iterate over all the elements of the slice. At each iteration, we check if the element located at the current index is equal to the integer we are looking for. If so, we set the variables found and foundIndex to their expected values.
What is the interest in continuing the loop once we have found the element? We would make some additional comparisons that cost time. We can exit the loop directly when the element is found:

// profiling/classic-opti/loopExitV2/main.go
//...

for j := 0; j < len(s); j++ {
    if s[j] == search {
        found = true
        foundIndex = j
        break
    }
}

We use the break statement, which ends the for loop.
The next optimization technique consists of moving invariant instructions outside the loops. Those instructions have a source operand that does not change within the loop.
To make this clear, we will take an example. The program we have to create must compute the total turnover of a chain of stores. The stores transmit their turnover figures to the holding. For each store, costs have to be deducted from the turnover figure transmitted:

// profiling/classic-opti/loopInvariant/main.go
// ...
var turnover uint32
for i := 0; i < len(s); i++ {
    var costs = 950722 + 12*uint32(len(s))
    turnover = turnover + s[i] - costs
}
fmt.Println(turnover)

We can note that inside the for loop, we define the costs variable at each iteration. This variable is set with the result of a computation:

var costs = 950722 + 12*uint32(len(s))

The cost computation is the same for each store; it does not depend on the turnover of the stores. The source operand (950722 + 12*uint32(len(s))) does not change; it’s invariant. We can extract this computation outside the for loop. The program will execute it only one time, and therefore we will spare some CPU time:

// profiling/classic-opti/loopInvariantV2/main.go
//...

var turnover uint32
var costs = 950722 + 12*uint32(len(s))
for i := 0; i < len(s); i++ {
    turnover = turnover + s[i] - costs
}
fmt.Println(turnover)
If your program is composed of two loops (or more) that:

- are executed under the same conditions,
- are independent (the execution of one loop does not depend on the execution of the other one),
- have the same number of iterations,

then we can merge those loops into one single loop:

// profiling/classic-opti/loopFusion/main.go
//...

var costs float64
for i := 0; i < len(s); i++ {
    costs += 0.2 * float64(s[i])
}
var turnover uint32
for i := 0; i < len(s); i++ {
    turnover += s[i]
}

Those two loops have the same number of iterations, they are independent, and they are both executed. We can merge them:

// profiling/classic-opti/loopFusionV2/main.go
//...

var costs float64
var turnover uint32
for i := 0; i < len(s); i++ {
    costs += 0.2 * float64(s[i])
    turnover += s[i]
}

We will spare CPU time (instead of looping over the slice two times, we loop through the values of s one single time).
Constant folding is a compiler optimization technique that consists of evaluating the values of constants during the compilation of the program and not during its execution.

The compiler already does this for us, but it’s useful to remember! If your program defines a variable to hold the result of an operation, think about the opportunity to create a constant instead. The operation will be performed during compilation, and therefore you will spare precious CPU time.

Let’s take the example of an upload function that is responsible for transferring files from one server to another:

// profiling/classic-opti/constantFolding/main.go
// ...

func upload(files []File) error {
    uploadLimit := 10 * 2048 / 2
    for _, file := range files {
        if file.Size > uploadLimit {
            return errors.New("the upload limit has been reached")
        }
        // upload the file
    }
    return nil
}

Here the variable uploadLimit is set to 10*2048/2.
The problem is that each time we upload files to the server, we compute uploadLimit again, even though it never changes. We can replace it with a constant to make Go compute it when the program is compiled:

// profiling/classic-opti/constantFoldingV2/main.go
// ....

const uploadLimit = 10 * 2048 / 2

func upload(files []File) error {
    for _, file := range files {
        if file.Size > uploadLimit {
            return errors.New("the upload limit has been reached")
        }
        // upload the file
    }
    return nil
}
We have talked a lot about profiling in the previous sections; it’s time to apply what we have learned to a real-world optimization problem.

Imagine you are a developer in a large worldwide hotel group. Your company has 30.000 hotels around the world. You are asked to build a program to compute the global group turnover. Each hotel sends a report to the central office at the end of each month. The report is a JSON file. Here is an example:

{
  "name": "The Ultra Big Luxury Hotel",
  "reservations": [
    {
      "id": 0,
      "duration": 1,
      "rateId": 0
    },
    {
      "id": 1,
      "duration": 5,
      "rateId": 1
    }
  ],
  "rates": [
    {
      "id": 0,
      "price": 300
    },
    {
      "id": 1,
      "price": 244
    }
  ]
}

This is the report of “The Ultra Big Luxury Hotel”. The hotel has two reservations. Each reservation has an internal id, a duration expressed in nights, and a rateId. Each hotel maintains its own rate list. Reservation number 0 has a duration of 1 night with rateId 0, which means 300$ per night. The second reservation has a duration of 5 nights at a rate of 244$/night.

The reports are aggregated into a big JSON file. We have to develop a program that will read the JSON file and then compute the whole group’s total turnover.
// profiling/program-optimization/cmd/main.go
// ...

type Reporting struct {
    HotelReportings []HotelReporting
    ExchangeRates   []ExchangeRate
}

type HotelReporting struct {
    HotelId      int
    HotelName    string
    Reservations []Reservation
    Rates        []Rate
}

type ExchangeRate struct {
    RateUSD float64
    HotelId int
}

type Reservation struct {
    Id       uint
    Duration uint
    RateId   uint
}

type Rate struct {
    Id    uint
    Price uint
}

We first create our struct types to unmarshal the JSON file:

data, err := ioutil.ReadFile("/Users/maximilienandile/go/src/go_book/profiling/programOptimization/cmd/output.json")
if err != nil {
    panic(err)
}
reportings := make([]Reporting, 0)
err = json.Unmarshal(data, &reportings)
if err != nil {
    panic(err)
}

We then load the file and unmarshal the JSON into reportings, which is a slice of Reporting values. We will then implement the main logic:

var groupTurnover float64
for _, hotelReport := range r.HotelReportings {
    turnover, err := getTurnover(hotelReport, r.ExchangeRates)
    if err != nil {
        panic(err)
    }
    groupTurnover += turnover
}
fmt.Printf("Group Turnover %f\n", groupTurnover)

We iterate over the reportings, and for each reporting, we call the function getTurnover, which computes the turnover for a particular hotel report:

func getTurnover(reporting HotelReporting, exchangeRates []ExchangeRate) (float64, error) {
    turnoverPerReservation := []float64{}
    for _, reservation := range reporting.Reservations {
        ratePerNight, err := getRatePerNight(reporting.Rates, reservation.RateId)
        if err != nil {
            panic(err)
        }
        xr, err := getExchangeRate(reporting.HotelId, exchangeRates)
        if err != nil {
            panic(err)
        }
        turnoverResaUSD := float64(ratePerNight*reservation.Duration) * xr

        turnoverPerReservation = append(turnoverPerReservation, turnoverResaUSD)
    }
    return computeTotalTurnover(turnoverPerReservation), nil
}

func computeTotalTurnover(turnoverPerReservation []float64) float64 {
    var totalTurnover float64
    for _, t := range turnoverPerReservation {
        totalTurnover += t
    }
    return totalTurnover
}

func getRatePerNight(rates []Rate, rateId uint) (uint, error) {
    var found bool
    var price uint
    for _, rate := range rates {
        if rate.Id == rateId {
            found = true
            price = rate.Price
        }
    }
    if found {
        return price, nil
    } else {
        return 0, errors.New("Impossible to retrieve rate per night")
    }
}

func getExchangeRate(hotelId int, exchangeRates []ExchangeRate) (float64, error) {
    var found bool
    var rate float64
    for _, xr := range exchangeRates {
        if xr.HotelId == hotelId {
            found = true
            rate = xr.RateUSD
        }
    }
    if found {
        return rate, nil
    } else {
        return 0, errors.New("Impossible to retrieve exchange rate")
    }
}
In the previous listing, we have our three main functions that compute the turnover:

- The getTurnover function creates a slice holding the turnover of each reservation of the hotel. To get this turnover, the function needs to retrieve the rate per night applied by the hotel (it calls getRatePerNight) and the hotel currency exchange rate to convert the amount to USD (getExchangeRate).
- Then it multiplies the number of nights (duration) by the rate to get the amount of money generated by the reservation. This amount is in local money; the program multiplies it by the USD exchange rate.
- The next step is to sum up the amounts that are in the turnoverPerReservation slice. To do so, we call computeTotalTurnover.

To test our program, we have generated fake data:

- 5.000 hotels
- 70 rates per hotel
- 1.000 reservations per hotel
Let’s build it:

$ go build main.go

And then we can run it (with the time utility):

$ time ./main --rates 70 --hotels 5000 --resas 1000
real    0m27.659s
user    0m27.423s
sys     0m0.124s

The program took 27.659s to execute (data sample generation time is included).
When the program has run, the profile has been written to the gzipped file named profile.pb.gz. We can launch pprof to see the profiling data:

$ go tool pprof main profile.pb.gz
File: main
Type: cpu
Time: Jan 15, 2019 at 1:42pm (CET)
Duration: 27.18s, Total samples = 25.72s (94.63%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof)

The last command will open the interactive mode. You can see that the samples collected cover 94.63% of the profile duration. We will begin by displaying the top 10 functions in terms of CPU time.

(pprof) top
Showing nodes accounting for 25.58s, 99.46% of 25.72s total
Dropped 25 nodes (cum <= 0.13s)
Showing top 10 nodes out of 23
      flat  flat%   sum%        cum   cum%
    16.87s 65.59% 65.59%     16.87s 65.59%  main.getExchangeRate
     6.15s 23.91% 89.50%      6.15s 23.91%  runtime.memclrNoHeapPointers
     2.03s  7.89% 97.40%      2.03s  7.89%  runtime.nanotime
     0.24s  0.93% 98.33%      0.24s  0.93%  main.getRatePerNight
     0.21s  0.82% 99.14%      0.21s  0.82%  runtime.(*mspan).init (inline)
     0.07s  0.27% 99.42%     23.62s 91.84%  main.getTurnover
     0.01s 0.039% 99.46%      6.44s 25.04%  runtime.growslice
         0     0% 99.46%     23.63s 91.87%  main.main
         0     0% 99.46%      6.42s 24.96%  runtime.(*mcache).nextFree
         0     0% 99.46%      6.42s 24.96%  runtime.(*mcache).nextFree.func1
In figure 9 you can see explanations about the output of the command. Keep in mind the meaning of flat and cum.

The flat time of a function is the sum of the CPU time spent executing the function itself during the profiling of the program.

Cum stands for cumulative. This is the sum of the CPU time taken by the execution of the function plus the time the function spends waiting for the functions it calls to return.
Let’s take an example. We can display all the traces that contain the function main.getTurnover with the following command (using the pprof interactive mode):

(pprof) traces focus main.getTurnover

In the following listing, you can see an extract of 3 of the traces displayed:

-----------+-------------------------------------------------------
     1.58s   runtime.memmove
             runtime.growslice
             main.getTurnover
             main.main
             runtime.main
-----------+-------------------------------------------------------
      70ms   runtime.memclrNoHeapPointers
             runtime.growslice
             main.getTurnover
             main.main
             runtime.main
-----------+-------------------------------------------------------
      20ms   main.getTurnover
             main.main
             runtime.main
The first trace accounts for 1.58s.

The function main.getTurnover is in the third position (counting from the bottom of the call stack). This trace says that the main function (main.main) called main.getTurnover, which called runtime.growslice... In this sample, getTurnover is waiting for runtime.growslice to return (which is itself waiting for runtime.memmove to return). We add this duration to the cumulative duration.
In the second call stack, getTurnover is also present, again in the third position. We also add this duration to the cumulative duration.
In the third call stack, our function is at the top of the stack. We add this duration to the flat duration. Let’s do some math:
If only those three call stacks were used to generate our profile:

- the flat time of the function main.getTurnover would be equal to 20ms,
- whereas the cumulative time of main.getTurnover (which also includes the flat time) would be equal to 1.58s + 70ms + 20ms = 1.58s + 0.07s + 0.02s = 1.67s.

If we analyze all the profile traces, we can make the same observations.
For this first round, we will focus on the function main.getExchangeRate, which totals 16.87s of the CPU time recorded in the profile. It represents 65.59% of the cumulated CPU time. We have to improve this function:

func getExchangeRate(hotelId int, exchangeRates []ExchangeRate) (float64, error) {
    var found bool
    var rate float64
    for _, xr := range exchangeRates {
        if xr.HotelId == hotelId {
            found = true
            rate = xr.RateUSD
        }
    }
    if found {
        return rate, nil
    } else {
        return 0, errors.New("Impossible to retrieve exchange rate")
    }
}
This function iterates over a slice named exchangeRates, which contains the currency exchange rate of each hotel.
Each hotel has one exchange rate. In our sample, there are 5.000 hotels. Each time this function is called, the for loop iterates over the 5.000 rates to find the right one.
Imagine that we are looking for the exchange rate of the hotel with id 42. If the slice elements are ordered by hotelId, at the 42nd iteration we are done; the exchange rate has been found.
But our function will keep iterating over the slice. That’s 5,000-42=4,958 useless iterations.
We can stop the iteration when we have found our exchange rate:

if xr.HotelId == hotelId {
    found = true
    rate = xr.RateUSD
    break
}
Here the keyword break exits the for loop, and execution continues after the closing bracket of the for.

Let’s run the program again to see if this has improved our performance:

real    0m7.690s
user    0m7.498s
sys     0m0.085s

Impressive! With this simple break statement, we have gained 27-7=20 seconds!
Let’s see the effect on the profile statistics:

Showing nodes accounting for 6.70s, 99.41% of 6.74s total
Dropped 5 nodes (cum <= 0.03s)
Showing top 10 nodes out of 25
      flat  flat%   sum%        cum   cum%
     4.29s 63.65% 63.65%      4.29s 63.65%  main.getExchangeRate
     1.48s 21.96% 85.61%      1.48s 21.96%  runtime.memmove
     0.57s  8.46% 94.07%      0.57s  8.46%  runtime.nanotime
     0.18s  2.67% 96.74%      0.18s  2.67%  main.getRatePerNight
     0.10s  1.48% 98.22%      0.10s  1.48%  runtime.(*mspan).init (inline)
     0.06s  0.89% 99.11%      0.06s  0.89%  runtime.memclrNoHeapPointers
     0.02s   0.3% 99.41%      6.15s 91.25%  main.getTurnover
         0     0% 99.41%      6.15s 91.25%  main.main
         0     0% 99.41%      0.18s  2.67%  runtime.(*mcache).nextFree
         0     0% 99.41%      0.18s  2.67%  runtime.(*mcache).nextFree.func1

The function main.getExchangeRate is still at the top of the CPU profile, but its flat execution time is now just 4.29s, compared to 16.87s in the previous version; this is an improvement!
The CPU profile shows that main.getExchangeRate is still one of the biggest consumers of CPU time. Let’s see in which context it’s used:
func getTurnover(reporting HotelReporting, exchangeRates []ExchangeRate) (float64, error) {
    turnoverPerReservation := []float64{}
    for _, reservation := range reporting.Reservations {
        //..
        xr, err := getExchangeRate(reporting.HotelId, exchangeRates)
        //...
    }
    //...
}

You can see that the function getTurnover retrieves the turnover of a particular hotel.
It loops over all the hotel reservations, and for each reservation, it retrieves the hotel exchange rate. That’s not very optimal, because the exchange rate does not vary: it is defined per hotel and is the same for every reservation of that hotel.
In this code snippet, we have a perfect example of a loop invariant: the variable xr will always be the same, but we keep retrieving it at each loop iteration.
We can simply extract the call to getExchangeRate from the for loop and place it before the loop begins:

func getTurnover(reporting HotelReporting, exchangeRates []ExchangeRate) (float64, error) {
    turnoverPerReservation := []float64{}
    xr, err := getExchangeRate(reporting.HotelId, exchangeRates)
    for _, reservation := range reporting.Reservations {
        //..
        //...
    }
    //...
}
Let’s see the impact of this simple modification on our program’s performance:

$ go build main.go
$ time ./main --rates 70 --hotels 20000 --resas 1000

The result is impressive:

real    0m1.090s
user    0m0.847s
sys     0m0.072s

The program now takes 1.090s to execute! Let’s see the results of profiling:
$ go tool pprof main profile_1547640425.pb.gz
File: main
Type: cpu
Time: Jan 16, 2019 at 1:07pm (CET)
Duration: 614.36ms, Total samples = 400ms (65.11%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof)

First, we note that the samples account for only 400ms (approximately four samples). This is perfectly normal because our program now has an execution time of only about one second. The profiler has extracted only four samples, which account for 65.11% of the profile duration.

We want to improve that. The solution is to generate more sample data. We will multiply the number of hotels by 4 (now 20.000) to increase the number of samples. I have a simple script that generates the fake data; I intentionally hide it so you can concentrate on just the code optimization.
Again we build the program, and we launch it to get the profile:

$ go tool pprof main profile_1547641211.pb.gz
File: main
Type: cpu
Time: Jan 16, 2019 at 1:20pm (CET)
Duration: 2.06s, Total samples = 1.86s (90.28%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof)

The samples now account for 90.28% of the profile duration! The sum of sample time is now 1.86 seconds, and the number of samples is now 24! The data of our profile is more accurate. Let’s see the top view:

(pprof) top
Showing nodes accounting for 1.83s, 98.39% of 1.86s total
Showing top 10 nodes out of 45
      flat  flat%   sum%        cum   cum%
     1.48s 79.57% 79.57%      1.48s 79.57%  runtime.memmove
     0.12s  6.45% 86.02%      0.12s  6.45%  main.getRatePerNight
     0.06s  3.23% 89.25%      0.06s  3.23%  runtime.(*mspan).refillAllocCache
     0.05s  2.69% 91.94%      0.05s  2.69%  runtime.memclrNoHeapPointers
     0.04s  2.15% 94.09%      0.04s  2.15%  runtime.(*mspan).init (inline)
     0.02s  1.08% 95.16%      0.02s  1.08%  main.getExchangeRate
     0.02s  1.08% 96.24%      1.81s 97.31%  main.getTurnover
     0.02s  1.08% 97.31%      0.02s  1.08%  runtime.mmap
     0.01s  0.54% 97.85%      0.01s  0.54%  main.computeTotalTurnover
     0.01s  0.54% 98.39%      0.01s  0.54%  runtime.findrunnable
In the previous top view, we can note that the function main.getRatePerNight is the second top consumer of CPU time.
Let’s analyze this function:

func getRatePerNight(rates []Rate, rateId uint) (uint, error) {
    var found bool
    var price uint
    for _, rate := range rates {
        if rate.Id == rateId {
            found = true
            price = rate.Price
        }
    }
    if found {
        return price, nil
    } else {
        return 0, errors.New("Impossible to retrieve rate per night")
    }
}
Here we are searching for the rate per night.
For each hotel, the rates are transmitted in the report.
Each rate has an id, and each reservation has a rateId, allowing us to retrieve the rate applied to the reservation.
The function consists of a for loop that iterates over all the hotel rates to find the right one.
There is a first optimization that we can make: we have to stop the loop when we find the rate (see the section about loop exit conditions above).
We just add a break statement:

if rate.Id == rateId {
    found = true
    price = rate.Price
    break
}
With this improvement, the program now takes 3.088s to execute (compared to 3.097s before).

We can also improve the search itself.

Let n denote the number of rates for a hotel. In the worst case, we have to perform n iterations to find the rate.

We can use a map to improve the lookup time. That will require an important modification of the code. First, we need to create a function to generate a map of rates (rateId → price).

The map will associate a rateId with a price:

func RateMap(reporting HotelReporting) map[uint]uint {
    m := make(map[uint]uint)
    for _, rate := range reporting.Rates {
        m[rate.Id] = rate.Price
    }
    return m
}
The previous function takes an HotelReporting as argument and returns a map (keys and values are unsigned integers). We iterate over the hotel’s rates and fill the map progressively.
Next, we have to modify the function getRatePerNight. It will now look into the map to retrieve the price per night:

func getRatePerNight(rateMap map[uint]uint, rateId uint) (uint, error) {
    price, found := rateMap[rateId]
    if found {
        return price, nil
    } else {
        return 0, errors.New("Impossible to retrieve rate per night")
    }
}
The next modification consists of calling this map generator inside our main function. For each hotel report, we build the map and then pass it to the getTurnover function:

func main() {
    //...
    for _, hotelReport := range r.HotelReportings {
        rateMap := RateMap(hotelReport)
        turnover, err := getTurnover(hotelReport, r.ExchangeRates, rateMap)
        if err != nil {
            panic(err)
        }
        groupTurnover += turnover
    }
    fmt.Printf("Group Turnover %f\n", groupTurnover)
}

By the way, the function getTurnover also needs to be modified: it must accept a third parameter, the rate map:

func getTurnover(reporting HotelReporting, exchangeRates []ExchangeRate, rateMap map[uint]uint) (float64, error) {
    //...
    for _, reservation := range reporting.Reservations {
        ratePerNight, err := getRatePerNight(rateMap, reservation.RateId)
        //...
    }
    //...
}
We have finished. We can now compile, profile, and time the execution:

$ go build main.go
$ time ./main --rates 70 --hotels 20000 --resas 1000
real    0m3.242s
user    0m2.774s
sys     0m0.321s

That’s unexpected! The total execution time is bigger than with the previous version: we have multiplied it by 3. What’s wrong? This drop in performance is explained by the fact that creating a map generates an overhead (the runtime has to compute the hashes and store the key-value pairs).
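To see both sides of this trade-off in isolation, we can write a small benchmark. The following sketch is hypothetical (it reuses the Rate type from our program and assumes a rate_test.go file in the same package); it compares the worst-case linear scan, the map lookup alone, and the cost of building the map itself:

package main

import "testing"

const nRates = 10000

var rates = makeRates()

func makeRates() []Rate {
    r := make([]Rate, nRates)
    for i := range r {
        r[i] = Rate{Id: uint(i), Price: uint(i)}
    }
    return r
}

// BenchmarkSliceScan measures the worst case: the rate we want is the last one.
func BenchmarkSliceScan(b *testing.B) {
    for i := 0; i < b.N; i++ {
        for _, r := range rates {
            if r.Id == nRates-1 {
                _ = r.Price
                break
            }
        }
    }
}

// BenchmarkMapLookup measures a single lookup, excluding map construction.
func BenchmarkMapLookup(b *testing.B) {
    m := make(map[uint]uint, nRates)
    for _, r := range rates {
        m[r.Id] = r.Price
    }
    b.ResetTimer() // do not count the map construction above
    for i := 0; i < b.N; i++ {
        _ = m[nRates-1]
    }
}

// BenchmarkMapBuild measures only the overhead of building the map.
func BenchmarkMapBuild(b *testing.B) {
    for i := 0; i < b.N; i++ {
        m := make(map[uint]uint, nRates)
        for _, r := range rates {
            m[r.Id] = r.Price
        }
    }
}

Run with go test -bench=. — the map build cost has to be amortized over many lookups before the map version wins, which is exactly what the following measurements show.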
| Number of rates      | 70     | 70       | 600    | 600      | 10000    |
|----------------------|--------|----------|--------|----------|----------|
| Program version      | v2     | v3 (map) | v2     | v3 (map) | v3 (map) |
| Total Execution time | 1.090s | 3.242s   | 7.698s | 4.209s   | 26.628s  |
| CPU User time        | 0.847s | 2.774s   | 7.493s | 3.705s   | 23.218s  |
We have only 70 rates in our data sample. Let’s test our program again with 600 rates. You can see in table 1 that when the number of rates increases, the relative performance of v3 improves.

Let’s take a look at the top consumers of CPU time with pprof (for 10.000 rates):

Showing nodes accounting for 12.02s, 94.50% of 12.72s total
Dropped 51 nodes (cum <= 0.06s)
Showing top 10 nodes out of 36
      flat  flat%   sum%        cum   cum%
     4.51s 35.46% 35.46%      9.71s 76.34%  runtime.mapassign_fast64
     4.19s 32.94% 68.40%      4.24s 33.33%  runtime.(*hmap).newoverflow
     1.53s 12.03% 80.42%      1.53s 12.03%  runtime.memclrNoHeapPointers
     0.57s  4.48% 84.91%      0.57s  4.48%  runtime.aeshash64
     0.54s  4.25% 89.15%     11.86s 93.24%  main.RateMap
     0.21s  1.65% 90.80%      0.26s  2.04%  runtime.mapaccess2_fast64
     0.13s  1.02% 91.82%      0.62s  4.87%  main.getTurnover
     0.13s  1.02% 92.85%      0.13s  1.02%  runtime.nanotime
     0.11s  0.86% 93.71%      0.14s  1.10%  runtime.overLoadFactor (inline)
     0.10s  0.79% 94.50%      0.10s  0.79%  main.getExchangeRate
You can see in the pprof top view that the function runtime.mapassign_fast64 takes much of the total CPU time (76.34% of cumulative time). It means that filling the map with key-value pairs is CPU intensive, but with this number of rates, we have still spared a lot of CPU time by using the map.
If you take a close look at the code of our program, you can see that we have a useless function: computeTotalTurnover. This function takes a slice of float64 and returns the sum of the elements that are in that slice:

func getTurnover(reporting HotelReporting, exchangeRates []ExchangeRate, rateMap map[uint]uint) (float64, error) {

    turnoverPerReservation := []float64{}
    // is this slice necessary?
    //... for loop
    return computeTotalTurnover(turnoverPerReservation), nil
    // this function call is useless
}
We do not need a function to do that! The function getTurnover can be updated to the following version, which removes both the intermediate slice and the call to computeTotalTurnover:

func getTurnover(reporting HotelReporting, exchangeRates []ExchangeRate, rateMap map[uint]uint) (float64, error) {
    var totalTurnover float64
    xr, err := getExchangeRate(reporting.HotelId, exchangeRates)
    if err != nil {
        panic(err)
    }
    for _, reservation := range reporting.Reservations {
        ratePerNight, err := getRatePerNight(rateMap, reservation.RateId)
        if err != nil {
            panic(err)
        }
        totalTurnover += float64(ratePerNight*reservation.Duration) * xr
    }
    return totalTurnover, nil
}
Here we start by defining the variable totalTurnover (of type float64), which holds the turnover of a whole hotel report. Instead of putting the partial turnover of each reservation into a slice and summing the slice at the end, we just add each turnover to totalTurnover.
Let’s run this version:

$ go build main.go
$ time ./main --rates 600 --hotels 20000 --resas 1000 --profileName 600rates
real    4.14s
user    3.58s
sys     0.44s

This small improvement has a small (but not negligible) impact on performance: version 3 of the program needed 3.705 seconds of CPU user time, and the new version takes just 3.58s.
|                      | v0      | v1      | v2     | v3    | v4    |
|----------------------|---------|---------|--------|-------|-------|
| Total Execution time | 463.21s | 136.60s | 16.51s | 4.79s | 4.57s |
| CPU User time        | 460.93s | 135.12s | 15.83s | 4.13s | 3.91s |
| CPU System time      | 1.92s   | 1.02s   | 0.57s  | 0.51s | 0.55s |

In table 2, I have detailed the execution time and the CPU time of each version of the program.
Our efforts paid off, because each round of optimization improved the execution time and the CPU time. From the initial version to the final version (which we can still improve), the CPU user time has been divided by approximately 115.
\n\nWe can see in the “top” table that most of the time is spent on the function runtime.memmove
. If we take a look at the source code of the Go runtime, this function is used to move bytes of memory from one location to another location. You can also note that in the top CPU usage, other functions of the runtime that deal with memory are called : runtime.(*mspan).refillAllocCache
and runtime.memclrNoHeapPointers
.
We can infer that we have messed up with memory somewhere in our program... We have not yet done a memory profiling, but we can detect a problem with memory consumption.
\nLet’s see our three functions headers :
\nfunc getTurnover(reporting HotelReporting) (uint, error)\n\nfunc computeTotalTurnover(turnoverPerReservation []uint) uint\n\nfunc getRatePerNight(rates []Rate, rateId uint) (uint, error)
Here we pass variables by value and not by reference: the data is copied every time one of these functions is called. A simple optimization is to pass large variables by reference, i.e., to pass a pointer.
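Here is a minimal, self-contained sketch of the difference between the two calling styles. The HotelReporting type below is a deliberately oversized, hypothetical version, not the book’s actual definition :

package main

import "fmt"

// Hypothetical type: large enough that copying it is measurable.
type HotelReporting struct {
    Reservations [4096]uint
}

// byValue receives a full copy of the struct (~32 KB copied per call on 64-bit).
func byValue(reporting HotelReporting) uint {
    return reporting.Reservations[0]
}

// byPointer receives only an 8-byte pointer; the struct is not copied.
func byPointer(reporting *HotelReporting) uint {
    return reporting.Reservations[0]
}

func main() {
    r := HotelReporting{}
    fmt.Println(byValue(r))
    fmt.Println(byPointer(&r))
}

Note that slices are already cheap to pass: a slice value is only a small header (pointer, length, capacity). The copy cost comes mainly from large structs and arrays passed by value.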
A heap profile allows the developer to understand the program’s memory usage precisely. A heap profile collects data about the allocation and deallocation points in your program. The profiler collects the stack trace of your running program, along with precise memory statistics, at a specific sampling rate.
By the way, you can also collect the current memory statistics at any point in your program by calling :
memStats := new(runtime.MemStats)
runtime.ReadMemStats(memStats)
fmt.Printf("cumulative count of heap objects allocated: %d\n", memStats.Mallocs)
//...
We will have to change our application’s code to record the memory consumption. We will add the following lines to ask the pprof package to save a memory profile of our application :
f, err := os.Create("mem_profile.pb.gz")
if err != nil {
    log.Fatal(err)
}
runtime.GC() // get up-to-date statistics
if err := pprof.WriteHeapProfile(f); err != nil {
    log.Fatal("memory profile cannot be gathered", err)
}
defer f.Close()
The first thing to do is to create a file that will hold our profile data. We use the convenient os.Create function.
Then we ask the runtime to run a garbage collection (this call blocks execution until the collection is completed). Why? Because we want our memory statistics to be as precise as possible. The garbage collector will get rid of memory that was previously allocated but is no longer used; therefore, this unused memory will not appear in our statistics.
Then we call the function WriteHeapProfile, which writes the profile to the provided file.
The memory profile generated by the pprof package records the live memory allocations “as of the most recently completed garbage collection”. If the runtime ran the garbage collection process three times during the program, the data reported in the profile would concern the allocations still live after the third (most recent) run.
If there is no garbage collection during the program’s execution, the profile will contain data about all the allocations.
To disable the GC, you can add this line (from the standard runtime/debug package) at the beginning of your Go program :
debug.SetGCPercent(-1)
Like CPU profiling, memory profiling is sampled at a certain rate. By default, the runtime records one sample per 512 KB of allocated memory; this rate is controlled by the runtime.MemProfileRate variable.
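If you need finer-grained (but more costly) sampling, you can lower the rate. A minimal sketch, assuming you set it before the allocations you care about happen (ideally in an init function) :

import "runtime"

func init() {
    // Default is 512 * 1024: one sample per 512 KB allocated.
    // A smaller value samples more allocations, at a higher runtime cost.
    runtime.MemProfileRate = 64 * 1024
}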
Let’s build our modified program and launch it :
$ go build main.go
$ ./main --rates 600 --hotels 20000 --resas 1000
Our memory profile has been written to disk. We can now analyze our profile with the pprof command line :
$ go tool pprof main mem_profile.pb.gz
File: main
Type: inuse_space
Time: Jan 21, 2019 at 8:49pm (CET)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 791.52MB, 100% of 791.52MB total
      flat  flat%   sum%        cum   cum%
  791.52MB   100%   100%   791.52MB   100%  main.data
         0     0%   100%   791.52MB   100%  main.main
         0     0%   100%   791.52MB   100%  runtime.main
First, we can note that 791.52MB of allocations were profiled.
By entering the top command, we normally see the top 10 nodes with the biggest size in our profile. Here we have only three nodes, and a single one, main.data, accounts for all the allocated space (that explains the 100%).
Then we can note that the type of the profile is inuse_space. This type of statistic is displayed by default, but it is not the only memory indicator available :
inuse_space : memory space that has been allocated but not freed yet. This is the memory in use at the time of sampling. To get the profile in that mode, type the following command :

$ go tool pprof -sample_index=inuse_space main mem_profile.pb.gz

alloc_space : memory space that is currently allocated + memory space that has already been deallocated.

$ go tool pprof -sample_index=alloc_space main mem_profile.pb.gz

inuse_objects : the number of objects that are allocated but not freed yet.

$ go tool pprof -sample_index=inuse_objects main mem_profile.pb.gz

alloc_objects : a counter of the objects currently allocated + all the objects that have already been freed.

$ go tool pprof -sample_index=alloc_objects main mem_profile.pb.gz
Keep in mind that those statistics are all stored in the same profile. There are not four types of memory profiles but just one, called the heap profile.
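Note that instead of relaunching pprof with a different -sample_index flag, you should also be able to switch indexes from inside the interactive mode, using pprof’s option=value syntax (a small example; the output depends on your profile) :

(pprof) sample_index=alloc_space
(pprof) top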
Each type of statistic gives a different kind of information :
- The inuse statistics give a view of the memory in use at a certain point in time during the program execution.
- The alloc statistics are useful to see which parts of the program allocate the most memory, because they also take into account the memory that has since been freed by the garbage collector (see the sketch after this list).
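To illustrate the difference, here is a contrived sketch (not taken from the book’s program): a function that allocates a lot of temporary memory but retains almost none of it. It would rank high in alloc_space yet be nearly invisible in inuse_space, because the temporary slices are collected by the GC :

// churn allocates ~80 MB in total but keeps nothing alive:
// big alloc_space footprint, negligible inuse_space footprint.
func churn() int {
    total := 0
    for i := 0; i < 1000; i++ {
        tmp := make([]int, 10000) // allocated, then dropped at each iteration
        total += len(tmp)
    }
    return total
}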
What we can see in the first profile we generated is that 100% of the in-use memory is allocated in the function main.data. This function was built to generate the data we use in our program, so that’s not a surprise. Let’s look at the alloc_space statistics :
$ go tool pprof -sample_index=alloc_space main mem_profile.pb.gz
File: main
Type: alloc_space
Time: Jan 21, 2019 at 9:10pm (CET)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 1.53GB, 100% of 1.53GB total
      flat  flat%   sum%        cum   cum%
    1.53GB   100%   100%     1.53GB   100%  main.data
         0     0%   100%     1.53GB   100%  main.main
         0     0%   100%     1.53GB   100%  runtime.main
This profile gives information about the parts of our program that allocate the most memory.
\n\nTo explore your memory profile, use the following command :
$ go tool pprof -http localhost:9898 mem_profile.pb.gz
It will open an interactive web UI in your browser.
Pprof is a great tool to understand how your programs work and how they perform.
With profiling, you can detect performance issues.
However, keep in mind that a clear and understandable program is always easier to maintain than an obscure, ultra-performing one.
What is the name of the package that generates profiles?
What are protocol buffers?
Which program can you use to decode a message serialized with protocol buffers?
What is the “call stack”?
In the output of the UNIX time command, what is the difference between “user” and “sys” time?
Give one optimization technique to reduce CPU usage.
What is the name of the package that generates profiles?
The runtime/pprof package.
What are protocol buffers?
\nThis is a method used to serialize structured data.
Serialized data is smaller.
Which program can you use to decode a message serialized with protocol buffers?
protoc
You can download it here: https://github.com/protocolbuffers/protobuf/releases
What is the “call stack”?
The call stack of a program is the list of currently running functions. At the base of the stack, you will find the main function from package main (the program entry point).
In the output of the UNIX time command, what is the difference between “user” and “sys” time?
user: this corresponds to the amount of time the CPU was busy executing instructions outside the kernel (user space).
sys: this corresponds to the time taken by the CPU to execute instructions in kernel space, for instance, system calls.
Give one optimization technique to reduce CPU usage.
For instance, loop invariant code removal: move computations that do not change between iterations out of the loop (see the other techniques listed in the key takeaways below).
Profiling = collecting detailed statistics about how a program runs.
Profiling is different from benchmarking: a benchmark measures the performance of a single function, whereas a profile covers the whole program.
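As a reminder, a minimal benchmark sketch (using the doSum function from the beginning of this chapter) lives in a _test.go file and measures that single function only :

package main

import "testing"

// Run with: go test -bench=.
func BenchmarkDoSum(b *testing.B) {
    for i := 0; i < b.N; i++ {
        doSum()
    }
}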
To create a CPU profile, add these lines at the beginning of your main function :

f, err := os.Create("profile.pb.gz")
if err != nil {
    log.Fatal(err)
}
if err := pprof.StartCPUProfile(f); err != nil {
    log.Fatal(err)
}
defer pprof.StopCPUProfile()
“Protocol buffers” is a method used to serialize structured data.
It can transform structured data into a lightweight format that can be stored and transmitted over a network.
The profile files generated by Go are serialized using Protocol Buffers.
The call stack is an ordered list of the active functions of a program. The call stack grows and shrinks as functions are called and return.
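A quick way to see the call stack for yourself, sketched with the runtime/debug package from the standard library :

package main

import "runtime/debug"

func inner() {
    // Prints the current goroutine's stack: inner <- outer <- main.
    debug.PrintStack()
}

func outer() { inner() }

func main() { outer() }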
CPU time represents the time used by the Central Processing Unit (CPU) to execute the set of instructions defined in your program.
The execution time of a program can be measured with the time command on UNIX systems.
To display a profile in your web browser, you can use the following command :

$ go tool pprof -web yourBinary profile.pb
A sample is a measurement made at a certain point in time during the profiling process.
Some common CPU usage optimization techniques are :
- Dead code removal: delete code that is never executed.
- Loop exit optimization: exit the loop as soon as possible.
- Loop invariant code removal: move out of the loop any instructions that do not depend on the loop’s local variables.
- Loop fusion: if two loops iterate over the same collection, you might want to merge them.
- Constant precomputation: when an operation’s result never varies, compute it once and store it in a constant.
Two of these techniques are sketched right after this list.
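Here is a minimal sketch (a contrived example, not taken from the book’s program) illustrating loop invariant code removal and loop fusion :

// Before: rate*0.2 is recomputed at every iteration, and the same slice
// is traversed twice.
func turnoverBefore(prices []float64, rate float64) (total, taxed float64) {
    for _, p := range prices {
        total += p
    }
    for _, p := range prices {
        taxed += p * (rate * 0.2)
    }
    return total, taxed
}

// After: the invariant expression is hoisted out of the loop
// (loop invariant code removal) and the two loops are merged (loop fusion).
func turnoverAfter(prices []float64, rate float64) (total, taxed float64) {
    tax := rate * 0.2 // computed once
    for _, p := range prices {
        total += p
        taxed += p * tax
    }
    return total, taxed
}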
To create a memory profile, use the following code :

f, err := os.Create("mem_profile.pb.gz")
if err != nil {
    log.Fatal(err)
}
runtime.GC() // get up-to-date statistics
if err := pprof.WriteHeapProfile(f); err != nil {
    log.Fatal("memory profile cannot be gathered", err)
}
defer f.Close()
To explore a profile with the interactive web UI :

$ go tool pprof -http localhost:9898 profile.pb.gz
1. https://github.com/google/pprof