Programming for Performance

Working on games means programming, but it is a special kind of programming. A game you know will be used for competitive play is in some ways like a Formula One race car: everything you do needs to consider performance. Even when it isn’t competitive, you may know the game is going to do amazing things and must perform well. I’m going to cover a few key items on performance.

Generally you shouldn’t bother.

First, there is a famous quote from a pioneer in computer science that applies here:

Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified.

That was written by Donald Knuth back in 1974. While he was applying the rule to one specific area of programming it is sage advice that still applies.

In his case he was talking about removing some work from inside a loop, which mattered given the slow speed of computers at the time. Back then computer speeds were measured in kilohertz, 4-bit processors were common, and programmers generally had advanced degrees in mathematics or engineering. If you read the code in the article you might think “is that really a big change?”, but he was reducing a loop from about 300 milliseconds per iteration to about 200 milliseconds per iteration, so that is indeed a big difference.

We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.

This line became a catchphrase for several decades. It is painful for many people to read, but it is important to understand. Generally the code you work with every day is not a performance concern. You should be smart about what you write, since bad code can ruin performance, but in the general case, if you write code that is defect-free and follows typical coding standards, it will run just fine.

Generally we don’t need to bother.  Generally we can go on our merry ways, blissfully ignoring anything about performance.

Some time ago I wrote up an overview of what happens inside the CPU, which you may want to read if you aren’t familiar with it. Modern processors are amazing. Each processor core (or virtual core on hyper-threaded processors) can decode up to four instructions at once. Internally there are execution units that can retire three or more micro-operations per cycle. Generally the biggest problem is not the time required to run an individual computation. The most common problem is making sure there is data to work with.

And that leads us to the first big feature that is part of that all-important 3%.

Yet we should not pass up our opportunities in that critical 3%.

As much as we don’t need to worry about performance, that doesn’t mean we should intentionally make things slow. The rule also isn’t universal. Knuth’s off-the-cuff number of 3% feels about right to me in practice. Usually most of the code isn’t a bottleneck. Usually you find just a few places in the code that are painfully slow, and a few more that cause performance to drop in unacceptable ways.

What follows are some simple patterns you can follow to help with those critical parts.

Keep your CPU Well Fed

The first one of those rules is to make sure you take advantage of something modern computer processors do extremely well.

First, let’s get a sense of scale. I’m going to talk about things like nanoseconds, but those are hard to picture. Modern processors compute values really fast. Years ago a teacher, Grace Hopper, liked to show people how short a nanosecond is. Most people are familiar with light and are aware of how fast it travels. Imagine a length of about 12 inches, or about 30 centimeters; for many people that is roughly the distance from the elbow to the hand. That distance is how far light can travel in one nanosecond.

A single CPU core generally runs about four cycles per nanosecond and can decode up to four instructions per cycle, although it is often limited to one or two, so in practice the number is closer to 8-10 decoded instructions per nanosecond. Most computers have two or four or even more CPU cores inside them, and some server CPUs contain 24 cores. On such a system, in the time a single photon of light travels from your elbow to your wrist, the CPU could have decoded a couple of hundred operations. Or, if the programs are stalled, those cores could be sitting around idle waiting for data: just waiting, accomplishing no work, with terrible performance, while that same photon travels on far past your fingertips.

The first goal, then, is to make sure the CPU has something to do and doesn’t spend time waiting around for work.

Everything you do in computer programming requires two things: instructions and data. Both of these are kept in memory. Unfortunately memory is much slower than the CPU. Even though computers get faster every year, CPU speeds usually improve a little faster than memory speeds. Today that means that even though a single CPU cycle takes about 0.25 nanoseconds, a request to get something from main memory takes about 120 nanoseconds. Think about the scale for a moment. While the CPU could be doing about 10 things in the time light travels the length of your forearm, instead it sits idle while that light travels the length of a building. If each CPU cycle were one second long, the CPU would sit there for roughly eight minutes waiting for data instead of moving on to the next instruction. Then the next instruction would run, it would need data, and again the CPU would stall. Without a cache, since most instructions require data, the CPU would spend nearly all of its time waiting around rather than doing useful work.
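If you want to see where those numbers come from, here is a tiny back-of-the-envelope sketch using the rough figures above (0.25 nanoseconds per cycle, about 120 nanoseconds per main memory access). The latencies are the approximations used in this article, not measurements of any particular CPU.

    #include <cstdio>

    int main() {
        constexpr double cycle_ns       = 0.25;   // roughly a 4 GHz core
        constexpr double main_memory_ns = 120.0;  // rough main memory latency
        constexpr double stall_cycles   = main_memory_ns / cycle_ns;  // 480 cycles

        std::printf("A main memory access costs about %.0f cycles.\n", stall_cycles);
        std::printf("At one second per cycle that is %.0f seconds, or about %.0f minutes.\n",
                    stall_cycles, stall_cycles / 60.0);
    }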

Starting in the 1980s CPU designers incorporated faster memory called caches. This high speed memory is expensive, so caches are typically small. Here are the different cache levels and their approximate speeds:

  • L1 cache is the fastest and smallest. Today’s CPUs typically have 64 kilobytes of L1 cache. Usually 32 kilobytes is for instructions, and 32 kilobytes is for data.
    • Access time for one “line” (64 bytes) is generally less than one nanosecond, or about 3-4 clock cycles.
    • Generally programs can access any additional byte on the line for zero (0) additional clock cycles.  The first byte costs some time, the remaining 63 are FREE and FAST.
  • L2 cache is a little slower and a little larger, and today’s CPUs typically have 256 kilobytes per CPU core.
    • Access time is about 3 nanoseconds, or about 10 clock cycles.
    • Cache lines are currently also 64 bytes. If you use one of the bytes or all 64 bytes, the cost is the same.
  • L3 cache is slower and larger still, and ranges from not being present on cheaper or slightly older computers, to 4-16 megabytes on high end desktop computers, up through 50+ megabytes on high-end server chips.
    • Access time is about 12 nanoseconds, or about 50 clock cycles.
    • Blocks are generally 64 bytes.
  • Main memory is commonly 2 gigabytes to 32 gigabytes; these are the long modules usually installed in pairs on your motherboard.
    • Access times are around 100 to 150 nanoseconds for most systems, but can be much slower on cheap memory.

Those are the places data can live. The CPU is constantly working to predict what data the program will need next so it can prefetch that data into the cache. Because of the way the journey through the CPU pipeline works, when data is available on an L1 cache line there is a small delay the first time it is used, but using adjacent data is (practically) free. So if code is reading four 32-bit values and they’re all on the same L1 cache line, it takes almost exactly the same amount of time to use one of those values as it does to use all four; you pay the price once and get the following three for nearly free.
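As a tiny illustration of that last point, here is a sketch of four values that share a single 64-byte cache line. The struct and the names are made up for the example; alignas(64) just makes sure the four floats land on one line.

    // Four floats, 16 bytes total, comfortably inside one 64-byte cache line.
    struct alignas(64) Position {
        float x, y, z, w;
    };

    float sum_components(const Position& p) {
        // Loading p.x pays for the cache line; p.y, p.z, and p.w are then
        // already sitting in L1 and are effectively free to read.
        return p.x + p.y + p.z + p.w;
    }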

That should make it clear that there is a big performance benefit to keeping your data in the closest cache available. The farther the data is from the CPU, the longer the wait.

Instructions are pretty easy to predict. Most instructions do one thing and then advance to the next instruction. Some instructions cause a jump to a new location; the CPU can figure out where the jump will go and start prefetching that location. Data is sometimes harder to predict. If the CPU can see that the data lives at a specific address it can prefetch it immediately. If it can see that the data sits at a fixed offset from an address it already knows, it can prefetch that too. Sometimes, though, the CPU needs to work out more complex memory access patterns, and today’s CPUs are quite good at that type of prediction as well.

For code that usually means avoiding branches. Thankfully your compiler and optimizer do a really good job of rewriting your code behind your back to reduce branches or to turn them into easy-to-predict jumps. You can help your compiler out here by keeping your code simple and following simple paths. Once upon a time it was common for programmers to use jump tables and dynamic return addresses to speed up code, or somewhat complicated tools like Duff’s Device. Don’t do that. Keep it simple and linear, keep it easy to predict.
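As one small, hypothetical example of “simple and easy to predict”: if a decision does not change inside a loop, make it once outside the loop so the hot path stays a plain linear walk. The names here are invented for illustration.

    #include <vector>

    void scale_values(std::vector<float>& values, float factor, bool use_filter) {
        // use_filter never changes inside the loop, so test it once out here
        // instead of re-checking it on every iteration of the hot loop.
        if (use_filter) {
            for (float& v : values) {
                if (v > 0.0f) v *= factor;   // one simple, data-dependent branch
            }
        } else {
            for (float& v : values) {
                v *= factor;                 // no branch at all in the hot loop
            }
        }
    }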

For data, that usually means accessing values sequentially. Walking across an array of integers or an array of floating point values is very easy for the CPU to predict. As you progress along the array the CPU will observe the pattern and start prefetching values immediately. Walking across an array with a stride is a little harder for the CPU to predict unless the stride happens to fall on specific boundaries, such as 8-byte or 32-byte or 64-byte steps.

Unfortunately those strides mean you cannot reuse adjacent values, and remember from above that adjacent values are nearly free. Many programmers will build a structure with values, such as a structure with X, Y, and Z, and then build an array of those structures. As processing takes place, and assuming the total object size is 64 bytes, the prefetch system sees accesses at base+48, base+52, and base+56 for the first item, then base+112, base+116, and base+120 for the second item, then base+176, base+180, and base+184, and so on. The prefetcher is smart enough to figure out what to fetch, but the data accesses can still be made faster. Instead of an array of structures, programmers should recognize the pattern themselves and build systems that traverse a structure of arrays. That is, have an array of X values, an array of Y values, and an array of Z values, and work on all the arrays at once. During processing the accesses look like baseX+0, baseY+0, and baseZ+0; then baseX+4, baseY+4, baseZ+4; then baseX+8, baseY+8, baseZ+8, and so on. Now the items are accessed contiguously, and you get benefits both from prefetching and from being adjacent in the cache.
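Here is a minimal sketch of the two layouts, with the names invented for the example; the point is the memory access pattern, not the particular API.

    #include <cstddef>
    #include <vector>

    // Array of structures: x, y, and z for one item are adjacent, but consecutive
    // items are separated by the rest of the object.
    struct ParticleAoS {
        float x, y, z;
        // ... other fields pushing the object toward the 64-byte size used above
    };

    // Structure of arrays: all the x values are contiguous, as are y and z.
    struct ParticlesSoA {
        std::vector<float> x, y, z;
    };

    void advance(ParticlesSoA& p, const ParticlesSoA& velocity, float dt) {
        // Each array is walked sequentially, so the prefetcher sees three simple
        // streams and every cache line that gets fetched is fully used.
        for (std::size_t i = 0; i < p.x.size(); ++i) {
            p.x[i] += velocity.x[i] * dt;
            p.y[i] += velocity.y[i] * dt;
            p.z[i] += velocity.z[i] * dt;
        }
    }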

Even better, when you are traversing a structure of arrays the compiler will sometimes recognize the pattern as something that can run on SIMD instructions (Single Instruction, Multiple Data) and provide even more speedup behind your back. Even when that doesn’t happen automatically, advanced programmers can use SIMD instructions to work on several values with a single instruction, sometimes 2 values, or 4, or 8, or even 32 operations done at once, all pulled from a single cache line. That’s among the biggest benefits you can get. Not only does it keep the CPU fed with data, it consumes and processes the data quickly as well.
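For example, here is a hedged sketch using SSE intrinsics (x86 only, and the function name is invented) that advances one of the structure-of-arrays streams four floats at a time. Each intrinsic touches four values at once, and because the arrays are contiguous every load pulls in a full cache line of useful data.

    #include <immintrin.h>
    #include <cstddef>

    void advance_x_sse(float* x, const float* vx, float dt, std::size_t count) {
        const __m128 dt4 = _mm_set1_ps(dt);        // dt broadcast to four lanes
        std::size_t i = 0;
        for (; i + 4 <= count; i += 4) {
            __m128 pos = _mm_loadu_ps(x + i);      // four positions
            __m128 vel = _mm_loadu_ps(vx + i);     // four velocities
            pos = _mm_add_ps(pos, _mm_mul_ps(vel, dt4));
            _mm_storeu_ps(x + i, pos);             // four results stored at once
        }
        for (; i < count; ++i) x[i] += vx[i] * dt; // scalar tail for leftovers
    }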

Back to the analogy: with one CPU core, and one nanosecond being the length of your forearm, that’s 128 operations in the time light travels from your elbow to your wrist. On an 8-core system that’s 1024 operations over the same distance. That’s blazing fast performance, acquired simply by lining your data up sequentially.

On the flip side, jumping around in data generally produces a lot of cache misses. It doesn’t always, since you can craft algorithms that jump around in predictable patterns or that repeatedly follow a reliable path. But most of the time, if you are hopping around through a data structure doing a binary search or something similar, every hop will miss the cache, so every step adds an additional delay. This can have a big impact on performance. In computer science classes everyone is taught that a linear search takes roughly O(n) time and a binary search takes O(log n) time, but people tend to forget that those are asymptotic results, describing what happens when the number of items is so big that it dwarfs all other effects like cache speed. For small amounts of data, when you are working with thousands or even tens of thousands of adjacent values, the cache misses are far more expensive than traversing the array. Let’s consider an array with 1000 items in it. A binary search will find the item in about 10 hops. If each of those hops misses the near caches and has to go out to L3, that is roughly 500 cycles, or around 120 nanoseconds. In contrast, if the data were searched linearly using SIMD instructions to get roughly 4 tests per cycle, the entire array of 1000 items could be scanned in about 250 cycles, or about 65 nanoseconds.
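If you want to see how this plays out on your own machine, here is a rough benchmarking sketch comparing std::lower_bound (a binary search) against a plain linear scan. The sizes, the repetition count, and the timing approach are illustrative only, and the real results depend heavily on whether the data is already sitting in cache, so measure rather than trust the numbers above.

    #include <algorithm>
    #include <chrono>
    #include <cstdio>
    #include <vector>

    int main() {
        std::vector<int> data(1000);
        for (int i = 0; i < 1000; ++i) data[i] = i * 2;   // sorted values

        const int needle = 997;     // a value that is not in the array
        long checksum = 0;          // keeps the searches from being optimized away

        auto time_it = [&](auto&& search) {
            auto start = std::chrono::steady_clock::now();
            for (int rep = 0; rep < 100000; ++rep) checksum += search();
            auto stop = std::chrono::steady_clock::now();
            return std::chrono::duration<double, std::nano>(stop - start).count() / 100000.0;
        };

        double binary_ns = time_it([&] {
            return std::lower_bound(data.begin(), data.end(), needle) - data.begin();
        });
        double linear_ns = time_it([&] {
            return std::find(data.begin(), data.end(), needle) - data.begin();
        });

        std::printf("binary: %.1f ns per search, linear: %.1f ns per search (checksum %ld)\n",
                    binary_ns, linear_ns, checksum);
    }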

You don’t need to make big changes for this one, but when in doubt choose simple linear structures and linear access if you want fast code. This isn’t a premature optimization; it is the basic step of selecting an algorithm that takes advantage of the fundamental design of today’s hardware.

Profile before, profile after

So you’ve built your code and you still think it is slow. You might guess at where the code is slow, but the real answer is generally surprising. The quote at the top of the article covers this nicely:

A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified.

This is where the scientific discipline in programmers becomes critically important.  We do not guess at where performance is suffering, we use tools to measure the code, to figure out exactly what is slow and why, to correct the code, and then measure again to ensure the issue is addressed.

The tool to create those measurements is called a profiler. I’ve been thinking about writing a few examples of using various profilers, but that will need to wait for another post. There are two main kinds of profiler: sampling profilers and instrumenting profilers. Each is useful for different measurements.

A sampling profiler runs alongside your program, hooked up to the process using the debugging system. Every few microseconds the profiler pauses the program, looks at the debug information to see where the program currently is, makes a note of the location, and resumes the program. Running for a few seconds allows the sampling profiler to peek into the program several thousand times. The developer can then stop profiling and look at the statistics. Usually you will discover that one or two functions show up far more frequently than all the others, which generally means a lot of time is being spent there and that code needs to be addressed.

An instrumenting profiler is generally a library of code that is compiled into your program. A side effect is that instrumenting profilers can affect the performance results. However, an instrumenting profiler reveals far more information, such as how many times each function was called, which child functions it called, and how much time was spent in each one. As above, when you look at the profile results a few functions will generally show up as consuming inordinate amounts of time. The instrumenting profiler can also show you functions being called far more often than necessary, or reveal code paths you did not realize existed.
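To make the idea concrete, here is a minimal sketch of what “instrumentation” means: a scope-based timer compiled into the build that records how long a block of code took. Real instrumenting profilers automate this and also track call counts and child calls; the names here are invented for the example.

    #include <chrono>
    #include <cstdio>

    class ScopedTimer {
    public:
        explicit ScopedTimer(const char* label)
            : label_(label), start_(std::chrono::steady_clock::now()) {}
        ~ScopedTimer() {
            auto stop = std::chrono::steady_clock::now();
            double ms = std::chrono::duration<double, std::milli>(stop - start_).count();
            std::printf("%s took %.3f ms\n", label_, ms);   // report on scope exit
        }
    private:
        const char* label_;
        std::chrono::steady_clock::time_point start_;
    };

    void update_world() {
        ScopedTimer timer("update_world");   // destructor prints the elapsed time
        // ... game update work ...
    }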

The first few times code is profiled there are generally some surprising realizations. In one code base, I discovered that a library was calling strlen() to calculate the length of a few strings millions of times each graphics frame. The programmer had made the simple mistake of placing strlen() inside a loop as various strings were processed, calling it again on every pass. Each string was a few hundred characters long and there were a few thousand strings, resulting in millions of calls to the function each frame. In another code base we discovered that std::vector allocations were taking place thousands of times every frame; programmers were appending to arrays but had forgotten to reserve space in advance. Instead of a single memory allocation up front followed by a series of insertions, each value being added could trigger the array to grow, which meant a new buffer was allocated and all the old values copied over. In another code base we discovered that one of the logging utilities was incredibly slow every time the code created a log entry; instead of working in the background, the logger was misconfigured to wait until the data was written to disk before returning control to the program.
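The first two of those mistakes look roughly like this; the function names are invented, but the patterns and the fixes are the point.

    #include <cstddef>
    #include <cstring>
    #include <vector>

    void process_slow(const char* text, std::vector<int>& out) {
        // strlen() is re-evaluated on every pass through the loop, and push_back()
        // keeps regrowing the vector because no space was reserved up front.
        for (std::size_t i = 0; i < std::strlen(text); ++i)
            out.push_back(text[i]);
    }

    void process_fast(const char* text, std::vector<int>& out) {
        const std::size_t len = std::strlen(text);   // measure the string once
        out.reserve(out.size() + len);               // one allocation up front
        for (std::size_t i = 0; i < len; ++i)
            out.push_back(text[i]);
    }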

Usually the first time you profile the code these errors are quite obvious and easy to correct. Measure to find the worst offenders, make a change you think will correct the problem, and measure again to ensure the change worked. Normally the code will be much faster and you can submit the change to the repository with a comment like “now 97% faster”.

Once those easy issues are resolved the profiler gets harder to read.

Soon you will start looking over the instrumenting profiler results, hunting for suspicious functions. You’ll look for searches and processing that fit within the budget but still take longer than you think they should. You’ll find loops that run a reasonable number of times, but wonder if they could be reduced. You’ll find functions that are called for utility reasons, but question whether they are really useful. This profiling gets much more difficult. Whenever you make a change, remember to measure before, measure after, and be certain you actually made an improvement.

As your code matures and has been profiled by several developers, you’ll find that the slowness isn’t caused by programmer error; instead it is caused by using an algorithm that is less efficient than an alternative. This is where the 3% from the quote kicks in. It is also an area where experience and research history are extremely valuable. The developer will be reviewing the profiler results along with the code, and recognize that the task can be solved with a different algorithm. Usually these better-performing algorithms are more difficult to implement or rely on more complicated data structures, but they produce results more efficiently. Perhaps an algorithm has 700 nanoseconds of additional overhead but saves 50 nanoseconds off each pass through a loop, so over a full frame the total compute time is reduced by 12%. The only way to know for certain that it actually improved the situation is to measure before and measure after, so be certain you figure out exactly how much more awesome the new code is by measuring it. If you forget to measure, your code may actually be slower; there have been many times I’ve seen algorithms replaced in the name of being faster, only for profiling to reveal they are slower. When I see someone submitting a change for optimization reasons, either it had better be obvious how it improves things, or I expect the before and after timings to be included in the change summary.

Also as a good practice, if you use a fancy algorithm or something that isn’t immediately obvious, it is best to add a comment in the code about it. Perhaps something like: // this uses the Quake3 inverse square root method documented at {website} for much faster computation. For technical papers from conferences or other presentations it can make sense to include the document describing the algorithm in the same place in the source tree, so future programmers don’t need to hunt for it and hope the paper hasn’t moved somewhere else on the Internet. Make sure the paper is freely available for that kind of use before you include it in your source tree. Having a description of the algorithm on hand helps when someone comes along to fix a bug, or when they don’t understand the algorithm and need to know what the code is doing, or when they are trying to see if a more performant method is available.

Profile before, profile after, and include the speedup difference in the checkin notes and comments.

Framerate in Frames Per Second is nearly useless for developers

This one isn’t so much a performance concern as a measurement habit, but it is something I see frequently online and feel it is important for developers to change.

Having a bright yellow number showing your frames per second isn’t particularly useful if you’re looking for performance metrics. You can have code that averages 60 or 75 or even 120 frames per second and still looks like a slide show.

Far more useful is a triple set of numbers:  minimum frame time, average frame time, maximum frame time.  You can throw the frames per second in as well, if you want. Usually these are given in milliseconds. Also, usually these are updated on the display once per second so they aren’t in constant flux.

Then you can look at the numbers and know far more about your program.

The values 12.002 min / 14.351 avg / 15.234 max tell you that your fastest frame took about 12 milliseconds, your slowest frame took about 15 milliseconds, and the average is about 14 milliseconds. At that rate you’ve missed the 75 frames per second of faster refresh monitors and you are getting close to missing the 60 frames per second of the most common monitors. It means you need to start hunting for performance improvements, but that you’re basically stable.

The values 7.032 min / 14.351 avg / 29.245 max tell you that your game’s performance is erratic. You’ve got the same average frame time as the example above, but several of your frames take far too long to display, and others return unexpectedly fast. Unbalanced numbers like these mean you need to look for erratic processing. You are doing different amounts of work each frame, so you’ll need to find work that can be spread out more evenly over time, and hunt for those frames that take too long. If you are shooting for smooth performance on 75 Hz or 120 Hz monitors you’ve got a long way to go to ensure no frames take too long.

Finally, when people have really high frame rates, seeing something like 2513 frames per second, and then they make a change and it drops by several hundred frames per second, that is not an issue. If that same code were reporting the triple described above they’d see they’re taking less than one millisecond per frame, and that they’ve jumped from about 0.4 milliseconds to about 0.5 milliseconds. Both numbers are far below anything you need to worry about.
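If you want the triple on screen, here is a minimal sketch of a frame-time tracker that reports min/avg/max once per second. The class name and structure are illustrative, not taken from any particular engine; feed it the length of each frame in milliseconds.

    #include <algorithm>
    #include <cstdio>
    #include <limits>

    class FrameStats {
    public:
        void add_frame(double frame_ms) {
            min_ms_ = std::min(min_ms_, frame_ms);
            max_ms_ = std::max(max_ms_, frame_ms);
            total_ms_ += frame_ms;
            ++frames_;
            if (total_ms_ >= 1000.0) {          // report roughly once per second
                std::printf("%.3f min / %.3f avg / %.3f max (ms)\n",
                            min_ms_, total_ms_ / frames_, max_ms_);
                *this = FrameStats{};           // reset for the next second
            }
        }
    private:
        double min_ms_   = std::numeric_limits<double>::max();
        double max_ms_   = 0.0;
        double total_ms_ = 0.0;
        int    frames_   = 0;
    };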

That’s going to be it for this session. If you’ve got additional ideas on what to cover in more depth or other comments, drop a comment or send me a message.
