{"id":320,"date":"2016-09-24T00:00:44","date_gmt":"2016-09-24T00:00:44","guid":{"rendered":"http:\/\/BryanWagstaff.com\/?p=320"},"modified":"2016-10-06T06:14:56","modified_gmt":"2016-10-06T06:14:56","slug":"programming-for-performance","status":"publish","type":"post","link":"https:\/\/bryanwagstaff.com\/index.php\/programming-for-performance\/","title":{"rendered":"Programming for Performance"},"content":{"rendered":"<div class=\"entry\">\n<p>Working with games\u00a0means programming, but it is also a special kind of programming. \u00a0Working on a game you know will be used for competitive play is in\u00a0some ways like working on a Formula One race car: everything you do needs to consider performance. Even on other projects, you may know that the game is going to do amazing things and must perform well. I&#8217;m going to cover a few key items on performance.<\/p>\n<p><!--more--><\/p>\n<h1>Generally you shouldn&#8217;t bother.<\/h1>\n<p>First, there is a famous quote from a pioneer in computer science that applies here:<\/p>\n<blockquote><p>Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We\u00a0<em>should<\/em> forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only\u00a0<em>after<\/em> that code has been identified.<\/p><\/blockquote>\n<p>That was <a href=\"http:\/\/web.archive.org\/web\/20130731202547\/http:\/\/pplab.snu.ac.kr\/courses\/adv_pl05\/papers\/p261-knuth.pdf\">written by Donald Knuth back in 1974<\/a>. 
While he was applying the rule to one specific area of programming, it is sage advice that still applies.<\/p>\n<p>In his case he was talking about moving some code out of a loop, which mattered with the slow computers of the day. At the time computer speeds were measured in kilohertz, 4-bit processors were\u00a0common, and programmers generally had advanced degrees in mathematics or engineering. \u00a0If you read the code in the article you might think &#8220;is that really a big change?&#8221; but back then he was reducing a loop from taking about 300 milliseconds per iteration to 200 milliseconds per iteration, so that is indeed a big difference.<\/p>\n<h2>We\u00a0<em>should<\/em> forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.<\/h2>\n<p>This line became a catchphrase for several decades. It is painful for many people to read, but it is important to understand. \u00a0Generally the code you work with every day is not a performance concern. \u00a0You should be smart about what you write, as bad code can ruin performance, but in the general case, if you try to write code that is defect-free and follows typical coding standards, the code will run just fine.<\/p>\n<p>Generally we don&#8217;t need to bother. \u00a0Generally we can go on our merry ways, blissfully ignoring anything about performance.<\/p>\n<p>Some time ago I wrote up an <a href=\"http:\/\/BryanWagstaff.com\/index.php\/a-journey-through-the-cpu-pipeline\/\">overview of what is happening inside the CPU<\/a>\u00a0which you may want to read if you aren&#8217;t familiar with it. Modern processors are amazing. Each processor core (or virtual core for hyper-threaded processors) can decode up to four instructions at once. \u00a0Internally there are execution units that can run three or more micro-operations per cycle. Generally the biggest problem is not the time required to run an individual computation. 
The most common problem is ensuring there is data to work with.<\/p>\n<p>And that leads us to the first big feature\u00a0that is part of that all-important 3%.<\/p>\n<h2>Yet we should not pass up our opportunities in that critical 3%.<\/h2>\n<p>As much as we don&#8217;t need to worry about performance, that doesn&#8217;t mean we should intentionally make things slow. \u00a0It also isn&#8217;t universal. His off-the-cuff number of 3% feels about right to me in practice. \u00a0Usually most of the code isn&#8217;t a bottleneck. \u00a0Usually you find just a few elements in the code that are painfully slow, and a few more elements in code that cause performance to drop in unacceptable ways.<\/p>\n<p>What follows are some simple patterns you can follow to help with those critical parts.<\/p>\n<h1>Keep your CPU Well Fed<\/h1>\n<p>The first one of those rules is to make sure you take advantage of something modern computer processors do extremely well.<\/p>\n<p>First, let&#8217;s give you a sense of scale. I&#8217;m going to talk about things like nanoseconds, but numbers like that are hard to picture. Modern processors are able to compute values really fast. \u00a0There was a teacher years ago who wanted to show people how short a nanosecond is. \u00a0Most people are familiar with light, and are aware of how fast it travels. \u00a0Imagine a length of about 12 inches &#8212; about 30 centimeters. \u00a0For many people that is roughly the length from the elbow to the hand. \u00a0That distance is how far light can travel in one nanosecond.<\/p>\n<p>A single CPU core generally runs about\u00a0four cycles per nanosecond, and each core can decode up to\u00a0four instructions per cycle, although often it is limited to one or two; generally the number is closer to 8-10 decoded instructions per nanosecond. 
\u00a0And most computers have two or four or even more CPU cores inside them, with some server CPUs containing 24 cores. \u00a0On such a system, by the time a single photon of light can travel from your elbow to your hand, more than 250\u00a0operations could have been decoded. \u00a0Or, if the programs are stalled, they could be sitting around idle waiting for data. \u00a0Waiting around as light travels meter after meter, accomplishing no work, with terrible performance.<\/p>\n<p>The first goal, then, is to make sure the CPU has something to do and doesn&#8217;t spend time waiting around for work.<\/p>\n<p>Everything you do in computer programming requires two things: \u00a0instructions and data. \u00a0Both of these are kept in memory. \u00a0Unfortunately memory is far slower than the CPU. \u00a0Even though\u00a0computers get faster and faster every year, usually CPU speeds improve slightly faster than memory speeds.\u00a0Today that means that even though a single CPU cycle takes about 0.25 nanoseconds, a request to get something from main memory takes about 120 nanoseconds. \u00a0Think about the scale for a moment. \u00a0While the CPU could be doing about 10 things in the time light travels the length of your forearm, instead it sits idle while light travels the length of a building. If each CPU cycle were one second long, the CPU would be sitting there for roughly eight minutes waiting for data instead of moving\u00a0on to the next instruction. \u00a0Then the next instruction would be run, it would need data, and again the CPU would stall. \u00a0Since most instructions require data, without a cache the CPU would be waiting around like this for every instruction rather than doing useful work.<\/p>\n<p>Starting in the 1980s CPU\u00a0designers incorporated faster memory called caches. \u00a0This high-speed memory is expensive, so caches are typically small. 
\u00a0I&#8217;ll go over the different caches and their approximate speeds:<\/p>\n<ul>\n<li>L1 cache is the fastest and smallest. Today&#8217;s\u00a0CPUs typically have 64 kilobytes of L1 cache. Usually 32 kilobytes is for instructions, and 32 kilobytes is for data.\n<ul>\n<li>Access time for one &#8220;line&#8221; (64 bytes) is generally less than one nanosecond, or about\u00a03-4 clock cycles.<\/li>\n<li>Generally programs can access any additional byte on the line for zero (0)\u00a0additional clock cycles. \u00a0The first byte costs some time, the remaining 63 are FREE and FAST.<\/li>\n<\/ul>\n<\/li>\n<li>L2 cache is a little slower and a little larger, and today&#8217;s CPUs typically have 256 kilobytes per CPU core.\n<ul>\n<li>Access time\u00a0is about 3 nanoseconds, or about 10 clock cycles.<\/li>\n<li>Cache lines are currently also 64 bytes. If you use one of the bytes or all 64 bytes, the cost is the same.<\/li>\n<\/ul>\n<\/li>\n<li>L3 cache is slower and larger still, and\u00a0ranges from not being present on cheaper or slightly older computers, to\u00a04-16 megabytes on high end desktop computers, up through 50+\u00a0megabytes on high-end server chips.\n<ul>\n<li>Access time is about\u00a012 nanoseconds, or about 50 clock cycles.<\/li>\n<li>Blocks are generally 64 bytes.<\/li>\n<\/ul>\n<\/li>\n<li>Main memory, commonly 2 to 32 gigabytes, lives on the long modules usually installed in pairs on your motherboard.\n<ul>\n<li>Access times are around 100 to 150 nanoseconds for most systems, but can be much slower on cheap chips.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p>Those are the places your data can live. \u00a0The CPU is constantly working to predict what data the program will need next so it can prefetch the data to the cache. 
Because of the way the <a href=\"http:\/\/BryanWagstaff.com\/index.php\/a-journey-through-the-cpu-pipeline\/\">journey through the CPU pipeline<\/a> works, when data is available on an L1 cache line the first time it is used there is a small delay, but using adjacent data is (practically) free. \u00a0So if code is reading four 32-bit values and they&#8217;re all in an L1 cache line together, it takes almost exactly the same amount of time to use one of those values as it does to use all four of those values; you pay the price once and get the following three for nearly free.<\/p>\n<p>That should make it clear that there is a big performance benefit to keeping your data in the closest cache available. The farther the data is from the CPU, the longer the wait.<\/p>\n<p>Instructions are pretty easy to predict. Most instructions do one thing and then advance to the next instruction. \u00a0Some instructions cause a jump to a new location; the CPU can figure out where the jump will go and start prefetching that location. Data is sometimes harder to predict. If the CPU can observe that the data is in a specific place it can prefetch it immediately. If the CPU can observe that the data is at a location relative to somewhere it already knows, it can also prefetch it. \u00a0Sometimes the CPU needs to recognize more complex memory access patterns, and today&#8217;s CPUs are quite good at this type of prediction.<\/p>\n<p>For code that usually means avoiding branches. \u00a0Thankfully your compiler and optimizer do a really good job of rewriting your code behind your back to reduce branches or to make them into easy-to-predict jumps. \u00a0You can help your compiler out here by keeping your code simple and following simple paths. Once upon a time it was common for programmers to use jump tables and dynamic return addresses to speed up code, or somewhat complicated tools like Duff&#8217;s Device. Don&#8217;t do that. 
Keep it simple and linear, keep it easy to predict.<\/p>\n<p>For data, that usually means accessing values sequentially. \u00a0Walking across an array of integers or an array of floating point values is very easy for the CPU to predict. \u00a0As you progress along the array the CPU will observe the pattern in the CPU pipeline and start prefetching values immediately. Walking across an array with a stride is a little bit harder for the CPU to predict unless the stride happens to be at specific boundaries, such as\u00a0at 8-byte or 32-byte or 64-byte steps.<\/p>\n<p>Unfortunately those steps mean you cannot reuse adjacent values, and remember from above that adjacent values are nearly free. \u00a0Many programmers will build a structure with values, such as a structure with X, Y, and Z, and then build an array of these structures. Let&#8217;s say the total object size is 64 bytes. As the processing takes place, the prefetch system sees the accesses\u00a0at base+48, base+52, base+56 for the first item, then another set\u00a0at base+112, base+116, and base+120 for the second item, then at base+176, base+180, base+184, etc. The prefetcher is smart enough to figure out what to prefetch, but the data accesses can be made faster. \u00a0Instead of an array of structures, programmers should identify the pattern themselves and try to build systems that traverse a structure of arrays. \u00a0That is, have an array of X values, an array of Y values, and an array of Z values, and work on all the arrays at once. \u00a0During processing it looks like baseX+0, baseY+0 and baseZ+0; then baseX+4, baseY+4, baseZ+4; then baseX+8, baseY+8, baseZ+8, and so on. 
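<\/p>\n<p>As a quick sketch of that layout change (the type and field names here are illustrative, not from any particular engine), the structure of arrays keeps each component contiguous so a pass over one component wastes no bytes of any cache line:<\/p>\n
```cpp
#include <cstddef>
#include <vector>

// Array-of-structures element: the x, y, z of one item are adjacent,
// but consecutive x values sit a full struct apart in memory.
struct PointAoS { float x, y, z; };

// Structure-of-arrays: all x values contiguous, all y values
// contiguous, all z values contiguous.
struct PointsSoA {
    std::vector<float> x, y, z;
};

// A pass that only needs the x components walks one contiguous array,
// so every byte of every fetched cache line is useful.
float sumX(const PointsSoA& pts) {
    float total = 0.0f;
    for (std::size_t i = 0; i < pts.x.size(); ++i)
        total += pts.x[i];
    return total;
}
```
\n<p>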
\u00a0Now the items are accessed contiguously in cache, and you get benefits both from prefetch and from being adjacent in the cache.<\/p>\n<p>Even better, when you are traversing a structure of arrays sometimes the compiler will identify the pattern as something that can run on SIMD instructions &#8212; that is Single Instruction Multiple Data &#8212; and provide even more speedups behind your back.\u00a0Even if it doesn&#8217;t find those automatically, advanced programmers can use SIMD instructions to work on several values with a single command, sometimes 2, 4, or 8 values, or even 32 math operations done at once, all using a single cache line. \u00a0That&#8217;s among the biggest benefits you can get. Not only does it keep the CPU fed with data, it consumes and processes the data quickly as well.<\/p>\n<p>For the analogy, with one CPU core and one nanosecond being the length of your forearm, that&#8217;s up to 128 operations during that time. On an 8-core system that&#8217;s 1024 operations during the time required for light to travel the length of your forearm. \u00a0That&#8217;s blazing fast performance acquired by making your data line up sequentially.<\/p>\n<p>On the flip side, jumping around in data generally produces a lot of cache misses. \u00a0That is not always the case, since you can craft algorithms that jump around in predictable patterns or repeatedly follow a reliable path. But most of the time if you are hopping around through a data structure doing a binary search or similar, every hop you take is likely to miss the cache, so every access triggers an additional delay. 
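<\/p>\n<p>To make the comparison concrete, here is an illustrative helper (not from the article) that finds the first element greater than or equal to a target with a plain sequential scan; it returns the same index as the binary-search based std::lower_bound, but touches memory in the order the prefetcher handles best:<\/p>\n
```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Sequential equivalent of std::lower_bound on a sorted vector:
// returns the index of the first element >= target, or size() if
// there is none. The scan streams through adjacent cache lines
// instead of hopping around the array.
std::size_t linearLowerBound(const std::vector<int>& sorted, int target) {
    for (std::size_t i = 0; i < sorted.size(); ++i)
        if (sorted[i] >= target)
            return i;
    return sorted.size();
}
```
\n<p>On arrays of a few thousand elements the scan is often faster in wall-clock time despite the worse big-O; measure both before committing to either.<\/p>\n<p>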
\u00a0This can have a big impact on performance.\u00a0In computer science classes everyone is taught that a linear search of data takes roughly O(n) time, and a binary search of the data takes O(log n) time, but people tend to forget that those are only talking about the asymptotic case, or what happens when the number of items is so big that it dwarfs all other effects like cache speed. \u00a0For small pieces of data, when you are working with thousands or even tens of thousands of adjacent values, the cache misses are far more expensive than traversing the array. \u00a0Let&#8217;s consider a sorted array with 1000 items in it. A binary search will find the item in about 10\u00a0hops. Each one of those can cause a cache miss; at roughly 40 cycles per miss (somewhere between L2 and L3 latency), that is about 400 cycles, or about 100 nanoseconds. In contrast, if the data were searched in a linear array using SIMD instructions to get roughly 4 tests per cycle, the entire array of 1000 items could be searched in about 250 cycles or about 65 nanoseconds.<\/p>\n<p>You don&#8217;t need to make big changes for this one, but when in doubt choose simple linear structures and linear access if you want fast code. \u00a0This one isn&#8217;t a premature optimization, it is the basic step of selecting an algorithm that takes advantage of the fundamental designs of today&#8217;s hardware.<\/p>\n<h1>Profile before, profile after<\/h1>\n<p>So you&#8217;ve built your code and you still think it is slow. \u00a0You might guess at where the code is slow, but the real answer is generally surprising. \u00a0The quote at the top of the article covers this nicely:<\/p>\n<blockquote><p>A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only\u00a0<em>after<\/em> that code has been identified.<\/p><\/blockquote>\n<p>This is where the scientific discipline in programmers becomes critically important. 
\u00a0We do not guess at where performance is suffering; we use tools to measure the code, to figure out exactly what is slow and why, to correct the code, and then measure again to ensure the issue is addressed.<\/p>\n<p>The tool to create those measurements is called a profiler. I&#8217;ve been thinking about writing a few examples of using various profilers, but it will need to wait for another posting. \u00a0There are two main kinds of profiler: the sampling profiler and the instrumenting profiler. \u00a0Both of them are useful for different measurements.<\/p>\n<p>A sampling profiler runs alongside your program, hooked up to the process using the debugging system. Every few microseconds the profiler pauses the program in a debugger, looks at the debug information to see where the program currently is, then notes the location and resumes the program. \u00a0Running for\u00a0a few seconds\u00a0allows the sampling profiler to peek into the program several thousand times. The developer can then stop profiling and look at the statistics. \u00a0Usually you will discover that one or two functions show up more frequently than all the others. It generally means the function is doing a lot of work in that location and needs to be addressed.<\/p>\n<p>An instrumenting profiler is generally a library of code that needs to be compiled into your program. \u00a0A side effect is that instrumenting profilers can affect the performance results. However, the instrumenting profiler reveals far more information, such as how many times each function was called, what child functions were called, and how long each call took. \u00a0Like above, generally when you look at profile results a few functions will show up as consuming inordinate amounts of time. 
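<\/p>\n<p>The core idea behind instrumentation can be sketched with a hand-rolled timing helper (the helper name is hypothetical; real instrumenting profilers insert hooks like this into every function automatically):<\/p>\n
```cpp
#include <chrono>

// Run a piece of work and report how long it took in microseconds.
// This is the essence of instrumentation: wrap the code of interest
// in timestamps, then aggregate the measurements.
template <typename F>
long long timeCallMicroseconds(F&& work) {
    auto start = std::chrono::steady_clock::now();
    work();                                // the code being measured
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}
```
\n<p>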
\u00a0The instrumenting profiler can also show you functions being called surprisingly many more times than necessary, or reveal code paths you did not realize existed.<\/p>\n<p>The first few times code is profiled there are generally some surprising realizations. In one code base, I discovered that a library was calling strlen() to calculate the length of a few strings millions of times each graphics frame. \u00a0The programmer had made the simple mistake of calling strlen() in a loop condition as various strings were processed, so each pass through the loop called the function again. \u00a0Each string was a few hundred characters long, and there were a few thousand strings, resulting in millions of calls to the function. In another code base we discovered that std::vector allocations were taking place thousands\u00a0of times every frame; programmers were simply appending to arrays but they had forgotten to reserve space in advance. \u00a0Instead of a single memory allocation up front followed by a bunch of insertions, each value being added triggered the array to grow, which meant a new buffer was allocated and all the old values copied over. \u00a0In another code base we discovered that one of the logging utilities was incredibly slow each time the code created a log entry; instead of working in the background the logger was misconfigured to wait until the data was sent to the disk before returning control to the program.<\/p>\n<p>Usually the first time you profile the code these errors are quite obvious and easy to correct. \u00a0Measure to find the worst offenders. Make a change that you think will correct the error, and measure again to ensure the change worked. 
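<\/p>\n<p>The std::vector fix described above is one line; here is a minimal sketch, with a hypothetical function name:<\/p>\n
```cpp
#include <cstddef>
#include <vector>

// Without the reserve() call, the vector reallocates and copies its
// contents every time it outgrows its buffer. With it, there is a
// single allocation up front and the appends are cheap.
std::vector<int> buildValues(std::size_t count) {
    std::vector<int> values;
    values.reserve(count);            // one allocation, done once
    for (std::size_t i = 0; i < count; ++i)
        values.push_back(static_cast<int>(i));
    return values;
}
```
\n<p>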
\u00a0Normally the code will be much faster and you can submit your code to the code repository with a comment like &#8220;now 97% faster&#8221;.<\/p>\n<p>Once those easy issues are resolved the profiler gets harder to read.<\/p>\n<p>Soon you will start looking over the instrumenting profiler results, hunting for functions that are suspicious. You&#8217;ll look for\u00a0searches and processing that fit within the processing budget but still take longer than you think they should.\u00a0You&#8217;ll find loops that are running a reasonable number of times, but wonder if they could be reduced. You&#8217;ll find functions that are being called for utility reasons, but question if they are really useful. This profiling gets much more difficult. \u00a0Always remember whenever you make a change that you measure before, measure after, and be certain you made an improvement.<\/p>\n<p>As your code matures and has been profiled by several developers, you&#8217;ll find that the slowness isn&#8217;t caused by programmer error; instead it is caused by using an algorithm that is less efficient than the alternatives. \u00a0This is where the 3% rule kicks in from the quote. \u00a0It is also an area where experience and research history are extremely valuable. The developer will be reviewing the profiler results along with the code, and recognize that the task can be solved using a different algorithm. \u00a0Usually these better-performing algorithms are more difficult to implement or use more awkward data structures, but they produce results more efficiently. Perhaps an algorithm has an additional overhead of 700 nanoseconds but saves 50 nanoseconds off each loop iteration, so every frame the total compute\u00a0time required\u00a0is reduced by 12%. The only way to know for certain that it actually improved the situation is to measure before and measure after, so be certain you figure out exactly how much more awesome the new code is by measuring it. 
If you forget to measure, your code may actually be slower; many times I&#8217;ve seen algorithms replaced in the name of speed, only for profiling to reveal they are slower. \u00a0When I see someone submitting a change for optimization reasons, either the improvement had better be obvious, or I expect them to include the before-and-after timings in the change summary.<\/p>\n<p>Also as a good practice, if you use a fancy algorithm or something that isn&#8217;t immediately obvious it is best to add a comment in the code about it. \u00a0Perhaps something like: \u00a0<em>\/\/ this uses the Quake3 inverse square root method documented at {website} for much faster computing.<\/em>\u00a0 For technical papers from conferences or other presentations it can make sense to include the document describing the algorithm\u00a0in the same place in the source tree so future programmers don&#8217;t need to hunt for it and hope the paper hasn&#8217;t moved on the Internet. \u00a0Make sure the paper is freely available for use that way before you include it in your source tree. Having the description of the algorithm on hand helps when someone comes along to fix a bug, needs to understand what the code is doing, or wants to see whether a faster method is available.<\/p>\n<p>Profile before, profile after, and include the speedup difference in the checkin notes and comments.<\/p>\n<h1>Framerate in Frames Per Second is nearly useless for developers<\/h1>\n<p>This one isn&#8217;t so much of a performance concern, but it is something I see frequently online and feel it is important for developers to correct.<\/p>\n<p>Having a bright yellow number showing your frames per second isn&#8217;t particularly useful if you&#8217;re looking for performance metrics. 
\u00a0You can have code that runs at 60 or 75\u00a0or even 120 frames per second and still looks like a slide show.<\/p>\n<p>Far more useful is a triple set of numbers: \u00a0minimum frame time, average frame time, maximum frame time. \u00a0You can throw the frames per second in as well, if you want. Usually these are given in milliseconds. Also, usually these are updated on the display once per second so they aren&#8217;t in constant flux.<\/p>\n<p>Then you can look at the numbers and know far more about your program.<\/p>\n<p>The values: \u00a012.002 min\/14.351 avg\/15.234 max tell you that your fastest frame took about 12 milliseconds, the slowest frame took about 15 milliseconds, and the average is about 14 milliseconds. \u00a0At that rate you&#8217;ve missed the 13.3 millisecond budget of 75 Hz monitors and you are getting close to missing the 16.7 millisecond budget of the most common 60 Hz monitors. It means you need to start hunting for performance improvements, but that you&#8217;re basically stable.<\/p>\n<p>The values: 7.032 min \/ 14.351 avg \/ 29.245 max tell you that your game&#8217;s performance is erratic. You&#8217;ve got the same average frame time as the example above, but several of your frames take far too long to display, and others return unexpectedly fast. Unbalanced numbers like these mean you need to look for erratic processing. You are doing different amounts of work each frame, so you&#8217;ll need to find work that can be spread out more evenly over time, and hunt for those frames that take too long. If you are shooting for smooth performance on 75 Hz monitors or 120 Hz monitors you&#8217;ve got a long way to go to ensure no frames are taking too long.<\/p>\n<p>Finally, when people see really high frame rates, something like 2513 frames per second, and then a change drops the number by several hundred frames per second, that is not an issue. 
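<\/p>\n<p>A small helper can produce the min\/avg\/max triple described above; this sketch (the struct and function names are hypothetical) collapses one second of recorded frame times into the three numbers:<\/p>\n
```cpp
#include <algorithm>
#include <vector>

struct FrameStats { double minMs, avgMs, maxMs; };

// Collapse one second's worth of frame times (in milliseconds) into
// the min/avg/max triple shown on the debug display.
FrameStats summarizeFrameTimes(const std::vector<double>& frameMs) {
    FrameStats stats{0.0, 0.0, 0.0};
    if (frameMs.empty())
        return stats;
    stats.minMs = *std::min_element(frameMs.begin(), frameMs.end());
    stats.maxMs = *std::max_element(frameMs.begin(), frameMs.end());
    double total = 0.0;
    for (double ms : frameMs)
        total += ms;
    stats.avgMs = total / static_cast<double>(frameMs.size());
    return stats;
}
```
\n<p>Average frames per second, if you still want to display it, is simply 1000.0 divided by the average frame time in milliseconds.<\/p>\n<p>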
\u00a0If that same code were using the triple described above they&#8217;d see they&#8217;re taking less than one millisecond per frame, and they&#8217;ve jumped from about 0.4 milliseconds to about 0.5 milliseconds. Both speeds are far below what you need to worry about.<\/p>\n<p>That&#8217;s going to be it for this session. If you&#8217;ve got additional ideas on what to cover in more depth or other comments, drop a comment or send me a message.<\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Working with games\u00a0means programming, but it is also a special kind of programming. \u00a0Working on a game you know will be used for competitive play is in\u00a0some ways like working on a Formula One race car: everything you do needs to consider performance. Even on other projects, you may know that the game is going &#8230;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[2,5,4],"tags":[],"class_list":{"0":"post-320","1":"post","2":"type-post","3":"status-publish","4":"format-standard","6":"category-education","7":"category-games","8":"category-programming","9":"anons"},"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/bryanwagstaff.com\/index.php\/wp-json\/wp\/v2\/posts\/320","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/bryanwagstaff.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/bryanwagstaff.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/bryanwagstaff.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/bryanwagstaff.com\/index.php\/wp-json\/wp\/v2\/comments?post=320"}],"version-history":[{"count":8,"href":"https:\/\/bryanwagstaff.com\/index.php\/wp-json\/wp\/v2\/posts\/320\/revisions"}],"predece
ssor-version":[{"id":354,"href":"https:\/\/bryanwagstaff.com\/index.php\/wp-json\/wp\/v2\/posts\/320\/revisions\/354"}],"wp:attachment":[{"href":"https:\/\/bryanwagstaff.com\/index.php\/wp-json\/wp\/v2\/media?parent=320"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/bryanwagstaff.com\/index.php\/wp-json\/wp\/v2\/categories?post=320"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/bryanwagstaff.com\/index.php\/wp-json\/wp\/v2\/tags?post=320"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}