This file was last updated on March 28, 2005

Introduction

This document's primary purpose is to attempt to convince you that your efforts on performance optimization are probably misdirected (give it up d00d, you are wasting your time!). However, if you disagree or are going to ignore my advice, I provide a few pointers on the art of performance optimization. The programming languages this document focuses on are C and C++ (all the way down at the bottom), but as the general concepts are applicable to all languages it is probably worth a skim even if you are a Perl, Java or PHP programmer. I say the 'art' of optimization as opposed to the 'science' because there are so many confounding factors in almost any real-world situation that it is nearly impossible to have any assurance that the changes you make will have a positive impact (it is quite common to get worse performance when attempting an optimization!). Even though there is a section on C/C++ code optimization near the bottom, please try to avoid jumping there and read this document sequentially. I start with the places that have the biggest bang-for-the-buck first, and since the code bits are at the bottom, obviously that means I feel that code optimization has the smallest bang-for-the-buck. If you are reading this because you think I may be smarter than you are about optimization, why not give me the benefit of the doubt (for at least the time it takes to read this) and stick with my game plan?

Special thanks to Scorpions4ever, infamous41md and DaWei_M for their comments.



Why optimization is usually a waste of time

There are a fixed number of hours in a fixed number of days. Since time is a fixed resource, how you choose to apply your time can have great ramifications for your productivity. If you spend a week on 'optimization' of a program that then fails to run any faster in a production environment, that week is lost forever, likely impacting delivery dates and possibly even bonuses, reviews and, in the worst case, your job itself. Learning when and where to spend effort on optimization is part of the art. Recognizing when there is very little chance that any optimization will pay off, even under ideal conditions, is just as critical as understanding how the flow of data between caches can result in a 100 fold performance boost. The 'real' nitty gritty optimization requires detailed understanding of hardware, memory access, instruction behavior, etc. However, this only matters if there is no performance to be squeezed out of any other part of the program AND the CPU is actually the bottleneck.

There is also the payback issue. Let's say it takes 4 hours for a given program to run and, based on your experience, you could probably get the program to run in 5 minutes or less. Sounds like a no-brainer, right? What if that program will only ever be run once and it will take you 6-8 hours to make the modifications and test them? Net result: a waste of 2-4 hours, a very poor choice for optimization. On the other hand, it may be valuable to work on a process that is taking 1 second when it is practical to get it to run in 0.01 seconds, if it is running over and over again and the change takes little time to code and test (in one situation there was an insert trigger in a database that was doing a scan on a table with millions of rows and taking 3 seconds to execute; simply adding a new index to the base table gave a 67 fold improvement in performance, for a cumulative time savings of hours in just a week).

Another critical issue to keep in mind when deciding whether to invest time in optimization: modern processors are so bloody fast that they are actually in a wait state 70% or more of the time even when they are reported as being at 100% utilization. This unfortunate reality is due to the massive discrepancy between CPU register bandwidth and RAM IO bandwidth (which is why all these caches are spread all over). If, by some miracle, you are able to utilize these wasted cycles (keep in mind that 30% utilization is often the highly tuned optimum; in reality it may be much closer to 5% or less) you can see awesome performance boosts. If you do manage that, call me and tell me how you did it, then call Intel, AMD, IBM, etc. and get a job earning huge dollars, as they have brilliant people there who are hard pressed to reach that 30% utilization level.

Understanding IO and OS data/instruction caching is critical to reproducing and interpreting experimental results. If the machine has been freshly rebooted then there is essentially nothing (related to your program and/or data) in the various caches. Therefore, on program startup the OS has to first read in your program from the disk (it typically does so by memory mapping the file, so the file may not actually ever entirely make it into RAM), then your program has to do its startup, likely reading from a file somewhere (or possibly even a database on another machine). Up to this point there is nothing you can do to your program to have the slightest impact on its performance. If your program quits at this point and then you rerun it, you are likely to find that its 'performance' is much better. That is because the instructions and data have already been read into memory and it is working from a cached version. Let's say your program processes a large (multi-megabyte) file. The very first time it processes the file it is slow; thereafter it runs 10-100 times faster. Why? Because the first time the file was on disk and had to be read in (disk IO is often 1,000 times slower than RAM IO); the second time it was already cached in RAM. How do you measure that sort of thing during optimization trials? It ain't easy! Rebooting your machine after each recompilation would be very tedious and time consuming. But look, if your process is IO bound, there is nothing you can do to make it complete sooner anyway, so why are you bothering with optimization? The best you can do is free up CPU cycles for other processes! If you find that your program is IO bound, then you are done 'optimizing' unless you can figure out some way to reduce the disk access time (via compression, use of binary files, perhaps indexes if you don't need the entire dataset, etc.).
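
A quick and dirty way to see this effect (and to get a feel for whether you are IO bound at all) is to run your program under the time command twice in a row; the program and file names below are made up for illustration. If the 'real' (wall clock) time is much larger than 'user' plus 'sys', your process is spending most of its life waiting on IO, not computing, and the second run should look noticeably better than the first:

    time ./myprog bigdata.dat    # first run: OS caches are cold, lots of disk waiting
    time ./myprog bigdata.dat    # immediate rerun: program and file are now cached in RAM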

Let's say you are running a program communicating over a socket (to a database, for file sharing, web, mail, etc.). While excellent networks today are often faster than pedestrian disk IO (gigabit Ethernet vs IDE drives), they are still in the realm of 1,000 times slower than RAM. To add insult to injury, there is no caching with network communication, so every single request will have to be repeated at its same slow rate. Therefore (as with disk IO), if your program is network bound, you are done; there is nothing to optimize.

There is also the cost of maintenance. It is very rare that optimization results in code that is easier to read and maintain (or even the same); generally it results in more complex and difficult to follow code. Be sure that when you factor in the cost of programming the changes you also add in the cost of future maintenance!

So you are going to optimize anyway

OK, the presumption at this point is either you are too hard-headed to take my advice and devote your time to something more useful or you are convinced that there really is some room for optimization. I am going to attempt to lead you down the garden path and try to focus your attention on the major places to get the biggest performance payoffs. If you have read this far and haven’t yet skipped to the actual code section, please bear with me a bit longer. Just to repeat: the code is the last place to look for optimization.

It is the algorithm

Never (and I mean NEVER) start optimization until you have reviewed and understood the algorithm. It is quite possible to get a multiple order of magnitude increase in performance just by changing the algorithm. Besides, if you don't know what the algorithm is doing, how can you rationally expect to have a meaningful impact on performance? This is not to say that you need a complete understanding from end-to-end, nor a total comprehension of the detailed requirements of the complete application, but you do need to have a good working knowledge of what the point of the whole thing is. For instance, say you have an app that is FTP'ing data to another machine and, because it is doing so through an SSL tunnel, it is now a performance bottleneck. What if it turns out that the target machine was NFS mounted and you could simply copy the files directly? (This, by the way, is a true story!) Understanding that you were simply copying files from an internal working directory to an external public directory would allow you to make that intellectual 'leap' to realize that the FTP (and the SSL tunneling added as a 'security' measure) was not needed, so instead of spending time optimizing the assembler on the SSL encryption algorithm (like you could beat the guys who do this for a living), you can simply restructure the program. Another example: I was asked to work on a process that was taking 15 hours or longer to run, and as a consequence was only being executed once every 6-8 weeks. After studying the algorithm (basically it was making around 14 database queries for each of approximately 90,000 records), I realized I could cache the data from the database. That resulted in the program running in less than 15 minutes (a 60 fold improvement!) and it is now being run daily. By changing only the algorithm I got these dramatic improvements in performance: no change to the language (it was still written in Perl), no change to the data (it was still talking to a database across a network, though I did tweak the queries quite a bit), and, other than the changes to accommodate the cached data, no changes to the program itself.
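
For what it is worth, here is a minimal sketch of that caching idea, translated into C++ for illustration (the real program was Perl); query_customer() is just a stand-in for whatever expensive database round trip you are making. The first request for a given key pays the full price, every repeat comes out of local memory:

    #include <map>
    #include <string>
    #include <utility>

    // Stand-in for the expensive, per-record database query.
    std::string query_customer(int id);

    // Cache results so each distinct id hits the database only once.
    const std::string &cached_customer(int id) {
        static std::map<int, std::string> cache;
        std::map<int, std::string>::iterator it = cache.find(id);
        if (it == cache.end())
            it = cache.insert(std::make_pair(id, query_customer(id))).first;
        return it->second;
    }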

Sometimes it is the hardware

People sometimes forget that not all machines have been upgraded as the years have gone by. Sometimes you can get order of magnitude performance increases by simply changing machines. For instance, years ago when I was a tester my approach was to write my own program to the same requirements and then test the output from my program against that of the program I was testing. If they were the same, there was a good (but far from absolute) chance that both programs matched the requirements. In this particular case there was this pile of data that had to be matched against this other pile of data. I wrote my program in C on an un-utilized dual 275 MHz Alpha with 256 MB RAM (an awesome machine for the day). It ran in 3 hours and was IO bound the entire time (meaning that any further optimization to the code was a total waste of effort). As it turned out, the process I was going to test was performing a bit slower: it was estimated to take 15 days! The main reason for this (an SAS program on an older IBM mainframe) was that the machine was being hammered AND the data was being 'cached' on tape. The magic wand was waved and my process became production and the other programmer's process was then used to validate my results. Even though this was far from an even comparison (not even apples to oranges, possibly lizards to trees), it could be argued that I was able to realize a 120 fold (more than two orders of magnitude!) increase in performance by utilizing more modern hardware.

Occasionally it is the program logic

Sometimes the best intentions are beaten by the actual implementation. I had a process that was very parallelizable and wanted to get performance boosts by breaking the problem into bits and running the bits at the same time. I felt the problem was perfect for multi-threading (my first heavy-duty effort in threading) and spent a week or so breaking the problem out so it could be run in parallel. The 7 parallel threads (I was on an 8 processor machine) ran in about 110% of the time of the single threaded process! I made my way to one of our in-house gurus and, after looking at my code for about 5 minutes, he said the problem was clear: I was using multi-threading! I guess that blank look on my face spoke volumes because he then immediately launched into an explanation and it was several minutes before I caught up with it. It turns out that an object I was using was constantly allocating and de-allocating memory. As each thread had the same sort of object, all doing that allocation/de-allocation, they were competing against one another. Multi-threaded programs (at least on Solaris) have only a single heap and have to access it through a critical section (meaning only one thread can be using it at a time). Because of this contention, the program was essentially running as if it were single threaded, except with the additional overhead of thread scheduling. It may be obvious in retrospect (though if I had thought of it I probably would have assumed that the object in question would keep the memory it had already allocated), but nothing in the code I wrote was obviously causing problems. My final solution was to multi-process my application, thus solving the single heap problem (each process gets its own heap, doanchano). Of course I did not get a linear performance boost, but I was able to get the program's total run time from something like 90-120 minutes to a more workable 20-25 minutes.
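
If you want to see the shape of the problem, here is a tiny sketch (modern C++ threads for brevity; the original was Solaris threads, and the names and sizes here are made up). Both workers do the same pretend processing; the difference is that one goes through the shared, lock-protected heap on every pass while the other allocates its scratch buffer once per thread and reuses it:

    #include <thread>
    #include <vector>

    // Anti-pattern: every iteration allocates and frees, so all threads
    // end up queuing on the single heap's lock.
    void worker_thrash(int iters) {
        for (int i = 0; i < iters; ++i) {
            std::vector<char> buf(64 * 1024);   // malloc on every pass...
            buf[0] = 1;                         // ...pretend to work...
        }                                       // ...free on every pass
    }

    // Better: allocate the scratch space once per thread and reuse it.
    void worker_reuse(int iters) {
        std::vector<char> buf(64 * 1024);
        for (int i = 0; i < iters; ++i)
            buf[0] = 1;
    }

    int main() {
        std::vector<std::thread> pool;
        for (int t = 0; t < 7; ++t)
            pool.emplace_back(worker_reuse, 1000000);
        for (auto &th : pool)
            th.join();
        return 0;
    }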

The 10/90 rule

In almost all cases, 90% or more of the execution time is spent in 10% or less of your code (DaWei_M tells me that in his experience it is closer to 95/5, i.e. 5% of the code is doing 95% of the work). Knowing where that 10% is makes all the difference in the world. There is no point in optimizing code that is executed once, is IO bound anyway and represents no measurable decrease in overall runtime if eliminated altogether. There is also the plain math of optimization. Say you have a section of code that represents 50% of your overall execution time. Let's wave the magical optimization wand and say you can get its execution time to zero (and still do what needs to be done). The best that can happen is that the overall runtime of your program will be cut in half! The places to optimize are the places that are taking up most of the execution time. Unless you are working on a very long running process, it is probably not worth your while to even consider expending effort unless the section of code you are going to tackle is taking up at least 50% of the overall execution time. Sometimes it is best to just move on to other projects. Really.

If you have the GNU tools available to you, gprof can help you understand where your code is spending its time.
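
For the GNU toolchain the basic recipe looks like this (the program name is a placeholder): compile and link with -pg, run the program normally to produce gmon.out, then ask gprof for the report:

    gcc -O2 -pg myprog.c -o myprog    # build with profiling hooks (keep optimization on!)
    ./myprog                          # a normal run writes gmon.out in the current directory
    gprof ./myprog gmon.out > profile.txt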

How modern hardware works, an overview

My focus is on commonly available CPUs running popular commercial operating systems; there are plenty of exceptions, but this accounts for probably better than 95% of the cases. If you are working on embedded programs with small amounts of resources then a lot of what I am talking about won't be useful (some may, but having never done that sort of programming I can't offer suggestions as to what is and what isn't). There are essentially three layers of memory associated with any given process. The slowest (and cheapest) is disk (does anyone still use tape?). Disk IO is about 1,000 times (that is three orders of magnitude!) slower than RAM access, which is our second tier of memory. RAM is also much more expensive (though seemingly getting cheaper every day; my now low-end workstation has 1.5 GB RAM, more than hard drives used to have not that long ago). Then, at the top tier (and most expensive of all) are the registers where the CPU actually does its work. The current 32 bit Intel chips have some 8 general purpose registers, so 8 * 4 = a massive 32 bytes of register space. How much did you spend on that chip? Registers run at the same speed as the CPU, so when you see a 3 GHz chip, the register is capable of acting on things 3 billion times a second. However, accessing main RAM is done at an often sluggish 133 MHz, though faster machines can push that to 400 MHz. Let's go with the faster RAM: 3 GHz / 400 MHz = 7.5; so under ideal conditions the CPU is executing more than 7 times faster than it can read data from RAM (meaning that without some games to play, the processor is forced to be idle some 85% of the time). To attempt to equalize things a bit CPU manufacturers (and sometimes mother board builders) will add small chunks of faster (and more expensive) memory closer to the CPU. These are called caches (cache is pronounced 'cash' for whatever reason (must be French)). Due to the way the hardware interacts with the cache, bigger is sometimes not better, so be cautious when reading the marketing brochures. The hardware then attempts to move the most likely to be needed bits from main RAM to one or another cache snuggled up near the CPU. Sometimes that works great; sometimes that actually results in slower performance. Your performance depends on how busy the machine is, how the OS schedules and allocates CPU processing time and how the hardware is programmed to manage the caches, all mixed in with the layout of your code and data and your data access patterns. Starting to get the idea it is really more about art than science yet?

Pipelining is how modern processors attempt to increase actual CPU utilization to greater than the above mentioned 5-30%. By arranging multiple instructions to execute in an overlapping sequence, it is often possible to reduce the average number of cycles it takes to execute a given series of instructions compared to executing each instruction independently. This is a great thing that can increase throughput by several times, but it can be easily defeated by having code that jumps around during execution. Most modern processors even execute instructions out of order and will speculatively execute both branches of a conditional (there are millions of extra circuits just to handle the sequencing of all this; it is totally transparent to the program (and programmer)). Something to keep in mind: most processors will stall (flush the pipeline, throw away all the speculatively executed instructions) if the code jumps outside of the current instruction cache. This can happen when an exception is thrown, a function call is made, or a jump is made forward of a looping construct. Minimizing the potential for a stall is one way to increase performance.
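
As one small illustration (a sketch, not a recipe; the function and variable names are made up): keeping branches out of hot loops gives the pipeline longer uninterrupted runs of straight-line code. Good compilers will often do this rewrite (loop unswitching) for you at higher optimization levels, which is another reason to check what the compiler actually generated before doing it by hand.

    /* The test inside the hot loop is re-evaluated (and predicted) on every
       single iteration. */
    void scale_branchy(const int *in, int *out, int n, int fast_mode) {
        for (int i = 0; i < n; ++i) {
            if (fast_mode) out[i] = in[i] * 2;
            else           out[i] = in[i] * 3;
        }
    }

    /* Loop-invariant test hoisted out (loop unswitching): each loop body is
       now branch-free straight-line code. */
    void scale_unswitched(const int *in, int *out, int n, int fast_mode) {
        if (fast_mode)
            for (int i = 0; i < n; ++i) out[i] = in[i] * 2;
        else
            for (int i = 0; i < n; ++i) out[i] = in[i] * 3;
    }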

Turn on compiler optimization!

I hate to make obvious statements, but if you are testing the performance of your program in debug mode you are a bonehead. Compilers deliberately follow the layout of the code you actually wrote when in debug mode so that it is comprehensible to you when you view it in the debugger. Even moderate levels of compiler optimization can result in order of magnitude increases in performance (and often completely incomprehensible code). There are generally several levels of optimization; you want to be sure to have inlining turned on at a minimum. Function call overhead is rather expensive, and the whole point of compiler optimization is so that you can write nice streamlined, easy to read and maintain code without sacrificing performance in production. The use of inline functions allows you to have the best maintainability without wasting cycles on senseless function call overhead. In the world of classes, good OOD calls for the use of 'getters' and 'setters' so that private data stays private. That would result in glacial performance if it were not for inlining; with inlining there is no performance degradation and you still get better control over your data members.
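
With gcc/g++, for example, the difference is just a flag or two (file names here are placeholders):

    g++ -g -O0 myprog.cpp -o myprog_debug   # debug build: symbols, no optimization, no inlining
    g++ -O2 -DNDEBUG myprog.cpp -o myprog   # typical release build: optimization and inlining on
    g++ -O3 myprog.cpp -o myprog            # more aggressive (includes -finline-functions)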

Now there can be too much of a good thing. A lot of the higher levels of optimization are only for very specific coding constructs and can actually result in slower code (or even subtle bugs) if not used properly. Always test every change you make, even changes in compiler optimization flags, and validate that your results are consistent (and correct!). There is a potential penalty for inlining code as well. Since the body of the inlined function is copied to the location in the code where the function was being called, it is quite possible to wind up with a massive binary executable if done without testing the tradeoffs. The C++ getters and setters should always be inlined (they are often very few instructions anyway, quite likely fewer instructions than what would be needed for a function call, potentially resulting in a smaller binary) as should any function/method that is used in the 5-10% of your code that is doing 90-95% of the work, but everything must be tested, ideally using code profiling as described below.
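
A getter/setter defined in the class body is implicitly inline, so with optimization turned on a call like s.set(5) typically collapses to a single store; a trivial sketch:

    class Sample {
    public:
        int  get() const { return value_; }   // implicitly inline: no call overhead when optimized
        void set(int v)  { value_ = v; }
    private:
        int value_;
    };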

Do code profiling

If you have access to a code profiler, use it before and during optimization. There are two basic types: statistical sampling and absolute measurement. The sampler will interrupt your program from time to time and see what routine it is executing. It is possible to wind up with misleading results if you have a lot of routines of various sizes and complexity (your program may actually be spending its time inside a very small routine, for instance, but keep getting sampled in another, larger routine). The absolute measurements are taken by instrumenting your code to report information, which can itself change the patterns of instruction and memory access. Whichever type you use, run it just before making changes and just after making changes, on a data set as close to real-world as possible, on a machine that is otherwise unloaded. It is best to take the time to understand why any change you made had the effect it did (even if it is nothing); otherwise you are just randomly making changes and hoping for the best. It is not unreasonable to take weeks to find the right combination (which might be the wrong combination with a different compiler/OS/hardware combination!); code optimization is not for the faint of heart.

Nitty gritty low-level things to think about

The above mentioned issues are basically language, compiler, OS and hardware independent. To tailor optimization for a particular machine setup there are further things to do. For instance, if you are accessing a large section of memory (that won't fit in your caches), changing your patterns of access so that you minimize cache misses can result in order of magnitude performance increases. Regular cache misses are known as 'thrashing the cache' and this happens at all levels of caching (from disk to level 1). If your pattern of memory access requires data that is not in the cache, you will flush some old data out of the cache. If, as soon as you have done so, you need that old data again, you have to bring it back into the cache (and of course flush something else out). Since you lose many (hundreds, possibly thousands) of cycles each time you do that, you could easily spend 99% of your time swapping data in and out of the cache instead of actually doing processing. On older machines (and heavily loaded newer machines (and a lot of Windows boxes)) you see the disk cache being thrashed whenever large processes are competing (a lot of Windows programs are terribly bloated (and, for whatever genius design consideration, even closed/deleted programs are retained in cache)). I am sure most people on Windows have experienced the critical point where so many applications are open that simply changing between them results in several seconds where your desktop appears frozen; that is when the cache is being exchanged between RAM and disk. That same apparent 'freezing' phenomenon happens when the instruction and data caches in the CPU are full. A whole lot of nuthin goin on. To address this problem you can pattern memory accesses so that they keep within the cache limits (caches have pages (or blocks); the size of the page gets smaller the closer you get to the CPU (you could consider the register a word-sized page if you like)). That may require modifying how you loop through data structures, splitting massive data structures so you don't need to drag the entire thing into cache if you just need a little bit, sequentially completing processing on a given data structure before you move on to another, etc. There are caches for instructions just like caches for data and you can see the exact same problem if your program is jumping all over its code. The point of pipelining instructions (for quasi-parallel execution) is to increase throughput; if your pipeline stalls because you are constantly executing in different parts of your program you can wind up thrashing the instruction cache. It is rare, but still a very real potential problem.
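
The classic illustration of access patterns is walking a large two-dimensional array. C stores arrays row by row, so the two functions below touch exactly the same data yet can differ in runtime by several fold on a machine with small caches (the array size here is just an example):

    #define N 2048
    static double m[N][N];

    /* Column-order walk: consecutive accesses are N*sizeof(double) bytes
       apart, so nearly every access pulls in a fresh cache line and evicts
       one that will be needed again shortly (thrashing). */
    double sum_columns(void) {
        double s = 0.0;
        for (int j = 0; j < N; ++j)
            for (int i = 0; i < N; ++i)
                s += m[i][j];
        return s;
    }

    /* Row-order walk: accesses are sequential, so every byte of each cache
       line fetched from RAM gets used before the line is evicted. */
    double sum_rows(void) {
        double s = 0.0;
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                s += m[i][j];
        return s;
    }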

Some C and C++ performance tips

You astute observers may have realized that you have read all this stuff about programming optimization from a C/C++ coder and I have yet to mention a single thing specific to C/C++. That is because you should not be considering code optimization until you have exhausted all of the above (well, modifying patterns of memory access will require changing your code). If you have analyzed your algorithm and are convinced there is no way to improve it, have your compiler optimization turned on, and are positive you have a CPU bound process, then and only then are you warranted in starting to look at the code. There is plenty of potential here; I have been able to get order of magnitude increases in performance using the techniques below, but never start optimization by looking at the code (unless your boss likes you wasting your time).

Thanks for some suggestions from Scorpions4ever and infamous41md!

Inlined assembler

Generally speaking, if you have to resort to writing in-lined assembler you are now stuck with a particular compiler (and version), OS and hardware combination. Be sure the payoff is there before using it. If you do use it, it should be the absolute minimum (there are fewer and fewer people who can even read assembler each day) and it is best to have the plain C/C++ code right next to it in comments.
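
As an example of the 'keep the plain C next to it' advice, here is a sketch reading the CPU's cycle counter; this assumes GCC's extended asm syntax and an x86 processor, and most compilers now offer an __rdtsc() intrinsic that is usually the better choice than hand-written asm:

    /* Read the CPU cycle counter (x86, GCC extended asm).
       Plain C equivalent: use the compiler's __rdtsc() intrinsic. */
    static unsigned long long read_cycle_counter(void) {
        unsigned int lo, hi;
        asm volatile ("rdtsc" : "=a" (lo), "=d" (hi));  /* result lands in EDX:EAX */
        return ((unsigned long long)hi << 32) | lo;
    }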

Check the assembler output

Sometimes seemingly trivial changes in your source can have dramatic effects on how the compiler optimizes things. Though you can't tie overall throughput 1:1 to instruction counts (pipelining doing what it does and all), generally speaking if your changes result in more instructions, they probably result in slower code. (As an aside, some of the older instructions can actually result in better throughput due to pipelining when compared to the newer, supposedly faster, instructions, so using a lot of older instructions instead of a few newer instructions occasionally pays off.)
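
With the GNU tools there are two easy ways to look (the program and file names are placeholders): ask the compiler for the assembly directly, or disassemble the finished binary.

    gcc -O2 -S file.c            # writes the generated assembly to file.s
    objdump -d ./myprog | less   # disassemble the linked executable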

Performance and executable size

I have noticed that many developers have the mistaken idea that executable size has a relationship with performance. This may have been true in the 'old days' when the entire executable had to be loaded into memory prior to execution (which would only be a start-up penalty, btw), but with a modern operating system's demand paging, only those bits of the program that are actually executed are ever loaded into memory. Additionally, the OS will cache pages and share them between processes (this is how DLLs work), so if you launch more than one instance of your program there is essentially no startup cost associated with reading from disk (indeed, unless you do something to flush the OS cache, until you reboot the computer it is likely that the OS has retained an image of your program, making the second and all subsequent startups free of disk access lag). Additionally, selecting the optimal instructions for performance often results in an increased number of them (hence an increase in the size of the binary), and of course loop unrolling and in-lining also 'bloat' the binary. As mentioned above, performance is generally not limited by executing the program's instructions; it is often limited by file/network IO, so shrinking your executable in an attempt to increase performance is typically a total waste of time.

Unless you are writing to run on embedded systems, where bytes often count (but continue to count less and less as memory gets cheaper by the day), worrying about exe size is very seldom a valuable use of your time. In the unlikely event that your exe, when zipped up, fails to fit on a floppy (does anyone still use floppies? (note that speed optimized binaries tend to compress very well)), then you may want to investigate some alternatives like dynamic linking (only useful if your customers have already received the libs, else you have to distribute them anyway). Of course, you could always pony up the few cents it takes (often as little as, or even less than, a floppy) to get a CD and burn it, giving you at least 650 MB (uncompressed) of space; heck, you can even fit a few MS programs on a single CD! If you are distributing your program via the web, certainly having a huge file is problematic for those unfortunates still stuck with dialup (or worse), but you need to first be sure that your efforts are going to be worthwhile. How much can you expect to shrink your program? If you are using a lot of third-party libraries (like most anyone doing MFC/.Net is going to do), you may have very little control over your exe size except for making the choice of static vs. dynamic linking. Static linking is great because you never have to worry about the libraries your program needs changing unexpectedly, helping to protect you from DLL hell.

With respect to minimizing code to try to fit the program in the instruction cache, the entire program need not fit in the instruction cache to see an increase in performance! Only the bits that are actually executing in a repetitive manner need to be in the cache (recall the 10/90 rule); the rest could even be on disk for all the impact on performance it will have. Because the instruction cache will be empty on initial load of the program (instruction caches are not, to my knowledge, shared amongst processes, only betwixt threads), if your program is only using a given cache line's worth of instructions once, you won't get any significant performance gain unless you have a huge number of instructions that are executed only once (thus, reducing the number of instructions will reduce the number of cache misses, potentially (though very unlikely, due to other IO dependencies) reducing the execution time).

You can reduce the size of your exe with compiler switches that strip out unneeded symbol information, bias the instruction selection toward smaller code, eliminate loop unrolling and in-lining (you will notice, I hope, that minimizing exe size almost always means eliminating run-time optimizations, resulting in slower execution), use dynamic linking, etc. Alternatively, going with the 10/90 rule, you can find the portions of your code that are compute bound, pull them into a separate file, optimize the compilation of that file and use these other tricks to reduce the size of the rest of the executable. It can be a lot of extra work and is seldom going to result in any measurable change in performance. However, if your distribution channel requires minimal executable sizes, this is a way to keep most of your cake and eat it too.
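
With gcc the size-oriented knobs look roughly like this (a sketch with made-up file names; check your own compiler's documentation):

    gcc -Os prog.c -o prog    # optimize for size rather than speed (avoids size-increasing transformations)
    strip prog                # remove symbol information from the finished binary
    gcc -O2 -c hotspot.c      # while the compute-bound 10/90 file still gets compiled for speed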

There are some "old wives' tales" about how to reduce the size of your binaries. While older compilers did lack a lot of sophistication, unless you are stuck with some out-of-date compiler (and why, when excellent free compilers are available for download?), dead code (code that can never be executed) is eliminated, so including a bunch of unused #include files will have no significant impact on the size of your program. Smart linkers (same caveat as above) will also eliminate chunks of the program if they can detect that the code is not going to be used. The use of statically allocated memory vs. dynamic memory will also show no significant change in exe size or in performance. Indeed, if you are not careful with the use of statically allocated memory you can even increase the size of your binary (if you initialize your static memory at compile time, the compiler must make room for it in the binary). It is true that typical compilers put a whole lot of code in the program start up that can be eliminated with careful use of declarations (if you don't use environment variables or command-line arguments, your program doesn't need the code for that), but the reduction in code is seldom meaningful for performance. Keep in mind that modern OSs read disks in pages (generally at least 2K, sometimes 4K or more), so if your binary is less than a page you get absolutely no benefit in attempting to shrink it further.

A last note: if you are looking at the size of your executable when compiled in Debug mode, slap yourself upside the head, recompile in Release mode and take another look. Debug binaries, first of all, have essentially no optimization turned on (of any sort) and second, are chock full of symbols to aid in stepping through and analyzing your code. I have seen a two orders of magnitude decrease in size going from Debug to Release mode when working with the Visual C++ compilers. Besides, you should never distribute a Debug binary: it requires the Debug DLLs to run, which requires the installation of the C++ compiler, probably not what your customers are interested in.

The Good book

After quite a bit of nagging from infamous41md I got the book "Computer Systems: A Programmer's Perspective" by Randal E. Bryant and David R. O'Hallaron. I have several optimization books on my bookshelf, but this is the best by far. Though it is not specifically about optimization (it does cover it a great deal, though), the background on how and why things work is indispensable for anyone interested in optimization. It specifically targets C with a sprinkling of C++, but most programmers should be able to at least read and follow the examples, so even if you are a Perl or PHP programmer I still think it is dollars well spent to get this book.
Keith (mitakeet) Oxenrider
October 4, 2004
koxenrider[at]sol[dash]biotech[dot]com
http://sol-biotech.com/code/PerformanceProgramming.html