C++ low-level optimization tips

tenpn

    Assuming you already have the best-choice algorithm, what low-level solutions can you offer for squeezing the last few drops of sweet sweet frame rate out of C++ code?

    It goes without saying that these tips only apply to that critical code section that you've already highlighted in your profiler, but they should be low-level non-structural improvements. I've seeded an example.

  • Remove unnecessary branches

    On some platforms and with some compilers, branches can throw away your whole pipeline, so even insignificant if() blocks can be expensive.

    The PowerPC architecture (PS3/x360) offers the floating-point select instruction, fsel. This can be used in the place of a branch if the blocks are simple assignments:

    float result = 0;
    if (foo > bar) { result = 2.0f; }
    else { result = 1.0f; }
    

    Becomes:

    float result = fsel(foo-bar, 2.0f, 1.0f);
    

    When the first parameter is greater than or equal to zero, the second parameter is returned; otherwise the third.

    The price of losing the branch is that both the if{} and the else{} values are computed, so if either side is an expensive operation or dereferences a NULL pointer, this optimisation is not suitable.

    Sometimes your compiler has already done this work, so check your assembly first.

    Here's more information on branching and fsel:

    http://assemblyrequired.crashworks.org/tag/intrinsics/
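
    For platforms without a native fsel instruction, the same shape can be sketched in portable C++. This is only a sketch of the semantics - the fsel helper below is an assumed stand-in for the real PowerPC intrinsic, and a plain ternary like this may or may not compile to a conditional move, so check your assembly:

```cpp
#include <cassert>

// Hypothetical portable stand-in for the PowerPC fsel intrinsic:
// returns b when a >= 0, else c. On PPC the real intrinsic is a
// single branchless instruction; this version only mimics the
// semantics and the compiler may still emit a branch for it.
inline float fsel(float a, float b, float c)
{
    return (a >= 0.0f) ? b : c;
}

// The branchy original from the answer above...
inline float pick_branchy(float foo, float bar)
{
    float result;
    if (foo > bar) { result = 2.0f; }
    else           { result = 1.0f; }
    return result;
}

// ...and the branchless rewrite. Note the edge case: fsel selects on
// >= 0, so foo == bar takes the "true" arm here, unlike the strict >
// comparison in the branchy version.
inline float pick_fsel(float foo, float bar)
{
    return fsel(foo - bar, 2.0f, 1.0f);
}
```

    The foo == bar discrepancy is worth calling out in a comment whenever you make this transformation, since it silently changes behaviour on exact equality.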

  • Use SIMD (e.g. via SSE), if you don't already. Gamasutra has a nice article on this. You can download the source code of the presented library at the end of the article.

  • Don't overlook your compiler -- if you are using gcc on Intel, you could easily get a performance gain by switching to the Intel C/C++ Compiler, for example. If you are targeting an ARM platform, check out ARM's commercial compiler. If you are on the iPhone, Apple just allowed Clang to be used starting with the iOS 4.0 SDK.

    One issue that you will probably run into with optimization, especially on x86, is that a lot of intuitive things end up working against you on modern CPU implementations. Unfortunately for most of us, the ability to out-optimize the compiler is long gone. The compiler can schedule instructions in the stream based on its own internal knowledge of the CPU. In addition, the CPU can re-schedule instructions based on its own needs. Even if you think of an optimal way to arrange a method, chances are the compiler or CPU has already come up with it on its own and has already performed that optimization.

    My best advice would be to ignore the low-level optimizations and focus on the higher level ones. The compiler and CPU can't change your algorithm from an O(n^2) to an O(1) algorithm, no matter how good they get. That's going to require you to look at exactly what you are trying to do and find a better way to do it. Let the compiler and CPU worry about the low level and you focus on the mid to high levels.

  • Remove unnecessary virtual function calls

    The dispatch of a virtual function can be very slow. This article gives a good explanation of why. If possible, avoid them for functions that are called many times per frame.

    You can do this in a couple of ways. Sometimes you can just rewrite the classes to not need inheritance - maybe it turns out that MachineGun is the only subclass of Weapon, and you can amalgamate them.

    You can use templates to replace run-time polymorphism with compile-time polymorphism. This only works if you know the concrete type of your objects at compile time, and it can be a major rewrite.
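
    The template approach can be sketched with the curiously recurring template pattern (CRTP). The Weapon/MachineGun names follow the answer above; the damage() interface is invented for illustration:

```cpp
#include <cassert>

// Runtime polymorphism: every call goes through the vtable.
struct Weapon
{
    virtual ~Weapon() {}
    virtual int damage() const = 0;
};

struct MachineGun : Weapon
{
    int damage() const override { return 5; }
};

// Compile-time polymorphism (CRTP): the concrete type is known at
// compile time, so the call resolves statically and can be inlined
// with no vtable lookup.
template <typename Derived>
struct WeaponT
{
    int damage() const
    {
        return static_cast<const Derived*>(this)->damageImpl();
    }
};

struct MachineGunT : WeaponT<MachineGunT>
{
    int damageImpl() const { return 5; }
};

// Code that used to take a Weapon& becomes a template, so it works
// with either style:
template <typename W>
int totalDamage(const W& w, int shots)
{
    int total = 0;
    for (int i = 0; i < shots; ++i)
        total += w.damage();  // static dispatch for WeaponT types
    return total;
}
```

    The cost is that totalDamage must now be a template, which is part of why the answer calls this a major rewrite.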

  • Optimise your data layout! (This applies to more languages than just C++)

    You can go pretty deep making this specifically tuned for your data, your processor, handling multi-core nicely, etc. But the basic concept is this:

    When you are processing things in a tight loop, you want to make the data for each iteration as small as possible, and as close together as possible in memory. That means the ideal is an array or vector of objects (not pointers) that contain only the data necessary for the calculation.

    This way, when the CPU fetches the data for the first iteration of your loop, the next several iterations worth of data will get loaded into the cache with it.

    Really, the CPU is fast and the compiler is good: there's not much to gain from simply using fewer and faster instructions. Cache coherence is where it's at (that's a random article I Googled - it contains a good example of getting cache coherency for an algorithm that doesn't simply run through data linearly).
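
    A minimal sketch of the "array of objects containing only the necessary data" idea. The particle scenario and all field names here are invented for illustration:

```cpp
#include <cassert>
#include <vector>

// Cache-unfriendly: each iteration drags the debug name, texture
// pointer, etc. through the cache alongside the three floats the
// loop actually needs.
struct FatParticle
{
    char  debugName[64];
    void* texture;
    float x, y, z;
};

// Cache-friendly: a contiguous array holding only the loop's data.
// A 64-byte cache line now holds ~5 particles instead of part of one.
struct HotParticle
{
    float x, y, z;
};

void integrate(std::vector<HotParticle>& ps, float vx, float vy, float vz)
{
    for (std::size_t i = 0; i < ps.size(); ++i)
    {
        // Sequential access: fetching ps[0] pulls the next several
        // particles into cache, and the hardware prefetcher streams
        // the rest of the array ahead of the loop.
        ps[i].x += vx;
        ps[i].y += vy;
        ps[i].z += vz;
    }
}
```

    The same split is often described as separating "hot" (per-frame) data from "cold" (rarely touched) data.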

  • A very, very low-level tip, but one that can come in handy:

    Most compilers support some form of explicit conditional hinting. GCC has a function called __builtin_expect which lets you inform the compiler what the value of a result probably is. GCC can use that data to optimize conditionals to perform as quickly as possible in the expected case, with slightly slower execution in the unexpected case.

    if(__builtin_expect(entity->extremely_unlikely_flag, 0)) {
      // code that is rarely run
    }
    

    I've seen a 10-20% speedup with proper use of this.

  • My basic principle is: don't do anything that's not necessary.

    If you've found that a particular function is a bottleneck, you could optimize the function -- or you could try to keep it from being called in the first place.

    This does not necessarily mean you're using a bad algorithm. It might mean that you are running calculations every frame that could be cached for a short while (or entirely precalculated), for example.

    I always try this approach before any attempts at really low-level optimization.
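
    A small sketch of the caching idea: recompute only when the inputs change, instead of every frame. The enemy-distance scenario and all names are invented for illustration:

```cpp
#include <cassert>
#include <cmath>

struct Enemy
{
    float x, y;
    bool  dirty;        // set whenever the position changes
    float cachedDist;   // cached distance from the origin

    Enemy(float px, float py)
        : x(px), y(py), dirty(true), cachedDist(0.0f) {}

    void moveTo(float px, float py)
    {
        x = px;
        y = py;
        dirty = true;   // invalidate; don't recompute yet
    }

    float distFromOrigin()
    {
        if (dirty)      // the sqrt only runs when the position moved,
        {               // not on every query per frame
            cachedDist = std::sqrt(x * x + y * y);
            dirty = false;
        }
        return cachedDist;
    }
};
```

    If the enemy moves on 1 frame in 10 but the distance is queried every frame, the expensive sqrt now runs a tenth as often - the bottleneck function was never optimized, it was just called less.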

  • Reduce boolean expression evaluation

    This one is really desperate, as it's a very subtle but dangerous change to your code. However if you have a conditional that is evaluated an inordinate number of times, you can reduce the overhead of boolean evaluation by using bitwise operators instead. So:

    if ((foo && bar) || blah) { ... } 
    

    Becomes:

    if ((foo & bar) | blah) { ... }
    

    This uses integer arithmetic instead of boolean logic. If your foos and bars are constants or evaluated before the if(), this could be faster than the normal boolean version.

    As a bonus, the arithmetic version has fewer branches than the regular boolean version - which is another way to optimize.

    The big downside is that you lose short-circuit evaluation - the whole expression is evaluated, so you can't do foo != NULL & foo->dereference(). Because of this, it's arguable that this is hard to maintain, and so the trade-off may be too great.

  • Keep an eye on your stack usage

    Everything you add to the stack is an extra push and construction when a function is called. When a large amount of stack space is required, it can sometimes be beneficial to allocate working memory ahead of time, and if the platform you're working on has fast RAM available for use - all the better!
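
    A sketch of the "allocate working memory ahead of time" idea - the Scratch class, the 16KB size, and the function shape are all invented for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One up-front allocation, reused across calls. On consoles this
// buffer could be placed in fast scratchpad RAM where available.
class Scratch
{
public:
    explicit Scratch(std::size_t bytes) : buffer_(bytes) {}

    unsigned char* get()        { return buffer_.data(); }
    std::size_t    size() const { return buffer_.size(); }

private:
    std::vector<unsigned char> buffer_;
};

// The hot function borrows the scratch space instead of declaring
// something like "unsigned char tmp[16384]" on the stack every call
// (or worse, heap-allocating per call).
int sumBytes(Scratch& scratch, const unsigned char* src, std::size_t n)
{
    unsigned char* tmp = scratch.get();
    int total = 0;
    for (std::size_t i = 0; i < n && i < scratch.size(); ++i)
    {
        tmp[i] = static_cast<unsigned char>(src[i] + 1); // stand-in "work"
        total += tmp[i];
    }
    return total;
}
```

    The obvious caveat: a shared scratch buffer is not re-entrant, so this only works when the function isn't called recursively or from multiple threads at once.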

  • Use Compiler Intrinsics.

    Ensure that the compiler is generating the most efficient assembly for certain operations by using intrinsics - constructs that look like function calls that the compiler turns into optimized assembly:

    Here's a reference for Visual Studio, and here's one for GCC
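
    As a concrete sketch of the idea, here is a bit-counting routine written both portably and with the GCC/Clang __builtin_popcount intrinsic (MSVC's equivalent is __popcnt from <intrin.h>). The function names are invented for illustration:

```cpp
#include <cassert>

// Portable version: one iteration per set bit.
inline int popcountLoop(unsigned int v)
{
    int count = 0;
    while (v)
    {
        v &= v - 1;   // clears the lowest set bit
        ++count;
    }
    return count;
}

// Intrinsic version: on hardware with a population-count instruction
// (e.g. x86 POPCNT), this can compile down to a single opcode.
inline int popcountIntrinsic(unsigned int v)
{
    return __builtin_popcount(v);
}
```

    As with fsel, check the generated assembly: whether the intrinsic beats the loop depends on the target CPU and compiler flags.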

  • First step: Think carefully about your data in relation to your algorithms. O(log n) is not always faster than O(n). Simple example: A hash table with only a few keys is often better replaced with a linear search.

    Second step: Look at the assembly generated. C++ brings a lot of implicit code generation to the table. Sometimes, it sneaks up on you without you knowing.

    But assuming it's really pedal-to-the-metal time: Profile. Seriously. Randomly applying "performance tricks" is about as likely to hurt as it is to help.

    Then, everything depends on what your bottlenecks are.

    data cache misses => optimize your data layout. Here's a good starting point: http://gamesfromwithin.com/data-oriented-design

    code cache misses => Look at virtual function calls, excessive callstack depth, etc. A common cause of bad performance is the erroneous belief that every base-class function must be virtual.

    Other common C++ performance sinks:

    • Excessive allocation/deallocation. If it's performance critical, don't call into the runtime. Ever.
    • Copy construction. Avoid wherever you can. If it can be a const reference, make it one.

    All of the above are immediately obvious when you look at the assembly, so see above ;)
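
    The two sinks above can be sketched briefly - the function names here are invented for illustration:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Copy construction on every call: the whole vector (and every
// string in it) is duplicated just to be read.
int countLongSlow(std::vector<std::string> names)          // by value
{
    int n = 0;
    for (std::size_t i = 0; i < names.size(); ++i)
        if (names[i].size() > 8) ++n;
    return n;
}

// Const reference: no copy, identical body.
int countLongFast(const std::vector<std::string>& names)   // by const ref
{
    int n = 0;
    for (std::size_t i = 0; i < names.size(); ++i)
        if (names[i].size() > 8) ++n;
    return n;
}

// Excessive allocation: reserve once up front instead of letting
// push_back reallocate (and copy everything) repeatedly as it grows.
std::vector<int> makeSquares(int count)
{
    std::vector<int> out;
    out.reserve(count);            // one allocation instead of many
    for (int i = 0; i < count; ++i)
        out.push_back(i * i);
    return out;
}
```

    Both fixes show up immediately in the assembly: the by-value version calls the vector and string copy constructors, and the unreserved version calls into the allocator multiple times.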

  • Avoid memory accesses and especially random ones at all costs.

    That's the single most important thing to optimize for on modern CPUs. You can do a shitload of arithmetic and even a lot of mispredicted branches in the time you wait for data from RAM.

    You can also read this rule the other way around: do as many calculations as possible between memory accesses.

  • The first thing you need to understand is the hardware you're running on. How does it handle branching? What about caching? Does it have a SIMD instruction set? How many processors can it use? Does it have to share processor time with anything else?

    You may solve the same problem in very different ways - even your choice of algorithm should be dependent on the hardware. In some cases O(N) can run slower than O(NlogN) (depending on implementation).

    As a crude overview of optimisation, the first thing I would do is look at exactly what problems and what data you are trying to solve for. Then optimise for that. If you want extreme performance then forget about generic solutions - you can special case everything that doesn't match your most used case.

    Then profile. Profile, profile, profile. Look at memory usage, look at branching penalties, look at function call overhead, look at pipeline utilisation. Work out what is slowing down your code. It's probably data access (I wrote an article called "The Latency Elephant" about the overhead of data access - google it), so closely examine that and then optimise your data layout (nice big flat homogeneous arrays are awesome) and data access (prefetch where possible).

    Once you've minimised the overhead of the memory subsystem, try and determine if instructions are now the bottleneck (hopefully they are), then look at SIMD implementations of your algorithm - Structure-of-Arrays (SoA) implementations can be very data and instruction cache efficient. If SIMD isn't a good match for your problem then intrinsics and assembler level coding may be needed.

    If you still need more speed then go parallel. If you have the benefit of running on a PS3 then the SPUs are your friends. Use them, love them. If you've already written a SIMD solution then you'll get a massive benefit moving to SPU.

    And then, profile some more. Test in game scenarios - is this code still the bottleneck? Can you change the way this code is used at a higher level to minimise its usage (actually, this should be your first step)? Can you defer calculations over multiple frames?

    Whatever platform you're on, learn as much as you can about the hardware and the profilers available. Don't assume that you know what the bottleneck is - find it with your profiler. And make sure you have a heuristic to determine if you have actually made your game go faster.

    And then profile it again.

  • Const everything!

    The more information you give the compiler about the data the better the optimizations are (at least in my experience).

    void foo(Bar * x) {...;}
    

    becomes;

    void foo(const Bar * const x) {...;}
    

    The compiler now knows that the pointer x is not going to change and that the data it is pointing to will not change either.

    The other added benefit is that you can reduce the number of accidental bugs, stopping yourself (or others) modifying things that they shouldn't.

  • Most often, the best way to gain performance is to change your algorithm. The less general the implementation the closer you can get to the metal.

    Assuming that has been done....

    If it's really critical code indeed, try to avoid memory reads, try to avoid calculating stuff that can be precalculated (though no lookup tables as they violate rule number 1). Know what your algorithm does and write it in a way that the compiler knows it too. Check the assembly to be sure it does.

    Avoid cache misses. Batch process as much as you can. Avoid virtual functions and other indirections.

    Ultimately, measure everything. The rules change all the time. What used to speed up code 3 years ago now slows it down. A nice example is 'use double math functions instead of float versions'. I wouldn't have realized that one if I hadn't read it.

    I forgot - don't have default constructors initialize your variables, or if you insist, at least also provide constructors that don't. Be aware of the things that don't show up in the profiler: when you lose one unnecessary cycle per line of code, nothing registers as a spike, but you lose a whole lot of cycles overall. Again, know what your code is doing. Make your core function lean instead of foolproof. Foolproof versions can be called if needed, but are not always needed. Versatility comes at a price - performance being one.

    Edited to explain why no default initialization: A lot of code says: Vector3 bla; bla = DoSomething();

    The initialization in the constructor is wasted time. In this case the wasted time is small (probably just clearing the vector), but if your programmers do this habitually it adds up. Also, a lot of functions create a temporary (think overloaded operators) that gets initialized to zero and then assigned straight away - hidden lost cycles that are too small to show a spike in your profiler, but that bleed cycles all over your code base. Some people also do a lot more in constructors (which is obviously a no-no); I've seen multi-millisecond gains from an unused variable whose constructor happened to be a bit on the heavy side. As soon as the constructor causes side effects, the compiler won't be able to optimize it out, so unless you never write code like the above, I prefer either a non-initializing constructor, or, as I said, a version of the constructor that doesn't initialize.

    Vector3 bla(noInit); bla = doSomething();
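
    One common way to get such a constructor is a tag type, sketched below. The Vector3 layout and the NoInit/noInit names are assumptions matching the snippet above, not any particular engine's API:

```cpp
#include <cassert>

struct NoInit {};
static const NoInit noInit = {};

struct Vector3
{
    float x, y, z;

    // Normal default constructor: zero-initializes, costing cycles
    // even when the value is overwritten immediately afterwards.
    Vector3() : x(0.0f), y(0.0f), z(0.0f) {}

    Vector3(float px, float py, float pz) : x(px), y(py), z(pz) {}

    // Tag-dispatched constructor: deliberately leaves the members
    // indeterminate, for the "Vector3 bla; bla = DoSomething();"
    // pattern. Reading x/y/z before assigning is undefined behaviour,
    // so this is strictly an opt-in for hot code.
    explicit Vector3(NoInit) {}
};

// Stand-in for the DoSomething() from the answer above.
Vector3 doSomething() { return Vector3(1.0f, 2.0f, 3.0f); }
```

    Usage matches the snippet in the answer: Vector3 bla(noInit); bla = doSomething(); - no wasted zeroing before the assignment.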

  • The restrict keyword is potentially handy, especially in cases where you need to manipulate objects with pointers. It allows the compiler to assume the pointed-to object is not going to get modified in any other way which in turn allows it to perform more aggressive optimisation such as keeping parts of the object in registers or reordering reads and writes more effectively.

    One good thing about the keyword is that it's a hint which you can apply once and see benefits from without rearranging your algorithm. The bad side is that if you use it in the wrong place, you might see data corruption. But usually it's quite easy to spot where it's legitimate to use it - it's one of the few examples where the programmer can reasonably be expected to know more than the compiler can safely assume, which is why the keyword has been introduced.

    Technically 'restrict' doesn't exist in standard C++, but platform-specific equivalents are available for most C++ compilers, so it's worth considering.

    See also: http://cellperformance.beyond3d.com/articles/2006/05/demystifying-the-restrict-keyword.html
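
    A minimal sketch using the __restrict spelling accepted by GCC, Clang, and MSVC (standard C99 has restrict; standard C++ does not). The function is invented for illustration:

```cpp
#include <cassert>
#include <cstddef>

// Promising the compiler that dst and src never alias lets it keep
// loaded values in registers, reorder loads and stores, and vectorize
// without reload-after-store checks.
void scaleAdd(float* __restrict dst,
              const float* __restrict src,
              float scale,
              std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        dst[i] += src[i] * scale;   // a store to dst[i] can't change
                                    // any src[j], so src loads needn't
                                    // be re-issued after each store
}
```

    The corresponding hazard from the answer above: once __restrict is applied, calling scaleAdd with overlapping dst and src ranges is undefined behaviour and will silently corrupt data.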

  • Minimize dependency chains to make better use of the CPU pipeline.

    In simple cases the compiler may do this for you if you enable loop unrolling. However, it often won't, especially when floats are involved, since reordering the expressions changes the result.

    Example:

    float *data = ...;
    int length = ...;
    
    // Slow version: every addition depends on the previous one,
    // so the loop is serialised on the floating-point add latency.
    float total = 0.0f;
    int i;
    for (i = 0; i < length; i++)
    {
      total += data[i];
    }
    
    // Fast version: four independent accumulators can be in flight
    // at once, hiding the add latency.
    float total1 = 0.0f, total2 = 0.0f, total3 = 0.0f, total4 = 0.0f;
    for (i = 0; i < length - 3; i += 4)
    {
      total1 += data[i];
      total2 += data[i+1];
      total3 += data[i+2];
      total4 += data[i+3];
    }
    total = 0.0f;  // restart for the leftover elements
    for (; i < length; i++)
    {
      total += data[i];
    }
    total += (total1 + total2) + (total3 + total4);
    

Tags
c++ optimization