Background Information

You may have read the page on Interpreting Results but still be unclear about what really goes on behind the scenes to cause the differences you have seen. This page is intended to give you some background on what happens and how it affects performance.

There are three different export tests in the PPBM5 benchmark.

Let's go through each case and explain what happens in terms of workload, so you understand what may be worthwhile to investigate for further improvement of your system. Keep in mind that when testing actual sequences with live footage, no single test will ever use just one component of interest, be that CPU, memory, disk or video card. It will always be a combination of all of these components working together, and the interaction between memory, CPU, GPU and disk can have non-obvious effects. In one test the emphasis will be on one component, in another test the emphasis will be on another component. This is what makes interpreting the results so difficult.

Disk Test, Export DV AVI to DV AVI

This is a simple test that uses plain MS DV AVI type II PAL clips, without effects, transitions or modifications, exported in the same format. It uses nearly 550 instances of the same clip, to be exported to one large AVI file. Because there is nothing modified at all, there is no MPE involvement.

All the work is done by the CPU, the memory and the disk(s).

Get the first clip instance from disk and store it in memory, get the next instance from disk and store it in memory, etc. This will fill up memory rapidly, so at a certain moment memory needs to be freed and written to disk. Then the next bunch of instances are handled and the process repeats itself. BTW, the multi-threading in this case is far from optimal, but is expected to be corrected in a future update.

The CPU has a relatively easy job. It is the supervisor that tells its subordinates (memory and disk(s)) what to do. In this case the disk(s) cause the waiting, because they are the slowest components in the chain. Memory is much faster and can easily keep up with the disk activity. MPE is out of a job on this task, sitting on the fence watching the other components work up a sweat.
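The idea that the slowest component in the chain sets the pace can be illustrated with a small toy model. This is only a sketch: the stage names and all throughput figures are hypothetical and are not PPBM5 measurements, and the model assumes that every stage simply waits for the slowest one.

```python
# Toy model: the slowest stage in the chain sets the export speed.
# All throughput figures below are hypothetical, not PPBM5 measurements.

def slowest_stage(throughput_mb_s):
    """Return (name, MB/s) of the stage with the lowest throughput."""
    name = min(throughput_mb_s, key=throughput_mb_s.get)
    return name, throughput_mb_s[name]

def export_seconds(total_mb, throughput_mb_s):
    """Estimate export time, assuming every stage waits for the slowest one."""
    _, rate = slowest_stage(throughput_mb_s)
    return total_mb / rate

# Hypothetical systems that differ only in disk throughput.
single_disk = {"cpu": 2000.0, "memory": 10000.0, "disk": 100.0}
raid_array = {"cpu": 2000.0, "memory": 10000.0, "disk": 400.0}

for label, stages in (("single disk", single_disk), ("Raid array", raid_array)):
    name, rate = slowest_stage(stages)
    seconds = export_seconds(10000.0, stages)
    print(f"{label}: bottleneck is {name} at {rate:.0f} MB/s, "
          f"exporting 10000 MB takes about {seconds:.0f} s")
```

In this simplified picture, making the CPU or the memory faster changes nothing as long as the disk stays the slowest stage; only a faster disk shortens the export.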

What is the lesson to be learned?

  • The faster the disk(s) and the larger the disk cache, the better. This is an area where large Raid arrays show their advantage, especially if helped with large cache memory on the Raid controller.

Of course more cores and a higher clock speed help too, but not as much as one would hope.

CPU / Memory Test, Export a mixed timeline to MPEG2-DVD

The source material is heavily mixed: it comprises DV AVI in PAL, HDV 1080i PAL, XDCAM-EX HQ PAL and AVCHD 1080i NTSC. The source is exported to MPEG2-DVD NTSC High Quality Widescreen. The timeline is loaded with effects and transitions, many of them keyframed with bezier curves, with up to 4 tracks in use.

This means that on export there is a lot of scaling and rendering, as well as field reversal from UFF to LFF for some sources. In all, a nightmarish timeline to export.

Time for the MPE to get off its exalted fence and get to work.

We are talking about compression to MPEG2, which is not very hard on the CPU. The material is only moderately compressed, so threading can resolve most of the processing required by the CPU in a few steps. The CPU then hands off the results to the RAM, which hands them over to the MPE for rendering and scaling, which in turn hands them back to RAM, from where they are written in bursts to the disk(s). However, while waiting for the disk(s) to finish writing, the RAM is also holding frames from the source material that the CPU still needs to process. The more memory is available, the more frames can be held there for faster processing.

The basic ingredients here are the amount of RAM, number of cores and the clock speed.

More RAM means more frames in memory, more cores mean faster processing by the CPU and faster emptying of the queue in RAM, and a higher clock speed means everything goes faster. If the amount of RAM is limited, the speed of the disk(s) can become the bottleneck.
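To get a feeling for why the amount of RAM matters, the number of frames that fit in memory can be estimated with simple arithmetic. The sketch below assumes uncompressed frames at 4 bytes per pixel, which is an illustrative assumption only; the frame format that is actually used internally may differ, and the amounts of free RAM are hypothetical.

```python
# Rough estimate of how many uncompressed frames fit in free RAM.
# Assumption: 4 bytes per pixel; the real internal frame format may differ.

BYTES_PER_PIXEL = 4  # hypothetical value, for illustration only

def frames_in_ram(free_ram_gib, width, height, bytes_per_pixel=BYTES_PER_PIXEL):
    """Return the number of whole frames that fit in the given amount of RAM."""
    frame_bytes = width * height * bytes_per_pixel
    free_bytes = int(free_ram_gib * 1024**3)
    return free_bytes // frame_bytes

# Hypothetical amounts of free RAM, in GiB.
for ram_gib in (8, 16):
    frames = frames_in_ram(ram_gib, 1920, 1080)
    print(f"{ram_gib} GiB free: about {frames} frames of 1920 x 1080")
```

Under these assumptions, doubling the free RAM roughly doubles the number of frames that can wait in memory for the CPU.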

The difference between hardware-assisted and software-only MPE encoding

If hardware assisted MPE is enabled, there is a lot of traffic from RAM to the CUDA card, to VRAM and back, which causes delays. For this test there is a lot of scaling and rendering going on and the CUDA card always uses maximum quality settings, in contrast to software only. So, software only leaves out the latency of RAM - GPU - VRAM communication, which means less overhead and it does not by default use maximum quality.

It shows the effectiveness of hardware CUDA/MPE that, despite the latency overhead and the maximum quality settings, the performance penalty is limited to 30 - 40% on very fast systems. The slower the system, the smaller the performance penalty, and on slow systems using CUDA/MPE may even become a benefit in both performance and quality.
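As a worked example of what such a penalty means, the sketch below expresses the hardware MPE export time as a percentage relative to the software-only time. The definition of the penalty used here is an assumption, and the timings are invented for illustration; they are not benchmark results.

```python
# Penalty of hardware MPE relative to software only, in percent.
# Assumption: penalty = extra export time relative to the software-only time.

def penalty_percent(hardware_seconds, software_seconds):
    """Positive result: hardware MPE is slower. Negative result: it is faster."""
    return 100.0 * (hardware_seconds - software_seconds) / software_seconds

# Hypothetical timings in seconds (hardware MPE, software only).
fast_system = penalty_percent(135.0, 100.0)
slow_system = penalty_percent(540.0, 600.0)

print(f"fast system: {fast_system:+.0f}%")
print(f"slow system: {slow_system:+.0f}%")
```

A positive value is a penalty for hardware MPE, a negative value a benefit.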

Adding memory makes a huge difference!

Performance gains of 50% or more when doubling memory are not uncommon.

The impact of clock speed decreases with more memory installed.

CPU / Memory Test, Export a mixed timeline to H.264-BR

The source material is heavily mixed: it comprises DV AVI in PAL, HDV 1080i PAL, XDCAM-EX HQ PAL and AVCHD 1080i NTSC. The source is exported to H.264-BR HDTV 1080i 29.97. The timeline is loaded with effects and transitions, many of them keyframed with bezier curves, with up to 4 tracks in use. This means that on export there is some scaling and rendering, as well as field reversal from LFF to UFF for some sources.

Here we are talking about H.264 compression, one of the most complex and taxing codecs for a computer these days. The main difference in comparison to MPEG2 is that the CPU takes many more steps to process data because of the more complex encoding. The CPU load is much higher and the threads take longer before data are handed off to RAM, even when the hyper-threading is particularly good. Meanwhile the next data to be processed are loaded into RAM. When the CPU has finished the first block of data, it hands it back to RAM and loads the next set of data. This comprises both algorithm data and frame data, so there is a lot of traffic on the road from the CPU via cache to RAM, and vice versa. Occasionally traffic may come to a halt and, just as with real traffic jams, the cause is not always clear.

It can be that all the logical cores are crunching along, or the cache on the CPU is depleted, or the memory controller needs a break, or the RAM is still waiting for data from the pagefile. There can be a whole lot of reasons.

In contrast to the MPEG2-DVD test, where all material needed to be scaled down from 1920 x 1080 or 1440 x 1080 to 720 x 480, this test only uses scaling for some DV and HDV material, so there is less handing off data packets to the GPU for MPE scaling, reducing the latency on the route from RAM to GPU to RAM to disk.

Because of the high compression efficiency of the H.264 codec, the discriminating factor is the speed with which algorithm data and frames are handed over to the CPU and its cache.

The basic ingredients here are the speed of RAM, number of cores and the clock speed.

H.264 encoding is like a highway, where the number of lanes represents the number of CPU cores, the speed limit represents the clock and memory speed, and the number of vehicles on the highway represents the amount of CPU cache used. The more lanes (cores), the better; a higher speed limit (clock and memory speed) can improve traffic flow; and the more vehicles (cache used), the bigger the chance of traffic jams.

Interchanges, which represent dual processor setups, can also cause traffic jams.

Dual processors need to continuously monitor each other and communicate about progress. It is like reducing the number of lanes on the highway by half. That often causes traffic jams, unless traffic is light.





General Hardware Recommendations

The more cores, the more cache and the higher the clock speed, the better. Intel processors are preferred over AMD processors, which lack SSE 4.1+ support. SSE 4.1+ is heavily used during CPU-intensive tasks (read: AVCHD and MPEG tasks). The basic entry level for new systems appears to be the i7-930/950 for economical systems and the i7-970/980X for more high-end systems. i7 systems from any series below the 9xx series are not advised, let alone the i3 or i5 series. They suffer too much from the memory controller, the chipset on the motherboard or the lack of Hyper Threading.

The more capabilities to adjust clock speed and memory speed, the better. Overclocking can lead to substantial gains, and HP, Dell and the like do not allow it. Be warned. The major brands suffer from a lack of overclocking ability, fixed (low) memory speeds and configurations unsuitable for editing. It is better to build the system yourself or turn to a reputable custom builder with demonstrated expertise in video editing. X58 motherboards are currently the best choice.

Definitely use a CUDA/MPE capable video card. It can reduce rendering time by a factor of 10 and assists with scaling on export, while improving export quality. SLI is not a consideration, since it is not supported. For the time being ATI is out of the game and only nVidia cards with 1 GB+ video memory are worth considering.

Specifically for MPEG encoding, the amount of memory is critical. The more the better. 24 GB is far better than 12 GB. The faster the memory, the better. First is rating (1600 or 1866), then CAS latency. Use at least 12 GB but preferably even more. To use the faster memory, BIOS adjustments are required.

The faster the disk(s), the better. Raids do improve performance. Notice that all Top 20 Performers use Raid configurations and sometimes even multiple Raids. Even SSDs, though widely touted for their speed, benefit significantly from Raid configurations.