Benchmarks

This chapter reports on the construction and measurement of several media-processing kernels. First I set aside automatic specialization and compare manual implementations of the alternative strategies (interpreted, buffered, and specialized) from Section tradeoffs. Then I present benchmark data for the residual code from Nitrous and Simple, the two implementations of bit-addressing. A partial implementation (cyclic values only) is described briefly in Appendix ba.


Except as noted below, the methodology is as follows: I use GCC v2.7.2 with the -O1 option. Although I collected some data with the -O2 option, it was not significantly different, so I discarded it.

Each of the examples was run for 1000 iterations, and real elapsed time was measured with the gettimeofday system call. The whole suite was run five times, and the best times were taken.
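The measurement loop can be sketched as below. This is an illustrative harness, not the thesis code; the names `elapsed_us` and `time_kernel` are invented, and the real benchmarks differ in detail.

```c
#include <sys/time.h>
#include <stddef.h>

/* Return elapsed wall-clock microseconds between two gettimeofday samples. */
static long elapsed_us(struct timeval start, struct timeval end) {
    return (end.tv_sec - start.tv_sec) * 1000000L
         + (end.tv_usec - start.tv_usec);
}

/* Run `kernel` for `iters` iterations and return elapsed microseconds.
   The kernel pointer is a hypothetical stand-in for each benchmark body. */
static long time_kernel(void (*kernel)(void), int iters) {
    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (int i = 0; i < iters; i++)
        kernel();
    gettimeofday(&t1, NULL);
    return elapsed_us(t0, t1);
}
```

Taking the best of several whole-suite runs, as described above, filters out cache-cold starts and scheduler noise.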

The legend of each bar-chart indicates which data-sets come from which machines. The machines used to collect the data include:

R4k
SGI Indigo 2 with a 150MHz R4400 running IRIX 5.3.
P5
IBM ThinkPad 560 with a 133MHz Pentium running Linux 2.0.27.

The height of a bar indicates the ratio of the two execution times specified in the caption.

The software is available from http://www.cs.cmu.edu/~spot. The distributions contain the source code for all the programs measured below.

Manual

The graphs in Figure genspec compare specialization to interpretation and buffering. The general, unspecialized versions here are built on load_sample directly (like sum in Figure sum-signal) rather than with the interface in Figure sig. As expected, the unspecialized programs run many times slower than their specialized counterparts.

The buffered code uses typical operations on vectors of whole-word integers, such as multiplying every member by a scalar or adding one vector into another. It also uses encode and decode routines that copy a signal between its packed representation and a whole-word vector. These routines have a special case for bytes; otherwise they call load/store_sample. The vectors are 200 words long. The cs68 benchmark uses three passes, and rgb2m1 uses ten. The code appears in Appendix manual.
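A minimal sketch of the buffered strategy follows, assuming 8-bit packed samples (the special case mentioned above); the function names are invented for illustration, and the real routines fall back to load/store_sample for other widths.

```c
#include <stdint.h>
#include <stddef.h>

#define BUF_WORDS 200  /* buffer length used in the measurements */

/* Decode: copy packed 8-bit samples into a whole-word vector. */
static void decode_bytes(const uint8_t *src, int32_t *dst, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* A typical vector pass: multiply every member by a scalar. */
static void vec_scale(int32_t *v, size_t n, int32_t k) {
    for (size_t i = 0; i < n; i++)
        v[i] *= k;
}

/* Encode: copy the word vector back to the packed representation. */
static void encode_bytes(const int32_t *src, uint8_t *dst, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = (uint8_t)src[i];
}
```

Each extra pass (cs68 makes three, rgb2m1 ten) is another full traversal of the word vector between decode and encode, which is where the buffering overhead accumulates.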

[ThoDa95] analyzes buffered audio synthesis when all the data fit into an on-chip cache. They find (and this is corroborated by anecdotal evidence) that buffering reduces bandwidth by about 30%, and the cost is fairly independent of the number of passes over the buffers. This data set indicates that the cost can be more than 200% and may grow with the number of passes. The lower overheads may result from using application-specific vector primitives; in other words, some amount of manual specialization has already taken place.



Figure genspec: The graph on the left shows the execution time of general code normalized to specialized code. The right shows buffered code normalized to specialized code.

Next, I compare using char* pointers to read individual bytes against reading whole words and using shifts and masks (as bit-addressing does). Figure bitsw shows that reading words runs slightly faster despite using more instructions. I expect these results are rather dependent on the microarchitecture. The sum8x4 and sum8x8 benchmarks add up the members of a contiguous vector of bytes; the byte loop is unrolled four and eight times, respectively (the word version is always unrolled four times). The rgb2m3 benchmark is explained below.
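The two strategies can be sketched as follows; this is an illustrative reconstruction of sum8x4, not the measured code, and the function names are invented. The byte order in which the word version extracts fields depends on endianness, but the sum does not.

```c
#include <stdint.h>
#include <stddef.h>

/* Byte version: add up a vector of bytes with char*-style loads,
   unrolled four times.  n must be a multiple of 4. */
static unsigned sum_bytes_x4(const uint8_t *p, size_t n) {
    unsigned s = 0;
    for (size_t i = 0; i < n; i += 4)
        s += p[i] + p[i + 1] + p[i + 2] + p[i + 3];
    return s;
}

/* Word version: load a whole 32-bit word, then extract the four byte
   fields with shifts and masks (more instructions, fewer loads).
   Assumes the data are 4-byte aligned. */
static unsigned sum_bytes_word(const uint8_t *p, size_t n) {
    const uint32_t *w = (const uint32_t *)p;
    unsigned s = 0;
    for (size_t i = 0; i < n / 4; i++) {
        uint32_t x = w[i];
        s += (x & 0xff) + ((x >> 8) & 0xff)
           + ((x >> 16) & 0xff) + ((x >> 24) & 0xff);
    }
    return s;
}
```

The word version issues one load per four samples instead of four, trading load-port pressure for ALU work, which is why the outcome depends on the microarchitecture.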



Figure bitsw: Execution time of byte-loads normalized to word-loads with shifts and masks.

Nitrous

This section presents a small data-set from the Nitrous system. It compares bit-addressing and automatic specialization to hand-written code. The programs of Figure nitrous-times were implemented in root, including the cache and signal library. All the residual root code is translated into one large C procedure, using GCC's indirect-goto extension for jumps. These examples were run on 2500 bytes of data instead of 4000.

nybble4
sum a vector of 4-bit samples.
nybble12
sum a vector of 12-bit samples.
filter3
filter a word-vector with a kernel of size three.
filter7
filter a word-vector with a kernel of size seven.



Figure nitrous-times: Execution time of root code automatically specialized by Nitrous normalized to time of hand-specialized code.

Simple

In order to scale up the examples I built Simple, an online specializer that does not use shift/reset or continuations and restricts dynamic control flow to loops (i.e., sum and arrow types are not fully handled). All procedure calls in the source programs are expanded, but the input language is extended with a while-loop construct that may be residualized:

    Exp ::= ... | loop Var Exp Exp Exp Exp

which is equivalent to the definition and application of the recursive procedure:

    let fun G Var = if Exp then Exp else G Exp
    in G Exp end

The loop construct is specialized according to the dynamic conditional heuristic and memoization: it is inlined until the predicate is dynamic, then the loop is entered and unrolled until the predicate is dynamic again. At this point, the static part must match the static part at the previous dynamic conditional.

As described in Chapter pe, because Simple does not perform let-insertion, it may duplicate or forget side-effects. Since GCC performs common subexpression elimination, most but not all of the duplication is eliminated.

The input to Simple is also the Sal language, but with an ML-like syntax and the above restrictions. Examples of its use appear in Figures assumptions and fircode, and in Appendix benche. Sharing information is specified with unique names (constants, really) created with a quote syntax; the names are ordinary first-class values. The clone primitive copies a data-structure including its sharing information, creating a new value with the same pattern of internal sharing but sharing nothing with any other value.

Simple differs from Nitrous in several ways. In Nitrous, a generated compiler knows the shapes of the dynamic values, which are the names of their locations in the residual code; early equality works by comparing these locations. In Simple, a dynamic value is associated with an expression; early equality works by textual equality of these expressions.

Simple does not provide premultiplied cyclic values (see Section cyc-rep). Nor does it use late normalization (see Section npi).

Simple is implemented in SML/NJ without concern for execution speed. The specializer requires fractions of a second to produce the examples presented here.

Example

The main example built with Simple is an audio/vector library. It provides the signal type, constructors that create signals from scalars or sections of memory, combinators such as creating a signal that is the sum of two other signals, and destructors such as copy and reduce. The vector operations are suspended in constructed data until a destructor is called. Figure fir contains a graphical representation of this kind of program.

Figure sig gives the signature for part of the library. The semantics and implementation are mostly trivial; most of the code appears in Appendix signals. One exception is that operations on multiple signals use a conjunction of the end-tests, as described in Section multiple. Consider the end-test of an infinite signal such as a constant or noise (a signal of pseudo-random numbers): it should return true, because these signals can end anywhere, rather than false because they never end on their own.


sig
  type samp
  type signal
  type baddress
  type binop = samp * samp -> samp

  fun vector_signal: baddress * baddress * int * int -> signal
  fun constant: samp -> signal

  fun map: (samp -> samp) * signal -> signal
  fun map2: binop * signal * signal -> signal
  fun delay1: signal * samp -> signal
  fun scan: signal * samp * binop -> signal
  fun lut: baddress * signal -> signal
  fun sum_tile: samp * signal * int -> signal

  fun copy: signal * signal -> unit
  fun reduce: signal * binop * samp -> samp

  fun filter: signal * (samp list) * (samp list) -> signal
  fun fm_osc: signal * int * baddress * int * signal * int -> signal
end


Figure sig: Signature for signal library.
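The end-test convention can be sketched in C as a pull-style signal of function pointers. This is an invented rendering (the library itself is written in Sal), but it shows why a constant's end-test answers true and how a binary combinator takes the conjunction of its inputs' end-tests.

```c
#include <stdbool.h>
#include <stddef.h>

/* A minimal pull-style signal: next() yields a sample, end() reports
   whether the signal may end here.  Names and layout are hypothetical. */
typedef struct signal {
    int  (*next)(struct signal *s);
    bool (*end)(struct signal *s);
    struct signal *a, *b;   /* inputs, used by combinators */
    int state;              /* scratch: the constant's value, etc. */
} signal;

/* constant: an infinite signal.  Its end-test answers true (it can end
   anywhere), rather than false (it never ends on its own). */
static int  const_next(signal *s) { return s->state; }
static bool const_end (signal *s) { (void)s; return true; }

/* map2 instantiated with (+): the combined end-test is the conjunction
   of the inputs' end-tests, so the result ends exactly where every
   finite input ends, and infinite inputs never block termination. */
static int  add_next(signal *s) { return s->a->next(s->a) + s->b->next(s->b); }
static bool add_end (signal *s) { return s->a->end(s->a) && s->b->end(s->b); }
```

Under this convention, adding a finite vector signal to a constant terminates when the vector does, because the constant's end-test is always true and drops out of the conjunction.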

The delay1 operator returns a signal of the same length as its input; thus it loses the last sample of the input signal. The other possibility (returning a signal one sample longer) requires sum-types, because there would be a dynamic conditional in the next method.
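Concretely, over plain arrays the behavior is as follows (a hypothetical C rendering, not the library code):

```c
#include <stddef.h>

/* delay1: the output has the same length as the input; out[0] is the
   initial sample, and the input's last sample is dropped. */
static void delay1(const int *in, int *out, size_t n, int init) {
    if (n == 0)
        return;
    out[0] = init;
    for (size_t i = 1; i < n; i++)
        out[i] = in[i - 1];
}
```

For input {1, 2, 3} with initial sample 0, the output is {0, 1, 2}: the same length, with the final 3 lost.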

The filter combinator is built out of a recursive series of delays, maps, and binops. Similarly, the wavetable combinator is built from simpler components. The lut combinator transforms a signal by looking up each sample in an array in memory where the translations are stored one per word. A general version that used another signal as the look-up table would require a translate method, not shown here. Translate is also required for the 2D graphics interface, but I have not yet finished it, so I cannot present it here.

A feedback combinator can be made in a higher-order language.


With this system, interleaved vectors are stored in the same range of memory; Figure combined(c) shows an example of three interleaved vectors. With an ordinary vector package, if one passes interleaved vectors to a binary operation, then each input word is read twice. An on-chip hardware cache makes the second read inexpensive, but with the software cache the situation is detected once, at code-generation time: specialization with sharing can replace a cache hit with a register reference.
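The effect of specializing with sharing can be sketched as follows, assuming two 8-bit signals interleaved as a,b,a,b,... (an illustrative reconstruction; names are invented):

```c
#include <stdint.h>
#include <stddef.h>

/* Add two interleaved 8-bit signals.  A naive vector package would load
   each 16-bit pair twice, once per input; the residual code a specializer
   could produce loads it once and splits the fields with shifts and
   masks, so the "second read" becomes a register reference. */
static void add_interleaved(const uint16_t *ab, uint8_t *out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        uint16_t x = ab[i];               /* one load serves both inputs */
        uint8_t a = (uint8_t)(x & 0xff);  /* first interleaved signal   */
        uint8_t b = (uint8_t)(x >> 8);    /* second interleaved signal  */
        out[i] = (uint8_t)(a + b);        /* the binary operation       */
    }
}
```

The load of x happens once per pair; both "reads" of the input word after that are just uses of a register.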

Collapsible protocol languages such as [HaRe96, ProWa96] can handle more advanced control flow (our signals are push or pull, not both), but these systems do not address bits. The same is true of synchronous real-time languages like Signal [GuBoGaMa91] and Lustre [CPHP87]. Their compilers are mostly concerned with deriving global timing from local behavior.

Past work in bit-level processing has not emphasized implementation on word machines. Although C allows one to specify the number of bits used to store a field in a structure, it does not provide for arrays of bits. The hardware design language VHDL [IEEE91] allows specification of signals with particular numbers of bits, but lacks a compiler that produces fast code.


There are two groups of examples, the audio group (Figure simaud) and the video group (Figure simvid). The audio group uses 2000-byte buffers and 16-bit signals; the video group uses 4000-byte buffers and mostly 8-bit signals.

The graphs show the ratio of the execution time of the code generated by Simple to manually written C code. The data indicate that, with some exceptions, the runtimes are comparable. I suspect the exceptions result from aliasing preventing GCC's CSE.

In the audio group, the manual code was written using short* pointers and processing one sample per iteration. In the video group, it was written using whole-word memory operations and immediate-mode shifts and masks. Some of the code appears in Appendix manual.

Some of the static information used to create the specialized loops appears in Appendix benche. These are generally arguments to the interpreter copy, which is used for all the audio examples. The video examples also use copy, except iota, sum, and sum12.

The audio examples operate on sequential aligned 16-bit data unless noted otherwise:

inc
add 10 to each sample.
add
two signals to form a third.
filter2
filter with kernel width 2.
filter5
filter with kernel width 5. The manual code doesn't unroll the inner loop over the kernel.
fm1
a one oscillator wave-table synthesizer.
fm2
a one-in-one oscillator wave-table synthesizer.
lut
a look-up table of size 256. The input signal is 8-bits per pixel.
sum
add up all the samples in the input.
wavrec
a wave-table synthesizer with feedback.

The video examples operate on sequential aligned 8-bit data unless noted otherwise:

copy
no operation.

gaps
destination signal has stride 16 and size 8.

cs68
converts binary to ASCII by reading a six-bit signal and writing eight.

cs86
ASCII to binary by reading eight and writing six.

iota
fills bytes with 0, 1, 2, ...

sum
as in Figure sum-signal, specialized as in Figure fast-sum.

sum12
a twelve-bit signal, as in Figure twelve.

rgb2m1, rgb2m2, rgb2m3
convert color to monochrome. The input pixels are laid out as in Figures rgbm1, rgbm2, and rgbm3d, respectively. The output data are sequential bytes in each case.
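The cs68 conversion above can be sketched in C. This is an illustrative reconstruction under one packing assumption (four 6-bit samples per 3-byte group, most significant field first); the measured code works on whole words, but the field extraction is the same shifts-and-masks idea.

```c
#include <stdint.h>
#include <stddef.h>

/* cs68-style conversion: read a packed six-bit signal and write one
   sample per byte.  Every 3 input bytes (24 bits) hold four 6-bit
   samples; `groups` is the number of 3-byte groups. */
static void unpack6to8(const uint8_t *in, uint8_t *out, size_t groups) {
    for (size_t g = 0; g < groups; g++) {
        uint32_t bits = ((uint32_t)in[3*g]     << 16)
                      | ((uint32_t)in[3*g + 1] << 8)
                      |  (uint32_t)in[3*g + 2];
        out[4*g]     = (bits >> 18) & 0x3f;  /* first 6-bit field  */
        out[4*g + 1] = (bits >> 12) & 0x3f;
        out[4*g + 2] = (bits >> 6)  & 0x3f;
        out[4*g + 3] =  bits        & 0x3f;  /* last 6-bit field   */
    }
}
```

The cs86 benchmark runs the same shifts in the opposite direction, reading a byte per sample and packing four samples into each 3-byte group.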



Figure simvid: Video group. Execution time of code automatically specialized by Simple normalized to time of hand-specialized code.



Figure simaud: Audio group. Execution time of code automatically specialized by Simple normalized to time of hand-specialized code.