Dithering and Per-pixel Ops

Dithering makes a good example of how cogen can make an abstract implementation run fast. This section sketches a procedure that dithers a 24-bit rgb image down to a color-cube of any size. The rgb image's channels may have any linear organization. A software cache converts loop code with byte loads into code with word loads, shifts, and masks; it works by keeping the low two bits of a loop index static. The ditherer is built from the following components: a 1D linear looper, the software cache, a hash-noise dither, and a byte permuter.
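
As a rough illustration of the hash-noise dither component, here is a minimal Python sketch. The hash constants, the function names, and the use of one noise value per pixel are all assumptions made for illustration; the text does not specify them. The parameter `levels` is the side of the color-cube, which is only known at runtime.

```python
def hash_noise(x, y):
    """Hypothetical position hash mapped to [0, 1); any cheap integer hash would do."""
    h = (x * 374761393 + y * 668265263) & 0xFFFFFFFF
    h = ((h ^ (h >> 13)) * 1274126177) & 0xFFFFFFFF
    h ^= h >> 16
    return h / 2**32


def dither_channel(v, levels, noise):
    """Quantize an 8-bit sample v to 0..levels-1, using noise in [0, 1) as the threshold."""
    scaled = v * (levels - 1) / 255.0      # position in [0, levels - 1]
    return min(levels - 1, int(scaled + noise))


def dither_pixel(r, g, b, x, y, levels):
    """Map one rgb pixel at (x, y) to an index into a levels^3 color-cube."""
    n = hash_noise(x, y)
    ri = dither_channel(r, levels, n)
    gi = dither_channel(g, levels, n)
    bi = dither_channel(b, levels, n)
    return (ri * levels + gi) * levels + bi
```

For example, `dither_pixel(255, 255, 255, 3, 7, 6)` selects the last entry of a 6x6x6 cube. Since `levels` is a runtime constant, the multiplications and divisions by it are natural candidates for specialization.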

Why RTCG: The size of the destination color-cube depends on how many unallocated colors are available in the X server, and thus isn't known until runtime. Furthermore, the user may want to display images from different sources with different channel layouts.

In this section, the partially-static integers and the operations on them are divided into static and dynamic parts manually. As far as I know, this is a new form of binding time improvement. It seems likely that applying the idea behind partially static structures to partially static bit-fields could automate this division. Eventually an analogue of variable splitting might be used to further optimize a number of bit-preserving operations.

If x is a dynamic variable, then x2 denotes its static low two bits, so a partially static integer is written as the pair (x, x2).

; index and offset.  one offset per channel.
loop((i, i2), ((o, o2) ...)) =
  if (i2 == max2 and
      i == max)
    return;
  for each (o, o2):
    b = byte-load-cached(heap-p, o, o2);
  f(b, ...);
  loop(plus((i,i2),(0,1)), (plus((o,o2),step2) ...))

; m is the number of values the static part can take (4 for two low bits)
plus((a, a2), (b, b2)) =
  e = a2 + b2;
  f = a + b;
  if (e >= m)
    (f + 1, e - m)
  else
    (f, e)
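
The addition above can be checked with a small Python sketch. It assumes that m is the number of values the static part can take (4 for two bits), so that the carry fires as soon as the static sum reaches m. The helpers `split` and `join` are hypothetical names introduced here.

```python
M = 4  # assumed modulus: two static low bits give values 0..3


def split(n):
    """Split an integer into (dynamic high part, static low two bits)."""
    return (n >> 2, n & 3)


def join(a, a2):
    """Recombine a split integer."""
    return a * M + a2


def plus(x, y):
    """Add two split integers, carrying from the static part into the dynamic part."""
    (a, a2), (b, b2) = x, y
    e = a2 + b2
    f = a + b
    if e >= M:
        return (f + 1, e - M)
    return (f, e)
```

The point of the representation is that the test on e involves only static values, so the generated code contains either `f + 1` or `f`, never the comparison itself.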

; state: (p0, wo0, w0)
; records the previous heap pointer, the previous word offset, and the cached word
; when is the state initialized???
byte-load-cached(p, wo, bo) =
  if (p == p0 and wo == wo0)
    w = w0
  else
    w0 = w = word-load(p, wo)
    wo0 = wo
    p0 = p
  b = byte(w, bo)
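
A minimal Python sketch of this one-entry cache follows. It is an illustration under stated assumptions, not the actual implementation: the heap is modeled as a bytes object, words are 32-bit little-endian, and the state starts out empty (one possible answer to the initialization question). The `loads` counter is added only to make the saving visible.

```python
import struct


class WordCache:
    """One-entry cache: remembers the last word loaded and where it came from."""

    def __init__(self, heap):
        self.heap = heap      # bytes-like object standing in for the heap
        self.p0 = None        # previous base pointer (None means empty cache)
        self.wo0 = None       # previous word offset
        self.w0 = 0           # cached word
        self.loads = 0        # number of word loads actually performed

    def word_load(self, p, wo):
        self.loads += 1
        return struct.unpack_from("<I", self.heap, p + 4 * wo)[0]

    def byte_load(self, p, wo, bo):
        # Reload only when the requested word differs from the cached one.
        if p != self.p0 or wo != self.wo0:
            self.w0 = self.word_load(p, wo)
            self.p0, self.wo0 = p, wo
        # Little-endian: byte 0 is the least significant byte of the word.
        return (self.w0 >> (8 * bo)) & 0xFF
```

Reading 16 consecutive bytes through `byte_load` performs 4 word loads instead of 16 byte loads.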

Elimination of the cache test requires variable splitting and improving the binding time of the eq? function so that (eq? v v) --> #t.

When the compiler runs, it could produce, e.g.:

; assumes packed, interleaved rgb, so the three channel offsets
; collapse into one dynamic word offset o
loop(i, o) =
  w = word-load(heap-p, o);
  r = byte(w, 0);
  g = byte(w, 1);
  b = byte(w, 2);
  f(r, g, b);
  r = byte(w, 3);
  w = word-load(heap-p, o+1);
  g = byte(w, 0);
  b = byte(w, 1);
  f(r, g, b);
  r = byte(w, 2);
  g = byte(w, 3);
  w = word-load(heap-p, o+2);
  b = byte(w, 0);
  f(r, g, b);
  r = byte(w, 1);
  g = byte(w, 2);
  b = byte(w, 3);
  f(r, g, b);

  if (i == max) return;

  loop(i+1, o+3);
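
To show where such residual code comes from, here is a toy generator in Python that emits the loop body as text. It is a sketch, not the cogen itself: it assumes a packed, interleaved layout in which consecutive words sit at consecutive word offsets from a dynamic base `o`, and the name `gen_body` is hypothetical. Because the byte position is static, the cache test is decided at generation time and never appears in the output.

```python
def gen_body(channels=("r", "g", "b"), bytes_per_word=4):
    """Emit residual code for one pass over bytes_per_word pixels."""
    lines = []
    cached = None                        # static index of the word currently held in w
    pos = 0                              # static byte position within the unrolled body
    for _ in range(bytes_per_word):      # one iteration per pixel
        for ch in channels:
            word, byte = divmod(pos, bytes_per_word)
            if word != cached:           # static cache miss: emit a word load
                lines.append(f"w = word-load(heap-p, o + {word});")
                cached = word
            lines.append(f"{ch} = byte(w, {byte});")
            pos += 1
        lines.append(f"f({', '.join(channels)});")
    return lines
```

Calling `print("\n".join(gen_body()))` yields four calls to f fed by three word loads. Passing a different tuple of channel names changes the unrolling without touching the generator.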

Similar optimizations include allowing the user to write a one-channel filter function, and then applying it to grayscale, rgb, and rgba images without loss of performance. Applying such a function to an 8-bit indexed image is the next step.