2012-06-29

2D Interpolation, Part 5: Final Optimizations

The code now has the proper shape to carry the optimizations to their logical consequences, to the point that I question the wisdom of some of the previous transformations. It is true that in compiler construction some passes introduce constructs only for later passes to eliminate them, in the hope of an overall simplification. In this case the starting point will be the elimination of the pixel-by-pixel dispatch to the required interpolator, by hoisting the switch out of the inner loop into a top-level procedure. I keep the last version of the interpolator as interp1, applying the same trick I used before to compare pixels:

void interp0() {
  switch (method) {
  case NEAREST:  interpN(); break;
  case BILINEAR: interpL(); break;
  case BICUBIC:  interpC(); break;
  }
}

The initial versions of these interpolators are three identical copies of the previous interp0, appropriately renamed. For the sake of brevity I will omit these copies, showing instead the final, simplified form of each one. In the case of interpN(), I first eliminate the switch. Then I eliminate all references to the repeated border (the variables having 0 or 3 as indices). Since there is no need to repeat elements, I can now access the source array directly. This in turn allows me to simplify all index computations, adjusting the condition on the last column as required. In fact, there is no need to keep count of the rows, as yi already does that, so the end condition on the outer loop gets radically simpler. A further optimization I can afford is to eliminate the need for the floating-point values yf, xf by inlining the integer calculation directly into the conditional (the test yf < 0.5 becomes the integer comparison 2*yr < height, and likewise for xf and xr):

void interpN() {
  int yr = 0;
  int yi = 1;
  int offset = 0;
  for (int i = 0; i < height; i++) {
    int c11 = srcpix[(yi-1)*ncols+0], c12 = srcpix[(yi-1)*ncols+1];
    int c21 = srcpix[ yi   *ncols+0], c22 = srcpix[ yi   *ncols+1];
    int xr = 0;
    int xi = 1;
    for (int j = 0; j < width; j++) {
      final int c = 2*yr < height ? 2*xr < width ? c11 : c12 : 2*xr < width ? c21 : c22;
      pixels[offset++] = c;
      xr += ncols - 1;
      if (xr >= width) {
        xr -= width;
        xi += 1;
        if (xi < ncols) {
          c11 = c12; c12 = srcpix[(yi-1)*ncols+xi];
          c21 = c22; c22 = srcpix[ yi   *ncols+xi];
        }
      }
    }
    yr += nrows - 1;
    if (yr >= height) {
      yr -= height;
      yi += 1;
    }
  }
}

The simplification of the linear interpolator interpL() proceeds exactly as before. The access pattern is identical, since the stencil is 2×2, so the end result is the same except for the pixel calculation:

void interpL() {
  int yr = 0;
  int yi = 1;
  int offset = 0;
  for (int i = 0; i < height; i++) {
    final float yf = (float) yr / (float) height;
    int c11 = srcpix[(yi-1)*ncols+0], c12 = srcpix[(yi-1)*ncols+1];
    int c21 = srcpix[ yi   *ncols+0], c22 = srcpix[ yi   *ncols+1];
    int xr = 0;
    int xi = 1;
    for (int j = 0; j < width; j++) {
      final float xf = (float) xr / (float) width;
      final int c = bilinear(xf, yf, c11, c12, c21, c22);
      pixels[offset++] = c;
      xr += ncols - 1;
      if (xr >= width) {
        xr -= width;
        xi += 1;
        if (xi < ncols) {
          c11 = c12; c12 = srcpix[(yi-1)*ncols+xi];
          c21 = c22; c22 = srcpix[ yi   *ncols+xi];
        }
      }
    }
    yr += nrows - 1;
    if (yr >= height) {
      yr -= height;
      yi += 1;
    }
  }
}

This leaves me with the bicubic interpolator interpC(). On the one hand there is neither need nor possibility to change anything besides eliminating the switch, as all previous transformations were based on the use of a 4×4 stencil with repetitions. On the other, I always ask myself whether I can do better. I reason as follows: the top-left corner of the stencil ranges over the source from pixel (-1, -1) to pixel (ncols−3, nrows−3); in case of under- or overflow the missing pixels are repeated. This means that using min and max as needed fetches the repetitions for free! In general, the row index will be min(max(yi-d, 0), nrows-1), but knowing that yi ranges from 2 to nrows, only yi−3 needs the max and only yi needs the min. The same is true for the repetition of the first and last columns. In the inner loop I use the same trick I did with the arrayCopy(), that is, move the old pixels along and fetch the new one only if in range:

void interpC() {
  int yr = 0;
  int yi = 2;
  int offset = 0;
  for (int i = 0; i < height; i++) {
    final float yf = (float) yr / (float) height;
    int c00 = srcpix[max(yi-3,     0)*ncols+0];
    int c01 = srcpix[max(yi-3,     0)*ncols+0];
    int c02 = srcpix[max(yi-3,     0)*ncols+1];
    int c03 = srcpix[max(yi-3,     0)*ncols+2];
    int c10 = srcpix[   (yi-2       )*ncols+0];
    int c11 = srcpix[   (yi-2       )*ncols+0];
    int c12 = srcpix[   (yi-2       )*ncols+1];
    int c13 = srcpix[   (yi-2       )*ncols+2];
    int c20 = srcpix[   (yi-1       )*ncols+0];
    int c21 = srcpix[   (yi-1       )*ncols+0];
    int c22 = srcpix[   (yi-1       )*ncols+1];
    int c23 = srcpix[   (yi-1       )*ncols+2];
    int c30 = srcpix[min(yi  ,nrows-1)*ncols+0];
    int c31 = srcpix[min(yi  ,nrows-1)*ncols+0];
    int c32 = srcpix[min(yi  ,nrows-1)*ncols+1];
    int c33 = srcpix[min(yi  ,nrows-1)*ncols+2];
    int xr = 0;
    int xi = 2;
    for (int j = 0; j < width; j++) {
      final float xf = (float) xr / (float) width;
      final int c = bicubic(xf, yf,
            c00, c01, c02, c03, c10, c11, c12, c13,
            c20, c21, c22, c23, c30, c31, c32, c33);
      pixels[offset++] = c;
      xr += ncols - 1;
      if (xr >= width) {
        xr -= width;
        xi += 1;
        c00 = c01; c01 = c02; c02 = c03;
        c10 = c11; c11 = c12; c12 = c13;
        c20 = c21; c21 = c22; c22 = c23;
        c30 = c31; c31 = c32; c32 = c33;
        if (xi < ncols) {
          c03 = srcpix[max(yi-3,      0)*ncols+xi];
          c13 = srcpix[   (yi-2        )*ncols+xi];
          c23 = srcpix[   (yi-1        )*ncols+xi];
          c33 = srcpix[min(yi  ,nrows-1)*ncols+xi];
        }
      }
    }
    yr += nrows - 1;
    if (yr >= height) {
      yr -= height;
      yi += 1;
    }
  }
}

Note that the stencil is loaded, both at the top of the outer loop and when refilled inside the inner one, with full knowledge of the range of values that yi and xi can take. Note also that eliminating the conditions implicit in the min and max calculations would require inverting the loops so that the control variables are yi and xi. This is possible but would require modular arithmetic on the destination indices; I would have to start the overall process from scratch.
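
Before wrapping up, here is a rough timing harness (my own sketch, not part of the original post) to gauge what the specialization buys. It assumes it is called from draw() after loadPixels(), and that interp1() is still the monolithic switch-in-the-loop version kept around for the pixel comparison, so the numbers are indicative at best:

void bench() {
  final int runs = 50;
  final long t0 = System.nanoTime();
  for (int k = 0; k < runs; k++) interp1();   // previous version, switch inside the inner loop
  final long t1 = System.nanoTime();
  for (int k = 0; k < runs; k++) interp0();   // new version, dispatching to interpN/L/C
  final long t2 = System.nanoTime();
  println("interp1: " + (t1 - t0) / 1e6 / runs + " ms/frame");
  println("interp0: " + (t2 - t1) / 1e6 / runs + " ms/frame");
}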

In the meantime, I suspect that the introduction of row buffers was a red herring (although I swear an innocent, unwitting one). I didn't think of using min and max until the code was focused enough for me to see it from another angle. What in hindsight appears perfectly obvious can be, and often is, obscured by unrelated considerations. Selecting the "best" strategy at each step requires knowing the full path to the goal, but when faced with limited information the best we can hope for is that our heuristics don't mislead us too far away. In any case backtracking is inevitable for those of us lacking the gift of foreknowledge, and I think it much more honest (if boring) to fully disclose my veering into blind alleys.

The take-away lesson of this (admittedly repetitious, if not downright tedious) exercise is to show how to manually apply some of the strategies that modern parallelizing compilers use automatically to vectorize and optimize straight-loop code, including the use of comparisons with unoptimized versions to assess and ensure the correctness of the transformations. When faced with optimized code I am often left wondering what process the author followed to arrive at that result; my own process is rather pedestrian and methodical and without any magic tricks other than a basic knowledge of high-school arithmetic.

2012-06-28

2D Interpolation, Part 4: ARGB Interpolation

After linearizing all array accesses, interpolating ARGB values is easy from the algorithmic side of things; so easy things first. I'll handle the pixel interpolation proper by abstracting it away into its own little functions. Processing an ARGB source array is then just a matter of changing the variable declarations:

final int[] srcpix = {
0xff0051ff, 0xff00fffb, 0xff9cff00, 0xff0051ff,
0xffff0000, 0xff00ff08, 0xffffdb00, 0xff00fffb,
0xff9cff00, 0xff00fffb, 0xff0051ff, 0xffffdb00,
0xffffdb00, 0xff9cff00, 0xff00fffb, 0xff00ff08,
};
final int nrows = 4;
final int ncols = 4;

void interp0() {
  final int[] p = new int[4*ncols+8];
  p[1*ncols+2] = srcpix[0*ncols]; arrayCopy(srcpix, 0*ncols, p, 1*ncols+3, ncols); p[2*ncols+3] = srcpix[1*ncols-1];
  p[2*ncols+4] = srcpix[1*ncols]; arrayCopy(srcpix, 1*ncols, p, 2*ncols+5, ncols); p[3*ncols+5] = srcpix[2*ncols-1];
  p[3*ncols+6] = srcpix[2*ncols]; arrayCopy(srcpix, 2*ncols, p, 3*ncols+7, ncols); p[4*ncols+7] = srcpix[3*ncols-1];
  arrayCopy(p, 1*ncols+2, p, 0, ncols+2);
  int yr = 0;
  int yi = 2;
  int offset = 0;
  for (int i = 0; i < height; i++) {
    final float yf = (float) yr / (float) height;
    int c00 = p[0*ncols+0], c01 = p[0*ncols+1], c02 = p[0*ncols+2], c03 = p[0*ncols+3];
    int c10 = p[1*ncols+2], c11 = p[1*ncols+3], c12 = p[1*ncols+4], c13 = p[1*ncols+5];
    int c20 = p[2*ncols+4], c21 = p[2*ncols+5], c22 = p[2*ncols+6], c23 = p[2*ncols+7];
    int c30 = p[3*ncols+6], c31 = p[3*ncols+7], c32 = p[3*ncols+8], c33 = p[3*ncols+9];
    int xr = 0;
    int xi = 3;
    for (int j = 0; j < width; j++) {
      final float xf = (float) xr / (float) width;
      int c = 0;
      switch (method) {
      case NEAREST:
        c = yf < 0.5 ? xf < 0.5 ? c11 : c12 : xf < 0.5 ? c21 : c22;
        break;
      case BILINEAR:
        c = bilinear(xf, yf, c11, c12, c21, c22);
        break;
      case BICUBIC:
        c = bicubic(xf, yf,
                    c00, c01, c02, c03, c10, c11, c12, c13,
                    c20, c21, c22, c23, c30, c31, c32, c33);
        break;
      }
      pixels[offset++] = c;
      xr += ncols - 1;
      if (xr >= width) {
        xr -= width;
        xi += 1;
        if (xi < ncols+2) {
          c00 = c01; c01 = c02; c02 = c03; c03 = p[0*ncols+xi+0];
          c10 = c11; c11 = c12; c12 = c13; c13 = p[1*ncols+xi+2];
          c20 = c21; c21 = c22; c22 = c23; c23 = p[2*ncols+xi+4];
          c30 = c31; c31 = c32; c32 = c33; c33 = p[3*ncols+xi+6];
        }
      }
    }
    yr += nrows - 1;
    if (yr >= height) {
      yr -= height;
      yi += 1;
      if (yi <= nrows) {
        arrayCopy(p, (ncols+2), p, 0, 3*(ncols+2));
        if (yi < nrows) {
          p[3*ncols+6] = srcpix[yi*ncols];
          arrayCopy(srcpix, yi*ncols, p, 3*ncols+7, ncols);
          p[4*ncols+7] = srcpix[(yi+1)*ncols-1];
        }
      }
    }
  }
}
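
A quick way to verify the srcpix constants above (my own helper, not in the original post; it assumes the hue() palette from Part 0 is in scope, and takes the borderless float source from Part 3 as the parameter floatSrc):

// Regenerates the ARGB source: every entry above is just hue() applied to the
// corresponding float of the previous part's srcpix.
int[] srcpixFromFloats(final float[] floatSrc) {
  final int[] pix = new int[floatSrc.length];
  for (int i = 0; i < floatSrc.length; i++)
    pix[i] = hue(floatSrc[i]);
  return pix;
}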

The pixel array is, indeed, just the result of applying hue() to each value of the original float array data. As you can see, I simplified the inner switch by extracting the interpolation schemes out to named routines. These routines have to unpack the four components of each pixel, interpolate each component and pack the result back into a pixel. The bilinear interpolator is straightforward:

static int bilinear(final float t, final float u,
                    final int c00, final int c01,
                    final int c10, final int c11) {
  final int a00=(c00>>24)&0xff, r00=(c00>>16)&0xff, g00=(c00>>8)&0xff, b00=c00&0xff;
  final int a01=(c01>>24)&0xff, r01=(c01>>16)&0xff, g01=(c01>>8)&0xff, b01=c01&0xff;
  final int a10=(c10>>24)&0xff, r10=(c10>>16)&0xff, g10=(c10>>8)&0xff, b10=c10&0xff;
  final int a11=(c11>>24)&0xff, r11=(c11>>16)&0xff, g11=(c11>>8)&0xff, b11=c11&0xff;
  final float a = linear(u, linear(t, a00, a01), linear(t, a10, a11));
  final float r = linear(u, linear(t, r00, r01), linear(t, r10, r11));
  final float g = linear(u, linear(t, g00, g01), linear(t, g10, g11));
  final float b = linear(u, linear(t, b00, b01), linear(t, b10, b11));
  return (int) a << 24 | (int) r << 16 | (int) g << 8 | (int) b;
}

Note that the pixels are unpacked in "row" (ARGB) order but interpolated in "column" order; these transpositions are typical of vectorized and GPU code. The result is very different to the previous version of the linear interpolator, since I'm interpolating vectors here, not scalars mapped to a color ramp:

bilinear interpolation

You can see very noticeable Mach banding in the form of color ridges and creases. hue() mitigated this effect by smootherstep'ping the parameter, and I can do the same here, by converting a float channel into a saturated (that is, range limited) int channel:

static int bsat(float x) {
  if (x < 0)
    return 0;
  else if (x > 255)
    return 255;
  x /= 255f;
  return (int) ((10 - (15 - 6 * x) * x) * x * x * x * 255);
}

Changing the last line in bilinear() to:

  return bsat(a) << 24 | bsat(r) << 16 | bsat(g) << 8 | bsat(b);

results in an image with the high-frequency components smoothed out or filtered:

bilinear interpolation with smoothing

Much nicer. In fact I now understand why there are so many interpolation options to resize an image in Adobe Photoshop®. In the case of the bicubic interpolation I quickly learned that some sort of clipping and/or rescaling is vital, since the "tension" given by the derivatives can make the interpolated values overshoot:

bicubic interpolation with channel overflow errors

This is easily seen by masking out every channel but one:

bicubic interpolation with channel overflow errors, red channel
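
The extent of the overshoot can be probed directly with a small snippet (mine, not part of the original sketch; it assumes the cubic() from Part 0 is in scope) that scans the bicubic surface over the worst-case checker stencil; since the cubic weights sum to one, the complementary stencil overshoots symmetrically above 255:

void probeOvershoot() {
  final float[][] m = {
    {   0, 255, 255,   0 },
    { 255,   0,   0, 255 },
    { 255,   0,   0, 255 },
    {   0, 255, 255,   0 }
  };
  float lo = 255;
  for (int i = 0; i <= 100; i++) {
    final float u = i / 100.0f;
    for (int j = 0; j <= 100; j++) {
      final float t = j / 100.0f;
      lo = min(lo, cubic(u,
          cubic(t, m[0][0], m[0][1], m[0][2], m[0][3]),
          cubic(t, m[1][0], m[1][1], m[1][2], m[1][3]),
          cubic(t, m[2][0], m[2][1], m[2][2], m[2][3]),
          cubic(t, m[3][0], m[3][1], m[3][2], m[3][3])));
    }
  }
  println("undershoot: " + lo + ", overshoot of the complement: " + (255 - lo));
}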

In fact, a little experimentation shows that interpolated values can go as low as -72 or as high as 327! The worst outliers are, of course, the result of interpolating high-frequency matrices of the form [[0 1 1 0] [1 0 0 1] [1 0 0 1] [0 1 1 0]]. Local rescaling is not a good choice, in my opinion, since it would wash out all channels equally; in particular, it would make images more transparent than they really are. The obvious choice is clipping, although doing that without filtering introduces artifacts:

bicubic interpolation with clipping

Using the bsat() above produces the best results:

bicubic interpolation with clipping and smoothing

The final code for the ARGB bicubic interpolator is:

static int bicubic(final float t, final float u,
                   final int c00, final int c01, final int c02, final int c03,
                   final int c10, final int c11, final int c12, final int c13,
                   final int c20, final int c21, final int c22, final int c23,
                   final int c30, final int c31, final int c32, final int c33) {
  final int a00=(c00>>24)&0xff, r00=(c00>>16)&0xff, g00=(c00>>8)&0xff, b00=c00&0xff;
  final int a01=(c01>>24)&0xff, r01=(c01>>16)&0xff, g01=(c01>>8)&0xff, b01=c01&0xff;
  final int a02=(c02>>24)&0xff, r02=(c02>>16)&0xff, g02=(c02>>8)&0xff, b02=c02&0xff;
  final int a03=(c03>>24)&0xff, r03=(c03>>16)&0xff, g03=(c03>>8)&0xff, b03=c03&0xff;
  final int a10=(c10>>24)&0xff, r10=(c10>>16)&0xff, g10=(c10>>8)&0xff, b10=c10&0xff;
  final int a11=(c11>>24)&0xff, r11=(c11>>16)&0xff, g11=(c11>>8)&0xff, b11=c11&0xff;
  final int a12=(c12>>24)&0xff, r12=(c12>>16)&0xff, g12=(c12>>8)&0xff, b12=c12&0xff;
  final int a13=(c13>>24)&0xff, r13=(c13>>16)&0xff, g13=(c13>>8)&0xff, b13=c13&0xff;
  final int a20=(c20>>24)&0xff, r20=(c20>>16)&0xff, g20=(c20>>8)&0xff, b20=c20&0xff;
  final int a21=(c21>>24)&0xff, r21=(c21>>16)&0xff, g21=(c21>>8)&0xff, b21=c21&0xff;
  final int a22=(c22>>24)&0xff, r22=(c22>>16)&0xff, g22=(c22>>8)&0xff, b22=c22&0xff;
  final int a23=(c23>>24)&0xff, r23=(c23>>16)&0xff, g23=(c23>>8)&0xff, b23=c23&0xff;
  final int a30=(c30>>24)&0xff, r30=(c30>>16)&0xff, g30=(c30>>8)&0xff, b30=c30&0xff;
  final int a31=(c31>>24)&0xff, r31=(c31>>16)&0xff, g31=(c31>>8)&0xff, b31=c31&0xff;
  final int a32=(c32>>24)&0xff, r32=(c32>>16)&0xff, g32=(c32>>8)&0xff, b32=c32&0xff;
  final int a33=(c33>>24)&0xff, r33=(c33>>16)&0xff, g33=(c33>>8)&0xff, b33=c33&0xff;
  final float a = cubic(u, cubic(t, a00, a01, a02, a03), cubic(t, a10, a11, a12, a13),
                           cubic(t, a20, a21, a22, a23), cubic(t, a30, a31, a32, a33));
  final float r = cubic(u, cubic(t, r00, r01, r02, r03), cubic(t, r10, r11, r12, r13),
                           cubic(t, r20, r21, r22, r23), cubic(t, r30, r31, r32, r33));
  final float g = cubic(u, cubic(t, g00, g01, g02, g03), cubic(t, g10, g11, g12, g13),
                           cubic(t, g20, g21, g22, g23), cubic(t, g30, g31, g32, g33));
  final float b = cubic(u, cubic(t, b00, b01, b02, b03), cubic(t, b10, b11, b12, b13),
                           cubic(t, b20, b21, b22, b23), cubic(t, b30, b31, b32, b33));
  return bsat(a) << 24 | bsat(r) << 16 | bsat(g) << 8 | bsat(b);
}

Quite a mouthful, but I arrived at it by methodical and meticulous columnar editing. Now everything is ready to tackle the problem of hoisting the conditional out of the inner loop and deriving three different, fully streamlined interpolators. Until next time.

2012-06-27

2D Interpolation, Part 3: Linear Array Accesses

Last time I hoisted all accesses to the source array. This opens the door to processing linear arrays that represent a matrix in row-major order by stenciling. Not only that, but I was able to completely eliminate the need to have a bordered array on input by explicitly replicating elements as needed. The first step is to use a row buffer into which to copy the elements to be processed. Instead of stenciling by indexing into the source array, I'll index into this row buffer. The code is identical to the one presented last time, except for the initialization, done by copying, and the indexing. Instead of four arrays p0, p1, p2 and p3 I use just one containing four rows back to front; this requires computing the indices by hand. To bring in the next row to interpolate, a pair of arrayCopys makes room in the row buffer and assigns the new row in the opened space. Now both yi and xi cease to represent the index of the stencil over the source array, and I'm free to simplify the index arithmetic by offsetting their starting values:

void interp0() {
  final float[] p = new float[4 * nx];
  for (int i = 0; i < 4; i++)
    arrayCopy(data[i], 0, p, i*nx, nx);
  int yr = 0;
  int yi = 3;
  int offset = 0;
  for (int i = 0; i < height; i++) {
    final float yf = (float) yr / (float) height;
    float c00 = p[0*nx+0], c01 = p[0*nx+1], c02 = p[0*nx+2], c03 = p[0*nx+3];
    float c10 = p[1*nx+0], c11 = p[1*nx+1], c12 = p[1*nx+2], c13 = p[1*nx+3];
    float c20 = p[2*nx+0], c21 = p[2*nx+1], c22 = p[2*nx+2], c23 = p[2*nx+3];
    float c30 = p[3*nx+0], c31 = p[3*nx+1], c32 = p[3*nx+2], c33 = p[3*nx+3];
    int xr = 0;
    int xi = 3;
    for (int j = 0; j < width; j++) {
      final float xf = (float) xr / (float) width;
      float c = 0;
      switch (method) {
      case NEAREST:
        c = yf < 0.5 ? xf < 0.5 ? c11 : c12 : xf < 0.5 ? c21 : c22;
        break;
      case BILINEAR:
        c = linear(yf,
            linear(xf, c11, c12),
            linear(xf, c21, c22) );
        break;
      case BICUBIC:
        c = cubic(yf,
            cubic(xf, c00, c01, c02, c03),
            cubic(xf, c10, c11, c12, c13),
            cubic(xf, c20, c21, c22, c23),
            cubic(xf, c30, c31, c32, c33) );
        break;
      }
      pixels[offset++] = hue(c);
      xr += nx - 3;
      if (xr >= width) {
        xr -= width;
        xi += 1;
        if (xi < nx) {
          c00 = c01; c01 = c02; c02 = c03; c03 = p[0*nx+xi];
          c10 = c11; c11 = c12; c12 = c13; c13 = p[1*nx+xi];
          c20 = c21; c21 = c22; c22 = c23; c23 = p[2*nx+xi];
          c30 = c31; c31 = c32; c32 = c33; c33 = p[3*nx+xi];
        }
      }
    }
    yr += ny - 3;
    if (yr >= height) {
      yr -= height;
      yi += 1;
      if (yi < ny) {
        arrayCopy(p, nx, p, 0, 3*nx);
        arrayCopy(data[yi], 0, p, 3*nx, nx);
      }
    }
  }
}

Processing a matrix as a linear array in row-major order is a matter of adjusting the sources to the arrayCopys. I've renamed the source array as srcpix and its dimensions as nrows and ncols so that later I can modify the code to avoid the need for an explicit border without breaking the existing interp1():

final float[] srcpix = {
  0.60, 0.60, 0.48, 0.24, 0.60, 0.60,
  0.60, 0.60, 0.48, 0.24, 0.60, 0.60,
  0.00, 0.00, 0.36, 0.12, 0.48, 0.48,
  0.24, 0.24, 0.48, 0.60, 0.12, 0.12,
  0.12, 0.12, 0.24, 0.48, 0.36, 0.36,
  0.12, 0.12, 0.24, 0.48, 0.36, 0.36
};
final int nrows = 6;
final int ncols = 6;

void interp0() {
  final float[] p = new float[4 * ncols];
  arrayCopy(srcpix, 0, p, 0, 4 * ncols);
  int yr = 0;
  int yi = 3;
  int offset = 0;
  for (int i = 0; i < height; i++) {
    final float yf = (float) yr / (float) height;
    float c00 = p[0*ncols+0], c01 = p[0*ncols+1], c02 = p[0*ncols+2], c03 = p[0*ncols+3];
    float c10 = p[1*ncols+0], c11 = p[1*ncols+1], c12 = p[1*ncols+2], c13 = p[1*ncols+3];
    float c20 = p[2*ncols+0], c21 = p[2*ncols+1], c22 = p[2*ncols+2], c23 = p[2*ncols+3];
    float c30 = p[3*ncols+0], c31 = p[3*ncols+1], c32 = p[3*ncols+2], c33 = p[3*ncols+3];
    int xr = 0;
    int xi = 3;
    for (int j = 0; j < width; j++) {
      final float xf = (float) xr / (float) width;
      float c = 0;
      switch (method) {
      case NEAREST:
        c = yf < 0.5 ? xf < 0.5 ? c11 : c12 : xf < 0.5 ? c21 : c22;
        break;
      case BILINEAR:
        c = linear(yf,
            linear(xf, c11, c12),
            linear(xf, c21, c22) );
        break;
      case BICUBIC:
        c = cubic(yf,
            cubic(xf, c00, c01, c02, c03),
            cubic(xf, c10, c11, c12, c13),
            cubic(xf, c20, c21, c22, c23),
            cubic(xf, c30, c31, c32, c33) );
        break;
      }
      pixels[offset++] = hue(c);
      xr += ncols - 3;
      if (xr >= width) {
        xr -= width;
        xi += 1;
        if (xi < ncols) {
          c00 = c01; c01 = c02; c02 = c03; c03 = p[0*ncols+xi];
          c10 = c11; c11 = c12; c12 = c13; c13 = p[1*ncols+xi];
          c20 = c21; c21 = c22; c22 = c23; c23 = p[2*ncols+xi];
          c30 = c31; c31 = c32; c32 = c33; c33 = p[3*ncols+xi];
        }
      }
    }
    yr += nrows - 3;
    if (yr >= height) {
      yr -= height;
      yi += 1;
      if (yi < nrows) {
        arrayCopy(p, ncols, p, 0, 3*ncols);
        arrayCopy(srcpix, yi*ncols, p, 3*ncols, ncols);
      }
    }
  }
}

Now in a slow-motion approach towards eliminating the repeated border, I'll first reduce the array dimensions to their true value. In the preceding code I substitute nrows := nrows−2 and ncols := ncols−2, and simplify all indices:

final int nrows = 4;
final int ncols = 4;

void interp0() {
  final float[] p = new float[4*(ncols+2)];
  arrayCopy(srcpix, 0, p, 0, 4*(ncols+2));
  int yr = 0;
  int yi = 3;
  int offset = 0;
  for (int i = 0; i < height; i++) {
    final float yf = (float) yr / (float) height;
    float c00 = p[0*ncols+0], c01 = p[0*ncols+1], c02 = p[0*ncols+2], c03 = p[0*ncols+3];
    float c10 = p[1*ncols+2], c11 = p[1*ncols+3], c12 = p[1*ncols+4], c13 = p[1*ncols+5];
    float c20 = p[2*ncols+4], c21 = p[2*ncols+5], c22 = p[2*ncols+6], c23 = p[2*ncols+7];
    float c30 = p[3*ncols+6], c31 = p[3*ncols+7], c32 = p[3*ncols+8], c33 = p[3*ncols+9];
    int xr = 0;
    int xi = 3;
    for (int j = 0; j < width; j++) {
      final float xf = (float) xr / (float) width;
      float c = 0;
      switch (method) {
      case NEAREST:
        c = yf < 0.5 ? xf < 0.5 ? c11 : c12 : xf < 0.5 ? c21 : c22;
        break;
      case BILINEAR:
        c = linear(yf,
            linear(xf, c11, c12),
            linear(xf, c21, c22) );
        break;
      case BICUBIC:
        c = cubic(yf,
            cubic(xf, c00, c01, c02, c03),
            cubic(xf, c10, c11, c12, c13),
            cubic(xf, c20, c21, c22, c23),
            cubic(xf, c30, c31, c32, c33) );
        break;
      }
      pixels[offset++] = hue(c);
      xr += ncols - 1;
      if (xr >= width) {
        xr -= width;
        xi += 1;
        if (xi < ncols+2) {
          c00 = c01; c01 = c02; c02 = c03; c03 = p[0*ncols+xi+0];
          c10 = c11; c11 = c12; c12 = c13; c13 = p[1*ncols+xi+2];
          c20 = c21; c21 = c22; c22 = c23; c23 = p[2*ncols+xi+4];
          c30 = c31; c31 = c32; c32 = c33; c33 = p[3*ncols+xi+6];
        }
      }
    }
    yr += nrows - 1;
    if (yr >= height) {
      yr -= height;
      yi += 1;
      if (yi < nrows+2) {
        arrayCopy(p, (ncols+2), p, 0, 3*(ncols+2));
        arrayCopy(srcpix, yi*(ncols+2), p, 3*ncols+6, ncols+2);
      }
    }
  }
}

In the second part of the slo-mo maneuver, I'll eliminate the repeated values from the source matrix. Every row repeats the first and last element, and both the first and the last rows are repeated in turn. I achieve this by careful stuffing into the row buffer:

final float[] srcpix = {
0.60, 0.48, 0.24, 0.60,
0.00, 0.36, 0.12, 0.48,
0.24, 0.48, 0.60, 0.12,
0.12, 0.24, 0.48, 0.36,
};
final int nrows = 4;
final int ncols = 4;

void interp0() {
  final float[] p = new float[4*ncols+8];
  p[1*ncols+2] = srcpix[0*ncols]; arrayCopy(srcpix, 0*ncols, p, 1*ncols+3, ncols); p[2*ncols+3] = srcpix[1*ncols-1];
  p[2*ncols+4] = srcpix[1*ncols]; arrayCopy(srcpix, 1*ncols, p, 2*ncols+5, ncols); p[3*ncols+5] = srcpix[2*ncols-1];
  p[3*ncols+6] = srcpix[2*ncols]; arrayCopy(srcpix, 2*ncols, p, 3*ncols+7, ncols); p[4*ncols+7] = srcpix[3*ncols-1];
  arrayCopy(p, 1*ncols+2, p, 0, ncols+2);
  int yr = 0;
  int yi = 2;
  int offset = 0;
  for (int i = 0; i < height; i++) {
    final float yf = (float) yr / (float) height;
    float c00 = p[0*ncols+0], c01 = p[0*ncols+1], c02 = p[0*ncols+2], c03 = p[0*ncols+3];
    float c10 = p[1*ncols+2], c11 = p[1*ncols+3], c12 = p[1*ncols+4], c13 = p[1*ncols+5];
    float c20 = p[2*ncols+4], c21 = p[2*ncols+5], c22 = p[2*ncols+6], c23 = p[2*ncols+7];
    float c30 = p[3*ncols+6], c31 = p[3*ncols+7], c32 = p[3*ncols+8], c33 = p[3*ncols+9];
    int xr = 0;
    int xi = 3;
    for (int j = 0; j < width; j++) {
      final float xf = (float) xr / (float) width;
      float c = 0;
      switch (method) {
      case NEAREST:
        c = yf < 0.5 ? xf < 0.5 ? c11 : c12 : xf < 0.5 ? c21 : c22;
        break;
      case BILINEAR:
        c = linear(yf,
            linear(xf, c11, c12),
            linear(xf, c21, c22) );
        break;
      case BICUBIC:
        c = cubic(yf,
            cubic(xf, c00, c01, c02, c03),
            cubic(xf, c10, c11, c12, c13),
            cubic(xf, c20, c21, c22, c23),
            cubic(xf, c30, c31, c32, c33) );
        break;
      }
      pixels[offset++] = hue(c);
      xr += ncols - 1;
      if (xr >= width) {
        xr -= width;
        xi += 1;
        if (xi < ncols+2) {
          c00 = c01; c01 = c02; c02 = c03; c03 = p[0*ncols+xi+0];
          c10 = c11; c11 = c12; c12 = c13; c13 = p[1*ncols+xi+2];
          c20 = c21; c21 = c22; c22 = c23; c23 = p[2*ncols+xi+4];
          c30 = c31; c31 = c32; c32 = c33; c33 = p[3*ncols+xi+6];
        }
      }
    }
    yr += nrows - 1;
    if (yr >= height) {
      yr -= height;
      yi += 1;
      if (yi <= nrows) {
        arrayCopy(p, (ncols+2), p, 0, 3*(ncols+2));
        if (yi < nrows) {
          p[3*ncols+6] = srcpix[yi*ncols];
          arrayCopy(srcpix, yi*ncols, p, 3*ncols+7, ncols);
          p[4*ncols+7] = srcpix[(yi+1)*ncols-1];
        }
      }
    }
  }
}

Note that during initialization the first row is duplicated by an explicit copy. Note also that the inner loop doesn't change at all. The outer loop does: since the first duplicated row was eliminated, yi initially is 2, not 3; since the last duplicated row was eliminated, the new bound for yi is nrows, not nrows + 2, and the conditions simplify accordingly. Since the last row is repeated, it must be processed even when yi is exactly nrows. Finally, since the new row must repeat its first and last elements, I must use the same schema as for the initialization. In the case of the repeated last row, the first arrayCopy moves the last three rows to the front; by doing nothing more, the very last row ends up occupying the last two consecutive strides of the row buffer.
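
To make the stuffing concrete, with the 4×4 source above (call its elements s0 through s15 in row-major order) the freshly initialized row buffer p, of stride ncols + 2 = 6, is laid out as follows (my own annotation):

buffer row 0: p[ 0.. 5] = s0  s0  s1  s2  s3   s3    (copy of buffer row 1, i.e. the first source row repeated)
buffer row 1: p[ 6..11] = s0  s0  s1  s2  s3   s3    (source row 0, ends repeated)
buffer row 2: p[12..17] = s4  s4  s5  s6  s7   s7    (source row 1, ends repeated)
buffer row 3: p[18..23] = s8  s8  s9  s10 s11  s11   (source row 2, ends repeated)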

Believe it or not, I consider this code rather compact. Normally, the special case of repeated elements is handled by unrolling the loops and placing a loop header and a loop footer specialized to the appropriate values. By using the row buffer, I can avoid duplicating code, at a cost of O(N) storage. As it is, the only transformation required would be to extract the switch from the inner loop into separate interpolators and simplify the nearest-neighbor and bilinear cases by eliminating the unneeded elements. Before doing that, it would be best to tackle the problem of processing ARGB sources, by unpacking the channels in the source pixels, interpolating each and repacking them into a destination pixel. That can be done neatly with the row buffers for maximal inner-loop speed. Until the next time!

2012-06-26

2D Interpolation, Part 2: Minimizing Array Accesses

Last time I massaged the interpolator to avoid recomputing, at every pixel, the linear transformation from destination space to source space, using only integer variables. With that rewriting in place, it is time to avoid referencing source values more often than needed. First thing we do, let's name all the elements:

void interp0() {
  int offset = 0;
  int yn = 0;
  int yi = 1;
  for (int i = 0; i < height; i++) {
    final float yr = (float) yn / (float) height;
    int xn = 0;
    int xi = 1;
    for (int j = 0; j < width; j++) {
      final float xr = (float) xn / (float) width;
      float c00 = data[yi-1][xi-1], c01 = data[yi-1][xi+0], c02 = data[yi-1][xi+1], c03 = data[yi-1][xi+2];
      float c10 = data[yi+0][xi-1], c11 = data[yi+0][xi+0], c12 = data[yi+0][xi+1], c13 = data[yi+0][xi+2];
      float c20 = data[yi+1][xi-1], c21 = data[yi+1][xi+0], c22 = data[yi+1][xi+1], c23 = data[yi+1][xi+2];
      float c30 = data[yi+2][xi-1], c31 = data[yi+2][xi+0], c32 = data[yi+2][xi+1], c33 = data[yi+2][xi+2];
      float c = 0;
      switch (method) {
      case NEAREST:
        c = yr < 0.5 ? xr < 0.5 ? c11 : c12 : xr < 0.5 ? c21 : c22;
        break;
      case BILINEAR:
        c = linear(yr,
            linear(xr, c11, c12),
            linear(xr, c21, c22) );
        break;
      case BICUBIC:
        c = cubic(yr,
            cubic(xr, c00, c01, c02, c03),
            cubic(xr, c10, c11, c12, c13),
            cubic(xr, c20, c21, c22, c23),
            cubic(xr, c30, c31, c32, c33) );
        break;
      }
      pixels[offset++] = hue(c);
      xn += nx - 3;
      if (xn >= width) {
        xn -= width;
        xi += 1;
      }
    }
    yn += ny - 3;
    if (yn >= height) {
      yn -= height;
      yi += 1;
    }
  }
}

Two things are of note: first, I name all 16 elements needed by bicubic interpolation even when nearest-neighbor or bilinear is requested. Second, in the nearest-neighbor case I had to transform an index computation involving a float-to-integer rounding into explicit conditionals: (int) (yr + 0.5) is 0 when yr < 0.5 and 1 otherwise, which selects between the c1x and c2x rows, and likewise for xr and the columns. Such is the price of progress.

Since yi, xi don't change at each step, it is possible to just reference new elements when they do, and reuse values in the mean time. First I will reuse the columns. In order to do that I need to hoist the assignments outside the inner loop, in effect converting them to pure initializations, and arrange for the variables to update when xi is incremented:

void interp0() {
  int offset = 0;
  int yn = 0;
  int yi = 1;
  for (int i = 0; i < height; i++) {
    final float yr = (float) yn / (float) height;
    int xn = 0;
    int xi = 1;
    float c00 = data[yi-1][0], c01 = data[yi-1][1], c02 = data[yi-1][2], c03 = data[yi-1][3];
    float c10 = data[yi+0][0], c11 = data[yi+0][1], c12 = data[yi+0][2], c13 = data[yi+0][3];
    float c20 = data[yi+1][0], c21 = data[yi+1][1], c22 = data[yi+1][2], c23 = data[yi+1][3];
    float c30 = data[yi+2][0], c31 = data[yi+2][1], c32 = data[yi+2][2], c33 = data[yi+2][3];
    for (int j = 0; j < width; j++) {
      final float xr = (float) xn / (float) width;
      float c = 0;
      switch (method) {
      case NEAREST:
        c = yr < 0.5 ? xr < 0.5 ? c11 : c12 : xr < 0.5 ? c21 : c22;
        break;
      case BILINEAR:
        c = linear(yr,
            linear(xr, c11, c12),
            linear(xr, c21, c22) );
        break;
      case BICUBIC:
        c = cubic(yr,
            cubic(xr, c00, c01, c02, c03),
            cubic(xr, c10, c11, c12, c13),
            cubic(xr, c20, c21, c22, c23),
            cubic(xr, c30, c31, c32, c33) );
        break;
      }
      pixels[offset++] = hue(c);
      xn += nx - 3;
      if (xn >= width) {
        xn -= width;
        xi += 1;
        if (xi < nx - 2) {
          c00 = c01; c01 = c02; c02 = c03; c03 = data[yi-1][xi+2];
          c10 = c11; c11 = c12; c12 = c13; c13 = data[yi+0][xi+2];
          c20 = c21; c21 = c22; c22 = c23; c23 = data[yi+1][xi+2];
          c30 = c31; c31 = c32; c32 = c33; c33 = data[yi+2][xi+2];
        }
      }
    }
    yn += ny - 3;
    if (yn >= height) {
      yn -= height;
      yi += 1;
    }
  }
}

The initialization inlines the constant value xi = 1 in the second index. The update shifts the values around in classic bucket-brigade fashion, and fetches fresh values for the new column to process. Note that there is a subtlety involving the range check: the last iteration for j leaves xi = nx−2, but at that point the code cannot yet know that the loop is about to terminate, since logically the termination test should come before updating xn, xi and the array elements. I opt here to conditionalize the update to keep the for loop structured; feel free to break out of the loop instead.

Now it's time to reuse the rows. This is made simpler by the fact that the source matrix is an array of arrays. First I introduce row variables pointing to each active row, and rename the element accesses accordingly:

void interp0() {
  int offset = 0;
  int yn = 0;
  int yi = 1;
  for (int i = 0; i < height; i++) {
    final float yr = (float) yn / (float) height;
    int xn = 0;
    int xi = 1;
    float[] p0 = data[yi-1], p1 = data[yi+0], p2 = data[yi+1], p3 = data[yi+2];
    float c00 = p0[0], c01 = p0[1], c02 = p0[2], c03 = p0[3];
    float c10 = p1[0], c11 = p1[1], c12 = p1[2], c13 = p1[3];
    float c20 = p2[0], c21 = p2[1], c22 = p2[2], c23 = p2[3];
    float c30 = p3[0], c31 = p3[1], c32 = p3[2], c33 = p3[3];
    for (int j = 0; j < width; j++) {
      final float xr = (float) xn / (float) width;
      float c = 0;
      switch (method) {
      case NEAREST:
        c = yr < 0.5 ? xr < 0.5 ? c11 : c12 : xr < 0.5 ? c21 : c22;
        break;
      case BILINEAR:
        c = linear(yr,
            linear(xr, c11, c12),
            linear(xr, c21, c22) );
        break;
      case BICUBIC:
        c = cubic(yr,
            cubic(xr, c00, c01, c02, c03),
            cubic(xr, c10, c11, c12, c13),
            cubic(xr, c20, c21, c22, c23),
            cubic(xr, c30, c31, c32, c33) );
        break;
      }
      pixels[offset++] = hue(c);
      xn += nx - 3;
      if (xn >= width) {
        xn -= width;
        xi += 1;
        if (xi < nx - 2) {
          c00 = c01; c01 = c02; c02 = c03; c03 = p0[xi+2];
          c10 = c11; c11 = c12; c12 = c13; c13 = p1[xi+2];
          c20 = c21; c21 = c22; c22 = c23; c23 = p2[xi+2];
          c30 = c31; c31 = c32; c32 = c33; c33 = p3[xi+2];
        }
      }
    }
    yn += ny - 3;
    if (yn >= height) {
      yn -= height;
      yi += 1;
    }
  }
}

Next I hoist the row assignments out of the outer loop, as I did before:

void interp0() {
  float[] p0 = data[0], p1 = data[1], p2 = data[2], p3 = data[3];
  int yn = 0;
  int yi = 1;
  int offset = 0;
  for (int i = 0; i < height; i++) {
    final float yr = (float) yn / (float) height;
    float c00 = p0[0], c01 = p0[1], c02 = p0[2], c03 = p0[3];
    float c10 = p1[0], c11 = p1[1], c12 = p1[2], c13 = p1[3];
    float c20 = p2[0], c21 = p2[1], c22 = p2[2], c23 = p2[3];
    float c30 = p3[0], c31 = p3[1], c32 = p3[2], c33 = p3[3];
    int xn = 0;
    int xi = 1;
    for (int j = 0; j < width; j++) {
      final float xr = (float) xn / (float) width;
      float c = 0;
      switch (method) {
      case NEAREST:
        c = yr < 0.5 ? xr < 0.5 ? c11 : c12 : xr < 0.5 ? c21 : c22;
        break;
      case BILINEAR:
        c = linear(yr,
            linear(xr, c11, c12),
            linear(xr, c21, c22) );
        break;
      case BICUBIC:
        c = cubic(yr,
            cubic(xr, c00, c01, c02, c03),
            cubic(xr, c10, c11, c12, c13),
            cubic(xr, c20, c21, c22, c23),
            cubic(xr, c30, c31, c32, c33) );
        break;
      }
      pixels[offset++] = hue(c);
      xn += nx - 3;
      if (xn >= width) {
        xn -= width;
        xi += 1;
        if (xi < nx - 2) {
          c00 = c01; c01 = c02; c02 = c03; c03 = p0[xi+2];
          c10 = c11; c11 = c12; c12 = c13; c13 = p1[xi+2];
          c20 = c21; c21 = c22; c22 = c23; c23 = p2[xi+2];
          c30 = c31; c31 = c32; c32 = c33; c33 = p3[xi+2];
        }
      }
    }
    yn += ny - 3;
    if (yn >= height) {
      yn -= height;
      yi += 1;
      if (yi < ny - 2) {
        p0 = p1; p1 = p2; p2 = p3; p3 = data[yi+2];
      }
    }
  }
}

Here I'm presented with exactly the same situation regarding the last iteration as with the columns, and faced with the same choice. At this point I could call it quits and be satisfied with a reasonable implementation; I'll press further. Next time I'll convert the interpolator to operate on a linear array, opening the door to manipulating graphics directly.

2012-06-25

2D Interpolation, Part 1: The Digital Differential Analyzer

To my Planet OCaml readers, I apologize for veering into Java. I've been playing with Processing lately, because it's far easier to prototype silly, simple animations (fractals! Fractals everywhere!) in it than by using OCamlSDL, say. Last time I showed a basic 2D interpolator; now it's time to start massaging it into shape. Let's recall the original code:

void interp0() {
  final float dy = (float) (ny - 3) / height;
  final float dx = (float) (nx - 3) / width;
  int offset = 0;
  for (int i = 0; i < height; i++) {
    float yr = 1 + dy * i;
    int yi = (int) yr;
    yr -= yi;
    for (int j = 0; j < width; j++) {
      float xr = 1 + dx * j;
      int xi = (int) xr;
      xr -= xi;
      float c = 0;
      switch (method) {
      case NEAREST:
        c = data[yi + (int) (yr + 0.5)][xi + (int) (xr + 0.5)];
        break;
      case BILINEAR:
        c = linear(yr,
            linear(xr, data[yi+0][xi+0], data[yi+0][xi+1]),
            linear(xr, data[yi+1][xi+0], data[yi+1][xi+1]) );
        break;
      case BICUBIC:
        c = cubic(yr,
            cubic(xr, data[yi-1][xi-1], data[yi-1][xi+0], data[yi-1][xi+1], data[yi-1][xi+2]),
            cubic(xr, data[yi+0][xi-1], data[yi+0][xi+0], data[yi+0][xi+1], data[yi+0][xi+2]),
            cubic(xr, data[yi+1][xi-1], data[yi+1][xi+0], data[yi+1][xi+1], data[yi+1][xi+2]),
            cubic(xr, data[yi+2][xi-1], data[yi+2][xi+0], data[yi+2][xi+1], data[yi+2][xi+2]) );
        break;
      }
      pixels[offset++] = hue(c);
    }
  }
}

The coordinate transformations from destination array to source matrix involve a linear transformation s = s0 + k·(d - d0), where k is a floating-point approximation to a rational number. This can be done using the all-integer Digital Differential Analyzer or DDA. First of all, note that there's no reason for yr, xr to start at 1; only yi, xi need to be offset:

void interp0() {
  final float dy = (float) (ny - 3) / height;
  final float dx = (float) (nx - 3) / width;
  int offset = 0;
  for (int i = 0; i < height; i++) {
    float yr = dy * i;
    int yi = (int) yr;
    yr -= yi;
    yi += 1;
    for (int j = 0; j < width; j++) {
      float xr = dx * j;
      int xi = (int) xr;
      xr -= xi;
      xi += 1;
      float c = 0;
      switch (method) {
      case NEAREST:
        c = data[yi + (int) (yr + 0.5)][xi + (int) (xr + 0.5)];
        break;
      case BILINEAR:
        c = linear(yr,
            linear(xr, data[yi+0][xi+0], data[yi+0][xi+1]),
            linear(xr, data[yi+1][xi+0], data[yi+1][xi+1]) );
        break;
      case BICUBIC:
        c = cubic(yr,
            cubic(xr, data[yi-1][xi-1], data[yi-1][xi+0], data[yi-1][xi+1], data[yi-1][xi+2]),
            cubic(xr, data[yi+0][xi-1], data[yi+0][xi+0], data[yi+0][xi+1], data[yi+0][xi+2]),
            cubic(xr, data[yi+1][xi-1], data[yi+1][xi+0], data[yi+1][xi+1], data[yi+1][xi+2]),
            cubic(xr, data[yi+2][xi-1], data[yi+2][xi+0], data[yi+2][xi+1], data[yi+2][xi+2]) );
        break;
      }
      pixels[offset++] = hue(c);
    }
  }
}

At this point you will notice error pixels along some contours. Those are the pixels for which the floating-point quotient (ny−3)/height, resp. (nx−3)/width is not exactly representable, and shifting the floating point values by one makes the error manifest.

Variables yr, xr could start at 0 and be incremented by dy, dx respectively, at the end of the corresponding loop. This is the essence of the DDA. Unfortunately this would accumulate the error inherent in computing dy and dx very quickly.

Introduce a variable yn = yr·height so that yr = yn/height. Every time yr is incremented by dy = (ny−3)/height, yn is incremented by ny−3, which is an integer. Now yi is the integer part of yr, which is to say the integer part of yn/height. This means that yi will increase whenever yn ≥ height; we can keep track of the quotient of i·(ny−3) by height in yi and let yn be the remainder, adjusting it at each step.

Exactly the same reasoning applies to xr and xi by introducing an integer variable xn representing the remainder of j·(nx−3) when divided by width. Making both changes results in yr, xr being simple constants:

void interp0() {
  int offset = 0;
  int yn = 0;
  int yi = 1;
  for (int i = 0; i < height; i++) {
    final float yr = (float) yn / (float) height;
    int xn = 0;
    int xi = 1;
    for (int j = 0; j < width; j++) {
      final float xr = (float) xn / (float) width;
      float c = 0;
      switch (method) {
      case NEAREST:
        c = data[yi + (int) (yr + 0.5)][xi + (int) (xr + 0.5)];
        break;
      case BILINEAR:
        c = linear(yr,
            linear(xr, data[yi+0][xi+0], data[yi+0][xi+1]),
            linear(xr, data[yi+1][xi+0], data[yi+1][xi+1]) );
        break;
      case BICUBIC:
        c = cubic(yr,
            cubic(xr, data[yi-1][xi-1], data[yi-1][xi+0], data[yi-1][xi+1], data[yi-1][xi+2]),
            cubic(xr, data[yi+0][xi-1], data[yi+0][xi+0], data[yi+0][xi+1], data[yi+0][xi+2]),
            cubic(xr, data[yi+1][xi-1], data[yi+1][xi+0], data[yi+1][xi+1], data[yi+1][xi+2]),
            cubic(xr, data[yi+2][xi-1], data[yi+2][xi+0], data[yi+2][xi+1], data[yi+2][xi+2]) );
        break;
      }
      pixels[offset++] = hue(c);
      xn += nx - 3;
      if (xn >= width) {
        xn -= width;
        xi += 1;
      }
    }
    yn += ny - 3;
    if (yn >= height) {
      yn -= height;
      yi += 1;
    }
  }
}
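
As a quick sanity check (my own snippet, not from the original post; it uses the sketch's ny and height and assumes ny − 3 ≤ height), the quotient/remainder bookkeeping can be compared against plain integer division, which is exact:

void checkRowDDA() {
  int yn = 0;
  int yi = 1;
  for (int i = 0; i < height; i++) {
    final int exact = 1 + i * (ny - 3) / height;   // exact integer quotient, no float error
    if (yi != exact)
      println("mismatch at row " + i + ": " + yi + " vs " + exact);
    yn += ny - 3;
    if (yn >= height) {
      yn -= height;
      yi += 1;
    }
  }
}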

Next time I'll show how to put the array accesses on a diet.

2012-06-22

2D Interpolation, Part 0: The Groundwork

(First of a series) My process for going from a textbook implementation of an algorithm to an efficient, production-grade version of it is to methodically apply meaning-preserving transformations that make the code progressively tighter. In this case I'll massage a naïve 2D interpolator that selectively uses nearest-neighbor, bilinear and bicubic interpolation. I started by studying a basic explanation and completed it with the specifics related to bicubic interpolation. Since this is a graphics application, I define meaning-preserving as "agrees on (almost) all pixels", and I carry out the test by visual inspection. I use a straight Processing 2 sketch without any complications or embellishments.

To begin with, my source data will be a simple 4×4 matrix of floats in the form of an array of arrays with elements repeated around the border. This is not only suboptimal on any number of accounts, it doesn't accommodate the overwhelmingly usual case of image stretching; but for now it is sufficient:

final float[][] data = {
  { 0.60, 0.60, 0.48, 0.24, 0.60, 0.60 },
  { 0.60, 0.60, 0.48, 0.24, 0.60, 0.60 },
  { 0.00, 0.00, 0.36, 0.12, 0.48, 0.48 },
  { 0.24, 0.24, 0.48, 0.60, 0.12, 0.12 },
  { 0.12, 0.12, 0.24, 0.48, 0.36, 0.36 },
  { 0.12, 0.12, 0.24, 0.48, 0.36, 0.36 }
};
final int ny = data.length;
final int nx = data[0].length;

Both bilinear and bicubic interpolation schemes are separable in that they proceed by interpolating by rows first and columns last, in effect composing two 1D interpolations into a single 2D one. In turn, linear and cubic interpolation schemes are expressed as polynomials on an interpolating parameter, t, with coefficients depending on the values being interpolated and obeying certain continuity conditions. These polynomials are standard and derived in a number of places:

static float linear(final float t, final float p0, final float p1) {
  return p0 + t*(p1 - p0);
}

static float cubic(final float t, final float p0, final float p1, final float p2, final float p3) {
  return p1 + 0.5*t*(p2 - p0 + t*(2*p0 - 5*p1 + 4*p2 - p3 + t*(3*(p1 - p2) + p3 - p0)));
}
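
A quick property worth checking (my own snippet, not in the original post): the cubic passes through its two middle control points, which is what makes adjacent patches join without jumps:

void checkCubicEndpoints() {
  // four consecutive values taken from the first column of data
  final float p0 = 0.60, p1 = 0.00, p2 = 0.24, p3 = 0.12;
  println(cubic(0, p0, p1, p2, p3));   // prints p1, exactly 0.0
  println(cubic(1, p0, p1, p2, p3));   // prints p2, up to float rounding
}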

The functions (ab-)use final and static to convey the impression that they are pure functions (not really, it's just to give HotSpot the chance to inline them). In order to visualize floating-point numbers I use a rainbow palette:

static color hue(float h) {
  final int d, e;
  h = 6 * (h - (float) Math.floor(h));
  e = (int) Math.floor(h);
  h -= e;
  h = (10 - (15 - 6 * h) * h) * h * h * h;
  d = (int) (255 * h);
  switch (e) {
  case 0: return 0xffff0000 |      d  <<  8;
  case 1: return 0xff00ff00 | (255-d) << 16;
  case 2: return 0xff00ff00 |      d;
  case 3: return 0xff0000ff | (255-d) <<  8;
  case 4: return 0xff0000ff |      d  << 16;
  case 5: return 0xffff0000 | (255-d);
  }
  return 0;
}

Why don't I just use Processing's HSB color mode? Two reasons: first and foremost, to use smootherstep to interpolate around the color circle (this eliminates Mach banding and gives prominence to secondary colors); second, so that code that uses it can be made independent of Processing.
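
For reference, the smoothing curve buried inside hue() (and reused by the bsat() of Part 4) is the standard smootherstep polynomial; factored out on its own it would read like this (a sketch of mine — the post keeps it inlined):

static float smootherstep(final float t) {
  return (10 - (15 - 6 * t) * t) * t * t * t;   // 6t^5 - 15t^4 + 10t^3
}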

The sketch will wait for mouse clicks to react, so the set up explicitly turns off looping:

void setup() {
  size(400, 400);
  noLoop();
}

Every mouse click changes the interpolation scheme in a cyclic fashion:

final static int NEAREST  = 0;
final static int BILINEAR = 1;
final static int BICUBIC  = 2;
int method = 0;

void mouseClicked() {
  method++;
  if (method > BICUBIC)
    method = NEAREST;
  redraw();
}

The sketch will draw directly into the pixels plane, in two passes: the first one will be the current version of the interpolator, and the second will be a baseline version that blacks out identical pixels to highlight differences.

void draw() {
  loadPixels();
  interp0();
  interp1();
  updatePixels();
}

The baseline interpolator, version 0, proceeds in row-major order (first by rows, then by columns) on the destination array pixels to avoid gaps. For each destination pixel it calculates the corresponding coordinate in the source matrix as a real number (as opposed to integer, not to imaginary) and separates it into an integer part and a fractional part. It then computes an interpolated value from the source data according to method, computes a color and assigns the pixel.

void interp0() {
  final float dy = (float) (ny - 3) / height;
  final float dx = (float) (nx - 3) / width;
  int offset = 0;
  for (int i = 0; i < height; i++) {
    float yr = 1 + dy * i;
    int yi = (int) yr;
    yr -= yi;
    for (int j = 0; j < width; j++) {
      float xr = 1 + dx * j;
      int xi = (int) xr;
      xr -= xi;
      float c = 0;
      switch (method) {
      case NEAREST:
        c = data[yi + (int) (yr + 0.5)][xi + (int) (xr + 0.5)];
        break;
      case BILINEAR:
        c = linear(yr,
            linear(xr, data[yi+0][xi+0], data[yi+0][xi+1]),
            linear(xr, data[yi+1][xi+0], data[yi+1][xi+1]) );
        break;
      case BICUBIC:
        c = cubic(yr,
            cubic(xr, data[yi-1][xi-1], data[yi-1][xi+0], data[yi-1][xi+1], data[yi-1][xi+2]),
            cubic(xr, data[yi+0][xi-1], data[yi+0][xi+0], data[yi+0][xi+1], data[yi+0][xi+2]),
            cubic(xr, data[yi+1][xi-1], data[yi+1][xi+0], data[yi+1][xi+1], data[yi+1][xi+2]),
            cubic(xr, data[yi+2][xi-1], data[yi+2][xi+0], data[yi+2][xi+1], data[yi+2][xi+2]) );
        break;
      }
      pixels[offset++] = hue(c);
    }
  }
}

Every destination pixel with coordinate (i, j) maps to a source element (i*(rows - 1)/height, j*(cols - 1)/width) by the rule of three. Since the source matrix already has a border, the mapping is adjusted by substituting cols = nx - 2 and rows = ny - 2 and adding 1 to the coordinates. The fractional part of each coordinate indicates "how far along" the neighboring pixels the interpolation should go. The comparison is made against this same algorithm in function interp1(), where the pixel assignment at the end of the inner loop is changed to:

      if (pixels[offset] == hue(c))
        pixels[offset] = 0xff000000;
      offset++;

The results are:

nearest-neighbor interpolation

for the nearest-neighbor interpolation,

bilinear interpolation

for the bilinear interpolation, and:

bicubic interpolation

for the bicubic interpolation. I know, the colors are toxic. Of course, with the draw() that I've given above the result should be solid black for each mouse click; to see the interpolations in action, comment out the call to interp1().

In the next post I'll show a number of massages performed on interp0().