Fast Gaussian Blur via Resampling

By Andrew Adams, Karima Ma, and Claude Code

A large Gaussian blur can be computed cheaply by downsampling the image, blurring at low resolution, and upsampling back. This page lets you explore the accuracy of this approximation for different choices of downsample factor, and prefilter/interpolator order.

The method works as follows: for a target blur with standard deviation \sigma, first downsample by a factor f using a prefilter of order d. Then blur at low resolution with a Gaussian of standard deviation \sigma_\text{lo} = \sqrt{\sigma^2 - V_\text{up} - V_\text{down}} \;/\; f, where V_\text{up} and V_\text{down} are the variances of the interpolator and prefilter respectively. Finally, upsample with an interpolator of order u.

The prefilter and interpolator of order o at factor f are constructed as the convolution of o box filters of size f and o - 1 copies of [1,\;1], giving a filter of width o \cdot f. The prefilter is normalized to sum to 1. The interpolator is normalized so that its polyphase components each sum to 1 (equivalently, it sums to f).
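As a concrete sketch of this construction (using NumPy; the function and variable names here are our own illustration, not from the paper):

```python
import numpy as np

def resampling_filter(order, factor, interpolator=False):
    """Order-`order` resampling filter at rate `factor`: the convolution of
    `order` boxes of width `factor` with `order - 1` copies of [1, 1],
    giving a filter of width order * factor. The prefilter is normalized
    to sum to 1; the interpolator to sum to `factor`."""
    h = np.ones(factor)
    for _ in range(order - 1):
        h = np.convolve(h, np.ones(factor))
    for _ in range(order - 1):
        h = np.convolve(h, [1.0, 1.0])
    return h * ((factor if interpolator else 1.0) / h.sum())

prefilter = resampling_filter(3, 8)                        # width 24, sums to 1
interpolator = resampling_filter(3, 8, interpolator=True)  # width 24, sums to 8
```

The polyphase-sum property falls out of the box factors: each box of width f has zeros at every non-DC f-th root of unity, which forces all f polyphase components of the interpolator to have equal sums.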

Use the controls below to vary the parameters and observe the impulse response of the approximation (red) overlaid on the true Gaussian (blue). The Pareto plot below shows the quality–cost tradeoff across all parameter combinations.

Filtering by resampling has a long history. Like Andrew, these ideas come from the early 80s. There are many ways to do it — see for example Burt and Adelson's "The Laplacian Pyramid as a Compact Image Code" (1983), or more recently Convolution Pyramids by Farbman et al. (2011). However, if you want the best performance and quality, precisely how you perform the resampling and low-resolution filtering matters a lot.

This opinion comes from our work "Finding Fast Filters", to appear at SIGGRAPH 2026 (link to come), in which we automatically searched a large space of filter implementations and found that the most impactful way to make a Gaussian blur fast is by operating at a single \sigma-dependent reduced resolution. Pyramids are unnecessary. You can make it faster still by doing the low-resolution blur using tail-cancelling IIRs (Wang & Smith 1997) and frequency response masking (Lim 1986) — this matters most at low \sigma, where the downsampling factor is small. You can also make it slightly more accurate by tuning the filters with gradient descent. For the purpose of this page we’ll avoid such complexities.

TL;DR: For a very accurate fast Gaussian blur, downsample by a factor of roughly \sigma/2 using 3rd-order prefilter and interpolator. For large \sigma this achieves high PSNR at a cost of under 8 multiplies per output pixel, regardless of \sigma.
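To make the TL;DR concrete, here is a minimal 1D sketch of the whole pipeline in NumPy (our own illustration, not the paper's implementation), applied to an impulse and compared against a directly computed Gaussian:

```python
import numpy as np

def resampling_filter(order, factor):
    # order boxes of width `factor`, then order - 1 copies of [1, 1].
    h = np.ones(factor)
    for _ in range(order - 1):
        h = np.convolve(h, np.ones(factor))
    for _ in range(order - 1):
        h = np.convolve(h, [1.0, 1.0])
    return h / h.sum()  # sums to 1; scale by `factor` for the interpolator

def gaussian(sigma):
    r = int(np.ceil(4 * sigma))
    g = np.exp(-np.arange(-r, r + 1) ** 2 / (2 * sigma ** 2))
    return g / g.sum()

def variance(h):
    x = np.arange(len(h))
    m = (h * x).sum() / h.sum()
    return (h * (x - m) ** 2).sum() / h.sum()

sigma, f = 16.0, 8                    # f = sigma / 2, the recommended factor
d = resampling_filter(3, f)           # 3rd-order prefilter, sums to 1
u = resampling_filter(3, f) * f       # 3rd-order interpolator, sums to f

# Low-res blur: subtract the resampling filters' variance, rescale to low res.
sigma_lo = np.sqrt(sigma ** 2 - variance(d) - variance(u)) / f

x = np.zeros(512)
x[256] = 1.0                          # impulse input

lo = np.convolve(x, d)[::f]           # prefilter, then keep every f-th sample
lo = np.convolve(lo, gaussian(sigma_lo))  # blur at low resolution

hi = np.zeros(len(lo) * f)
hi[::f] = lo                          # zero-fill upsample...
y = np.convolve(hi, u)                # ...then interpolate

ref = np.convolve(x, gaussian(sigma))  # ground-truth blur for comparison

# Align the two responses at their peaks and measure relative L2 error.
W = 200
yy = y[np.argmax(y) - W : np.argmax(y) + W + 1]
rr = ref[np.argmax(ref) - W : np.argmax(ref) + W + 1]
rel_err = np.linalg.norm(yy - rr) / np.linalg.norm(rr)
```

The full convolutions introduce an overall shift, so the comparison aligns the peaks before differencing; a production implementation would bake the offsets into the resampling indices instead.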

Prefilter:
Interpolator:
Low-res blur sigma:
Low-res blur diameter:
Average multiplies per output pixel (in 2D):
PSNR:

Interesting cases

The sweet spot (σ=16, u=3, d=3, f=8): 3rd-order filters at factor σ/2 avoid all the issues below, giving excellent quality at under 8 multiplies per pixel, regardless of σ.
Low-quality interpolator (σ=16, u=1, d=4, f=8): A low-order interpolator leaves aliased copies of the low-resolution signal in the output, resulting in unwanted high frequencies. These look like stair steps in the spatial domain and red humps at multiples of the low-res max frequency in the frequency domain.
Low-quality prefilter (σ=16, u=4, d=1, f=8): A low-order prefilter doesn't attenuate frequencies above the low-res Nyquist limit enough, so these leak into the low frequencies of the output, causing error at low frequencies. Intuitively, if you downsample an impulse by just averaging down non-overlapping boxes, at low resolution you get the same thing no matter where in its box the original impulse was. The subpixel location was lost. In the spatial domain these look like impulse responses that shift according to the pixel offset. In the frequency domain this shows up as a red hump of RMS error in the low frequencies.
Subsampling too much (σ=8, u=4, d=4, f=8): The prefilter has finite support, so it can't perfectly bandlimit the signal before subsampling — some energy above the low-res Nyquist leaks through and aliases into the low frequencies. At f ≈ σ/2 this doesn't matter, because the low-res Gaussian blur is wide enough to discard those frequencies. But as f increases toward σ, the prefilter and interpolator consume most of the variance budget (their variance grows as f²), leaving a very narrow low-res blur that can't suppress the aliased energy. The result is visible low-frequency error. Limit the downsampling factor to at most σ/2.
Ignoring resampling filter variance (σ=16, u=3, d=3, f=8): A common mistake is to simply use \sigma_\text{lo} = \sigma / f instead of the correct formula that subtracts the variance of the prefilter and interpolator. This over-blurs, producing an impulse response that is too wide. Compare with the sweet spot above.
Transpose symmetry (σ=12, f=8): Swapping interpolator and prefilter orders gives identical PSNR but visibly different artifacts. The u=3,d=2 case either blurs slightly too much or slightly too little depending on the pixel offset (bad low frequencies), while the u=2,d=3 case produces a piecewise linear output (bad high frequencies). See the proof below.
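Two of the cases above hinge on how \sigma_\text{lo} is computed. The variance of an order-o resampling filter at factor f has a simple closed form, because variances add under convolution: each box of width f contributes (f^2-1)/12 and each [1,\;1] contributes 1/4. A quick sketch of the arithmetic (our own, following the construction above):

```python
import numpy as np

def filter_variance(order, factor):
    # order boxes of width `factor`, each with variance (factor^2 - 1) / 12,
    # plus (order - 1) copies of [1, 1], each with variance 1 / 4.
    # Variances add under convolution.
    return order * (factor ** 2 - 1) / 12 + (order - 1) / 4

sigma, f, u, d = 16.0, 8, 3, 3
v_up, v_down = filter_variance(u, f), filter_variance(d, f)

sigma_lo = np.sqrt(sigma ** 2 - v_up - v_down) / f  # correct: ~1.87
naive = sigma / f                                   # ignores V_up, V_down: 2.0
```

The gap between sigma_lo and the naive value is exactly the over-blur in the "ignoring resampling filter variance" case.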

Pareto: Quality vs Throughput

Each point below represents one combination of interpolator order, prefilter order, and downsample factor for the current \sigma. The x-axis is PSNR (higher is better) and the y-axis is multiplies per pixel (fewer is better). Blue points lie on the Pareto frontier — no other setting is both cheaper and more accurate. The currently selected setting is highlighted in red. Click any point to view its impulse response on the right. Note that points on the Pareto frontier typically have u=d or u=d+1. High-quality prefilters are wasted on low-quality interpolators, and vice-versa.

Impulse response for selected point:

Common Alternative Algorithms

The chart below compares several popular Gaussian blur approximations with the one we recommend. Click an entry in the legend to show or hide it. The error vs ground truth is shown as a dashed/shaded region of the matching color.

Truncated Gaussian
  2D muls/pixel: 2(8\sigma+1)
  Pros: Exact (to truncation); simple; fully parallel. Easy to vectorize — can even be run on tensor cores (Zhang et al. 2026).
  Cons: Cost is O(\sigma), which is impractical for large \sigma. Truncating at 3\sigma saves ~25% but introduces visible error.

Triple box blur (Wells 1986)
  2D muls/pixel: 2×4 = 8 (with Heckbert 1986)
  Pros: O(1) per pixel via running sums; simple to implement. Can be fused into a single pass per dimension using repeated integration (Heckbert 1986).
  Cons: Poor approximation quality (piecewise quadratic, not smooth); visible flat top and short tails. Integer-width constraint limits precision. Prone to catastrophic cancellation on HDR inputs. Parallelization and vectorization are straightforward, but add to the complexity of the implementation and the cost per pixel.

Young–Van Vliet (Young & Van Vliet 1995)
  2D muls/pixel: 16 (3rd-order IIR)
  Pros: O(1) per pixel regardless of \sigma; simple coefficient computation.
  Cons: Sequential along each dimension; moderate accuracy; poorly behaved for very large \sigma. Parallelization is possible but non-trivial (see Nehab et al. 2011). Vectorization is possible by transposing or using scattered look-ahead (Parhi & Messerschmitt 1989), at additional cost. Prone to numerical instability on HDR input data. Boundary conditions are tricky to get right (see Triggs & Sdika 2006).

Deriche (Deriche 1993)
  2D muls/pixel: 32 (4th-order IIR)
  Pros: Excellent accuracy; O(1) per pixel.
  Cons: Sequential along each dimension; higher cost than Young–Van Vliet (4th-order vs 3rd-order); same parallelization and numerical stability challenges.

Our preferred method (resampling)
  2D muls/pixel: ~8 (u\!=\!d\!=\!3,\; f\!=\!\sigma/2)
  Pros: Excellent accuracy; O(1) per pixel; fully data-parallel; naturally vectorizable.
  Cons: Not strictly shift-invariant, which might be a showstopper in some contexts. Complex implementation with lots of details to get right, especially if you add TIIRs and/or FRM to accelerate the low-res blur.

References

Appendix: Transpose Symmetry

You may notice that swapping the interpolator and prefilter orders (e.g. u=3, d=2 vs u=2, d=3) produces identical PSNR despite visibly different impulse responses. This is not a bug — it is mathematically guaranteed. For this reason we only plot Pareto points where u \ge d — swapping u and d preserves PSNR and cost, so that variant lands in exactly the same place on the plot.

Claim: Swapping the interpolator and prefilter orders transposes the overall linear operator, preserving the Frobenius norm of the error.

Proof: Write the pipeline as a matrix acting on the input signal:

M = U \cdot Z \cdot G_\text{lo} \cdot S \cdot D

where D is convolution with the prefilter, S is subsampling by f, G_\text{lo} is the low-resolution Gaussian blur, Z is zero-fill upsampling by f, and U is convolution with the interpolator.

Two key facts:

  1. For a symmetric convolution filter F, the matrix is symmetric: F^\top = F.
  2. The transpose of subsampling is zero-fill upsampling, and vice versa: S^\top = Z and Z^\top = S. (Swapping the filters also swaps their normalizations: the prefilter sums to 1 and the interpolator to f, so each picks up a factor of f that cancels the other's.)

Taking the transpose:

M^\top = (U \cdot Z \cdot G_\text{lo} \cdot S \cdot D)^\top = D^\top \cdot S^\top \cdot G_\text{lo}^\top \cdot Z^\top \cdot U^\top = D \cdot Z \cdot G_\text{lo} \cdot S \cdot U

This is exactly the pipeline with the prefilter and interpolator swapped. Furthermore, the sum of their variances is unchanged: the variance of an order-o filter at factor f is o(f^2-1)/12 + (o-1)/4, so the sum depends only on u + d, and \sigma_\text{lo} and hence G_\text{lo} are the same.

Since the ground truth Gaussian G is also a symmetric convolution (G^\top = G), the Frobenius norm of the error is preserved:

\|M^\top - G\|_F = \|(M - G)^\top\|_F = \|M - G\|_F

The total mean squared error across all phases — and therefore the PSNR — is identical. The individual phase responses (rows of M vs rows of M^\top) differ, which is why the artifacts look different, but their total energy is the same. \blacksquare
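The proof can also be checked numerically on a small example (a sketch of our own; we use circular convolution and choose f = 3 so that both filters have odd width, which makes their convolution matrices exactly symmetric):

```python
import numpy as np

def circ_conv_mat(h, n):
    # Circulant matrix for circular convolution with an odd-width, centered,
    # symmetric filter h; the resulting matrix is symmetric.
    r = len(h) // 2
    F = np.zeros((n, n))
    for i in range(n):
        for k, hk in enumerate(h):
            F[i, (i + k - r) % n] += hk
    return F

def resampling_filter(order, factor):
    h = np.ones(factor)
    for _ in range(order - 1):
        h = np.convolve(h, np.ones(factor))
    for _ in range(order - 1):
        h = np.convolve(h, [1.0, 1.0])
    return h / h.sum()

def gaussian(sigma):
    r = int(np.ceil(4 * sigma))
    g = np.exp(-np.arange(-r, r + 1) ** 2 / (2 * sigma ** 2))
    return g / g.sum()

N, f, sigma = 48, 3, 4.0
u_ord, d_ord = 3, 1

U = circ_conv_mat(resampling_filter(u_ord, f), N)  # interpolator (pre-scaling)
D = circ_conv_mat(resampling_filter(d_ord, f), N)  # prefilter
S = np.eye(N)[::f]                                 # subsampling
Z = S.T                                            # zero-fill upsampling

# Variance of an order-o filter at factor f: o boxes plus (o - 1) [1,1]s.
v = sum(o * (f ** 2 - 1) / 12 + (o - 1) / 4 for o in (u_ord, d_ord))
sigma_lo = np.sqrt(sigma ** 2 - v) / f
G_lo = circ_conv_mat(gaussian(sigma_lo), N // f)

M = (f * U) @ Z @ G_lo @ S @ D       # original pipeline (interpolator sums to f)
M_swap = (f * D) @ Z @ G_lo @ S @ U  # prefilter and interpolator orders swapped

G = circ_conv_mat(gaussian(sigma), N)
err, err_swap = np.linalg.norm(M - G), np.linalg.norm(M_swap - G)
```

Here M_swap should equal M transposed, and the two Frobenius errors should agree, matching the claim above.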