Algorithm Optimization Plan: BM3D-lite & NLM Denoising

Overview

This document outlines the plan for optimizing two critical image denoising algorithms:

  1. BM3D-lite: O(n²) patch matching → O(n log n) with KD-tree spatial indexing

  2. NLM (Non-Local Means): O(n⁴) pixel-level → O(n²) with OpenCV’s optimized implementation

Current Implementation Analysis

BM3D-lite (denoise_bm3d_lite)

  • Location: src/kintsugi/denoise/patch_based.py:97-256

  • Current Complexity: O(n²) where n = number of patches

  • Bottleneck: Lines 167-187 - nested loop comparing all patches

for i, (ref_patch, (ref_y, ref_x)) in enumerate(zip(patches, positions)):
    for j, (_other_patch, (other_y, other_x)) in enumerate(zip(patches, positions)):
        if abs(other_y - ref_y) <= half_window and abs(other_x - ref_x) <= half_window:
            diff = np.sum((patch_dcts[j] - ref_dct) ** 2)

NLM (denoise_patch_similarity)

  • Location: src/kintsugi/denoise/patch_based.py:259-395

  • Current Complexity: O(n⁴) for non-fast mode, O(n²) for fast mode

  • Bottleneck: Lines 369-388 - four nested loops for pixel-by-pixel processing

for y in range(h):
    for x in range(w):
        for dy in range(-search_radius, search_radius + 1):
            for dx in range(-search_radius, search_radius + 1):

Optimization Strategy

1. BM3D-lite: KD-Tree Spatial Indexing

Approach: Use scipy.spatial.cKDTree to find patches within the search window in O(log n) instead of O(n).

Implementation:

from scipy.spatial import cKDTree

# Build spatial index on patch positions (one-time cost: O(n log n))
positions_array = np.array(positions)
tree = cKDTree(positions_array)

for i, (ref_patch, (ref_y, ref_x)) in enumerate(zip(patches, positions)):
    # Query patches within search window: O(log n + k)
    candidate_indices = tree.query_ball_point([ref_y, ref_x], r=half_window)

    # Filter by DCT similarity (only on candidates)
    for j in candidate_indices:
        if i == j:
            continue
        diff = np.sum((patch_dcts[j] - ref_dct) ** 2)
        ...

Expected Speedup: 10-50x for typical image sizes (1000+ patches)

Risks:

  • Numerical differences from different patch ordering

  • Edge cases with very few or very many patches in window

2. NLM: OpenCV Replacement

Approach: Replace custom implementation with cv2.fastNlMeansDenoising() which uses optimized C++ with integral images.

Implementation:

import cv2

def denoise_nlm_optimized(image, h_param=None, patch_size=7, search_radius=21):
    img = _to_numpy(image)

    # Convert to uint8 for OpenCV (with proper scaling)
    if img.dtype != np.uint8:
        img_min, img_max = img.min(), img.max()
        img_scaled = ((img - img_min) / (img_max - img_min) * 255).astype(np.uint8)
    else:
        img_scaled = img
        img_min, img_max = 0, 255

    # OpenCV NLM
    template_window = patch_size
    search_window = search_radius * 2 + 1
    denoised = cv2.fastNlMeansDenoising(img_scaled, None, h_param, template_window, search_window)

    # Scale back to original range
    result = denoised.astype(np.float64) / 255 * (img_max - img_min) + img_min
    return result.astype(image.dtype)

Expected Speedup: 100-1000x

Risks:

  • OpenCV only supports uint8 input (requires scaling)

  • Parameter mapping differences (h_param interpretation)

  • Slight numerical differences due to implementation details


Validation Framework

Test Categories

  1. Synthetic Images: Known ground truth for quantitative metrics

  2. Real Images: Representative microscopy images from KINTSUGI workflows

  3. Edge Cases: Very noisy, very clean, small, large images

Quality Metrics

Metric

Description

Acceptance Criteria

PSNR

Peak Signal-to-Noise Ratio

Δ < 0.5 dB from original

SSIM

Structural Similarity Index

Δ < 0.01 from original

MSE

Mean Squared Error

Δ < 5% from original

Visual

Side-by-side comparison

No visible artifacts

Performance Metrics

Metric

Description

Target

Wall Time

Execution time

>5x improvement

Memory Peak

Maximum memory usage

No increase

Scaling

Time vs image size

Linear or better

Test Images

Image

Size

Noise Level

Purpose

Synthetic gradient

256×256

σ=15, 25, 50

Quantitative PSNR

Synthetic shapes

512×512

σ=25

Edge preservation

Real DAPI channel

1024×1024

Natural

Practical validation

Real marker channel

2048×2048

Natural

Large image scaling

Stress test

4096×4096

σ=25

Performance limits


Test Implementation

Phase 1: Baseline Measurements

  1. Run current implementations on all test images

  2. Record timing, PSNR, SSIM for each

  3. Save denoised outputs for visual comparison

Phase 2: Optimized Implementation

  1. Implement KD-tree version of BM3D-lite

  2. Implement OpenCV wrapper for NLM

  3. Ensure API compatibility (same function signatures)

Phase 3: Comparative Testing

  1. Run optimized versions on identical test images

  2. Compare quality metrics to baseline

  3. Compare performance metrics to baseline

  4. Generate visual difference maps

Phase 4: Validation Criteria

PASS conditions:

  • PSNR difference < 0.5 dB on all synthetic images

  • SSIM difference < 0.01 on all images

  • No visible artifacts in visual inspection

  • Performance improvement > 5x

FAIL conditions:

  • PSNR drops by > 1 dB on any image

  • Visible artifacts (blocking, ringing, etc.)

  • Edge blurring significantly worse than original


Rollback Plan

If optimizations fail validation:

  1. Keep original implementations as default

  2. Add optimized versions as _fast variants with appropriate warnings

  3. Document known limitations in docstrings


Timeline

Phase

Tasks

Est. Time

1

Create test framework, generate test images

1-2 hours

2

Implement BM3D-lite optimization

1-2 hours

3

Implement NLM replacement

1 hour

4

Run validation tests

1 hour

5

Document results, create PR

1 hour


Files to Create/Modify

New Files

  • tests/test_denoise_optimization.py - Validation test suite

  • src/kintsugi/denoise/patch_based_optimized.py - Optimized implementations (temporary for testing)

Modified Files (after validation)

  • src/kintsugi/denoise/patch_based.py - Replace with optimized versions


Success Criteria Summary

Algorithm

Quality

Performance

Decision

BM3D-lite KD-tree

PSNR ±0.5dB, SSIM ±0.01

>10x faster

Merge if both pass

NLM OpenCV

PSNR ±0.5dB, SSIM ±0.01

>50x faster

Merge if both pass