Algorithm Optimization Plan: BM3D-lite & NLM Denoising
Overview
This document outlines the plan for optimizing two critical image denoising algorithms:
BM3D-lite: O(n²) patch matching → O(n log n) with KD-tree spatial indexing
NLM (Non-Local Means): O(n⁴) pixel-level → O(n²) with OpenCV’s optimized implementation
Current Implementation Analysis
BM3D-lite (denoise_bm3d_lite)
Location:
src/kintsugi/denoise/patch_based.py:97-256Current Complexity: O(n²) where n = number of patches
Bottleneck: Lines 167-187 - nested loop comparing all patches
for i, (ref_patch, (ref_y, ref_x)) in enumerate(zip(patches, positions)):
for j, (_other_patch, (other_y, other_x)) in enumerate(zip(patches, positions)):
if abs(other_y - ref_y) <= half_window and abs(other_x - ref_x) <= half_window:
diff = np.sum((patch_dcts[j] - ref_dct) ** 2)
NLM (denoise_patch_similarity)
Location:
src/kintsugi/denoise/patch_based.py:259-395Current Complexity: O(n⁴) for non-fast mode, O(n²) for fast mode
Bottleneck: Lines 369-388 - four nested loops for pixel-by-pixel processing
for y in range(h):
for x in range(w):
for dy in range(-search_radius, search_radius + 1):
for dx in range(-search_radius, search_radius + 1):
Optimization Strategy
1. BM3D-lite: KD-Tree Spatial Indexing
Approach: Use scipy.spatial.cKDTree to find patches within the search window in O(log n) instead of O(n).
Implementation:
from scipy.spatial import cKDTree
# Build spatial index on patch positions (one-time cost: O(n log n))
positions_array = np.array(positions)
tree = cKDTree(positions_array)
for i, (ref_patch, (ref_y, ref_x)) in enumerate(zip(patches, positions)):
# Query patches within search window: O(log n + k)
candidate_indices = tree.query_ball_point([ref_y, ref_x], r=half_window)
# Filter by DCT similarity (only on candidates)
for j in candidate_indices:
if i == j:
continue
diff = np.sum((patch_dcts[j] - ref_dct) ** 2)
...
Expected Speedup: 10-50x for typical image sizes (1000+ patches)
Risks:
Numerical differences from different patch ordering
Edge cases with very few or very many patches in window
2. NLM: OpenCV Replacement
Approach: Replace custom implementation with cv2.fastNlMeansDenoising() which uses optimized C++ with integral images.
Implementation:
import cv2
def denoise_nlm_optimized(image, h_param=None, patch_size=7, search_radius=21):
img = _to_numpy(image)
# Convert to uint8 for OpenCV (with proper scaling)
if img.dtype != np.uint8:
img_min, img_max = img.min(), img.max()
img_scaled = ((img - img_min) / (img_max - img_min) * 255).astype(np.uint8)
else:
img_scaled = img
img_min, img_max = 0, 255
# OpenCV NLM
template_window = patch_size
search_window = search_radius * 2 + 1
denoised = cv2.fastNlMeansDenoising(img_scaled, None, h_param, template_window, search_window)
# Scale back to original range
result = denoised.astype(np.float64) / 255 * (img_max - img_min) + img_min
return result.astype(image.dtype)
Expected Speedup: 100-1000x
Risks:
OpenCV only supports uint8 input (requires scaling)
Parameter mapping differences (h_param interpretation)
Slight numerical differences due to implementation details
Validation Framework
Test Categories
Synthetic Images: Known ground truth for quantitative metrics
Real Images: Representative microscopy images from KINTSUGI workflows
Edge Cases: Very noisy, very clean, small, large images
Quality Metrics
Metric |
Description |
Acceptance Criteria |
|---|---|---|
PSNR |
Peak Signal-to-Noise Ratio |
Δ < 0.5 dB from original |
SSIM |
Structural Similarity Index |
Δ < 0.01 from original |
MSE |
Mean Squared Error |
Δ < 5% from original |
Visual |
Side-by-side comparison |
No visible artifacts |
Performance Metrics
Metric |
Description |
Target |
|---|---|---|
Wall Time |
Execution time |
>5x improvement |
Memory Peak |
Maximum memory usage |
No increase |
Scaling |
Time vs image size |
Linear or better |
Test Images
Image |
Size |
Noise Level |
Purpose |
|---|---|---|---|
Synthetic gradient |
256×256 |
σ=15, 25, 50 |
Quantitative PSNR |
Synthetic shapes |
512×512 |
σ=25 |
Edge preservation |
Real DAPI channel |
1024×1024 |
Natural |
Practical validation |
Real marker channel |
2048×2048 |
Natural |
Large image scaling |
Stress test |
4096×4096 |
σ=25 |
Performance limits |
Test Implementation
Phase 1: Baseline Measurements
Run current implementations on all test images
Record timing, PSNR, SSIM for each
Save denoised outputs for visual comparison
Phase 2: Optimized Implementation
Implement KD-tree version of BM3D-lite
Implement OpenCV wrapper for NLM
Ensure API compatibility (same function signatures)
Phase 3: Comparative Testing
Run optimized versions on identical test images
Compare quality metrics to baseline
Compare performance metrics to baseline
Generate visual difference maps
Phase 4: Validation Criteria
PASS conditions:
PSNR difference < 0.5 dB on all synthetic images
SSIM difference < 0.01 on all images
No visible artifacts in visual inspection
Performance improvement > 5x
FAIL conditions:
PSNR drops by > 1 dB on any image
Visible artifacts (blocking, ringing, etc.)
Edge blurring significantly worse than original
Rollback Plan
If optimizations fail validation:
Keep original implementations as default
Add optimized versions as
_fastvariants with appropriate warningsDocument known limitations in docstrings
Timeline
Phase |
Tasks |
Est. Time |
|---|---|---|
1 |
Create test framework, generate test images |
1-2 hours |
2 |
Implement BM3D-lite optimization |
1-2 hours |
3 |
Implement NLM replacement |
1 hour |
4 |
Run validation tests |
1 hour |
5 |
Document results, create PR |
1 hour |
Files to Create/Modify
New Files
tests/test_denoise_optimization.py- Validation test suitesrc/kintsugi/denoise/patch_based_optimized.py- Optimized implementations (temporary for testing)
Modified Files (after validation)
src/kintsugi/denoise/patch_based.py- Replace with optimized versions
Success Criteria Summary
Algorithm |
Quality |
Performance |
Decision |
|---|---|---|---|
BM3D-lite KD-tree |
PSNR ±0.5dB, SSIM ±0.01 |
>10x faster |
Merge if both pass |
NLM OpenCV |
PSNR ±0.5dB, SSIM ±0.01 |
>50x faster |
Merge if both pass |