KINTSUGI Processing Speed Optimization - Session Summary

Date: December 9, 2025
Context: Addressing a processing-time regression from ~15 min to ~60-175 min after recent changes


Background

Previous Work Completed

  1. Optimized GPU BaSiC SVD with power iteration (10x faster than full SVD)

  2. Restored ThreadPoolExecutor parallelism with GPU BaSiC (8 workers)

  3. Added per-stage timing and progress logging (ProgressCounter class)

  4. Changed output to TIFF for a fair speed comparison (excluding Zarr write overhead)
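The power-iteration SVD in item 1 can be sketched as follows. This is an illustrative block power iteration, not the actual kcorrect_gpu.py code (the function name and defaults are hypothetical); the CuPy import falls back to NumPy so the sketch runs CPU-only as well.

```python
# Minimal sketch of block power iteration for the k leading singular
# triplets -- far cheaper than a full SVD when k is small.
try:
    import cupy as xp          # GPU path when CUDA is available
except ImportError:
    import numpy as xp         # CPU fallback

def top_k_svd(A, k=1, n_iter=10):
    """Approximate the k largest singular triplets of A."""
    m, n = A.shape
    Q = xp.random.rand(n, k)
    for _ in range(n_iter):
        # Alternate multiplications by A and A.T; the QR step keeps
        # the iterated block orthonormal.
        Q, _ = xp.linalg.qr(A @ Q)      # shape (m, k)
        Q, _ = xp.linalg.qr(A.T @ Q)    # shape (n, k)
    # Rayleigh-Ritz step: solve the small (m, k) problem exactly.
    B = A @ Q
    U, s, Vt = xp.linalg.svd(B, full_matrices=False)
    return U, s, Vt @ Q.T
```

BaSiC only needs the leading low-rank components of the tile stack, which is why a truncated iteration like this can beat a full SVD by the quoted order of magnitude.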

Current Implementation Files

  • src/kintsugi/kcorrect_gpu.py - GPU BaSiC illumination correction

  • notebooks/Kstitch/stitching.py - Image stitching with GPU FFT

  • notebooks/2_Cycle_Processing.ipynb - Main processing pipeline

  • notebooks/1_Single_Channel_Eval.ipynb - Parameter tuning


Repository Review Findings

Repositories Analyzed

  1. cecelia (smith6jt-cop) - R/napari image analysis framework

  2. PyBaSiCCellprofilerPlugin (smith6jt-cop) - BaSiC for CellProfiler

  3. m2stitch (smith6jt-cop) - MIST-inspired stitching

  4. mcmicro (smith6jt-cop) - Nextflow multiplex pipeline

  5. ashlar (smith6jt76) - Stitching/registration tool

  6. cylinter (smith6jt) - QC for multiplex imaging

  7. RAPID (smith6jt-cop) - MATLAB processing pipeline

  8. BaSiCPy (peng-lab) - JAX-based BaSiC implementation

Key Optimization Techniques Found

From BaSiCPy (peng-lab)

  • JAX provides ~6x speedup over CuPy/Numba through JIT compilation

  • Device-agnostic arrays with jnp (JAX NumPy)

  • 3D DCT transforms with JaxDCT.dct3d()

  • Multi-worker parallelization parameter

  • User Decision: Stick with CuPy optimization only (no JAX dependency)

From MIST (NIST)

  • Hybrid CPU-GPU pipelining: 24x speedup (59x42 tiles in 26 seconds)

  • cuFFT with CUDA kernels for NCC computation

  • Pipelining overlaps CPU data loading with GPU computation (11.2x speedup)

  • Source: NIST MIST Paper
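The FFT-based correlation MIST runs through cuFFT can be illustrated with phase correlation, a close relative of its NCC computation. This is a stand-in sketch, not MIST's actual kernel; `xp.fft` dispatches to cuFFT under CuPy, and the NumPy fallback keeps it runnable on CPU.

```python
# Estimate a translation between overlapping tiles via phase correlation.
try:
    import cupy as xp
except ImportError:
    import numpy as xp

def phase_correlation_shift(ref, mov, eps=1e-9):
    """Return the integer (dy, dx) shift of `mov` relative to `ref`."""
    F1 = xp.fft.rfft2(ref)
    F2 = xp.fft.rfft2(mov)
    cross = xp.conj(F1) * F2
    cross /= xp.abs(cross) + eps        # keep only the phase
    corr = xp.fft.irfft2(cross, s=ref.shape)
    peak = xp.unravel_index(xp.argmax(corr), corr.shape)
    # Indices past the midpoint wrap around to negative shifts.
    return tuple(int(p) if p <= s // 2 else int(p) - s
                 for p, s in zip(peak, ref.shape))
```

The pipelining speedup then comes from overlapping the host-side tile loads with these GPU FFTs, as noted above.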

From PyBaSiCCellprofilerPlugin

  • Caching mechanism: Compute flatfield/darkfield once, reuse across cycles

  • Memory efficiency: Store baseline drift as per-image array
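The caching mechanism can be sketched as a small wrapper around the BaSiC fit: compute the fields once per key, reuse them on every later cycle. `compute_fields` here is a hypothetical callable standing in for the real BaSiC fit.

```python
# Cache flatfield/darkfield per (channel, ...) key and reuse across cycles.
import numpy as np

class BasicFieldCache:
    def __init__(self, compute_fields):
        self._compute = compute_fields   # callable: stack -> (flatfield, darkfield)
        self._cache = {}

    def get(self, key, stack):
        # Fit only on a cache miss; later cycles reuse the stored fields.
        if key not in self._cache:
            self._cache[key] = self._compute(stack)
        return self._cache[key]

    def correct(self, key, stack):
        flat, dark = self.get(key, stack)
        return (stack - dark) / flat
```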

From RAPID

  • Parallel Computing with configurable worker counts

  • GPU acceleration via CUDA for deconvolution

  • FFT-based convolution (ConvFFT3_S) for 3D operations

  • Batch processing of tiles, cycles, regions
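The idea behind RAPID's ConvFFT3_S can be sketched as follows: one forward/inverse FFT pair replaces a direct spatial convolution, which pays off for large 3D kernels. This is an assumed reimplementation of the technique, not RAPID's MATLAB code.

```python
# Circular 3D convolution via the FFT (the core of FFT-based deconvolution).
try:
    import cupy as xp
except ImportError:
    import numpy as xp

def fft_convolve3d(volume, psf):
    """Convolve `volume` with a small `psf`, circularly, in Fourier space."""
    shape = volume.shape
    kernel = xp.zeros(shape, dtype=volume.dtype)
    sz = psf.shape
    kernel[:sz[0], :sz[1], :sz[2]] = psf
    # Shift the kernel so its center sits at index (0, 0, 0).
    kernel = xp.roll(kernel, tuple(-(s // 2) for s in sz), axis=(0, 1, 2))
    return xp.fft.irfftn(xp.fft.rfftn(volume) * xp.fft.rfftn(kernel), s=shape)
```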


Proposed Implementation Plan

Phase 1: BaSiC Illumination Correction

  • Implement caching for flatfield/darkfield

  • Optimize DCT/IDCT with CuPy

  • Keep power iteration SVD (already implemented)
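The DCT path might look like the sketch below, assuming `cupyx.scipy.fft` provides `dctn`/`idctn` (present in recent CuPy releases; worth verifying against the installed version). The SciPy fallback keeps it runnable without a GPU; the smoothing function itself is illustrative.

```python
# Low-pass a 2D field in the DCT domain -- the smoothness prior BaSiC
# places on the flatfield.
try:
    import cupy as xp
    from cupyx.scipy import fft as xfft   # cuFFT-backed DCT on GPU
except ImportError:
    import numpy as xp
    from scipy import fft as xfft         # CPU fallback

def smooth_in_dct_domain(field, keep_fraction=0.1):
    """Zero out high-frequency DCT coefficients of a 2D field."""
    coeffs = xfft.dctn(field, norm="ortho")
    h = max(1, int(field.shape[0] * keep_fraction))
    w = max(1, int(field.shape[1] * keep_fraction))
    mask = xp.zeros_like(coeffs)
    mask[:h, :w] = 1.0
    return xfft.idctn(coeffs * mask, norm="ortho")
```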

Phase 2: Stitching Optimization

  • GPU pipelining (overlap CPU/GPU work)

  • Batch FFT with async memory transfers

  • Pre-allocated GPU memory buffers
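The pipelining in Phase 2 amounts to double buffering: load tile N+1 while tile N is being processed. The sketch below shows the thread-level overlap only; on a GPU the `process` step would additionally run inside a CUDA stream (e.g. `cupy.cuda.Stream`) with pinned-host staging buffers, which this CPU-runnable sketch omits.

```python
# Overlap I/O (load) with compute (process) using a one-worker prefetcher.
from concurrent.futures import ThreadPoolExecutor

def pipelined_map(load, process, keys):
    """Apply process(load(key)) over keys, prefetching the next load."""
    if not keys:
        return []
    results = []
    with ThreadPoolExecutor(max_workers=1) as io:
        future = io.submit(load, keys[0])
        for nxt in keys[1:]:
            data = future.result()
            future = io.submit(load, nxt)   # prefetch while we compute
            results.append(process(data))
        results.append(process(future.result()))
    return results
```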

Phase 3: Parallel Processing

  • Optimized ThreadPoolExecutor

  • GPU for heavy compute, CPU for I/O

  • Memory-mapped file I/O
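Phase 3's split of GPU compute from CPU I/O could take roughly this shape; the function, file layout, and `correct` callable are all illustrative, not the pipeline's actual API.

```python
# Worker threads read tiles through np.memmap (cheap, and the underlying
# read releases the GIL), then hand each tile to a compute callable.
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def process_tiles(paths, shape, dtype, correct, max_workers=8):
    """Memory-map each raw tile file and apply `correct` to it."""
    def one(path):
        tile = np.memmap(path, dtype=dtype, mode="r", shape=shape)
        return correct(np.asarray(tile))   # copy out of the map, then compute
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(one, paths))
```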

Phase 4: Output Format

  • TIFF as default for speed

  • OME-Zarr as optional post-processing

Expected Gains

  Optimization            Expected Speedup
  BaSiC caching           2-3x
  GPU pipelining          2-4x
  Batch FFT               1.5-2x
  Memory pre-allocation   1.2-1.5x
  Combined                5-10x potential


IMPORTANT: BaSiC Caching Validation Required

User Concern

The current per-z-plane BaSiC processing may produce better corrections than a cached fit, so caching needs objective validation before it is implemented.

Test Design

Test Cases

  1. DAPI (CH1) - High signal, present in all cycles

  2. Blank channels (Cycle 1 & 13, CH2-4) - Background/noise only

  3. Sparse marker (Cycle 2 CH3) - Fewer positive cells

  4. Dense marker (Cycle 3 CH3) - More positive cells

Processing Modes to Compare

  • Mode A (Current): Compute BaSiC per z-plane individually

  • Mode B (Cached): Compute BaSiC once from reference plane, apply to all z-planes
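A possible harness for the two modes is sketched below; `basic_fit` and `apply_correction` are placeholders for the real kcorrect_gpu functions, and the middle-plane reference for Mode B is an assumption.

```python
# Run both correction modes over a z-stack for side-by-side comparison.
import numpy as np

def run_modes(zstack, basic_fit, apply_correction, ref_index=None):
    """Return (mode_a, mode_b) corrected stacks.

    Mode A: fit BaSiC on every z-plane.
    Mode B: fit once on a reference plane, reuse those fields everywhere."""
    if ref_index is None:
        ref_index = zstack.shape[0] // 2       # assumed default
    mode_a = np.stack([apply_correction(z, *basic_fit(z)) for z in zstack])
    flat, dark = basic_fit(zstack[ref_index])
    mode_b = np.stack([apply_correction(z, flat, dark) for z in zstack])
    return mode_a, mode_b
```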

Metrics to Evaluate

  1. Intensity Statistics - Mean, std, min, max, CV across tiles

  2. Flatfield Quality - Uniformity (std/mean), center-to-edge ratio

  3. Darkfield Quality - Magnitude and pattern

  4. Corrected Image Quality - Tile boundary artifacts, inter-tile variation, SNR

  5. Biological Signal - Segmentation consistency, positive cell detection, blank residuals
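Metrics 1 and 2 could be computed as below (the remaining metrics follow the same pattern); the border width and exact definitions are assumptions to be pinned down in the test script.

```python
# Metric 1: coefficient of variation of per-tile mean intensity.
# Metric 2: flatfield uniformity (std/mean) and center-to-edge ratio.
import numpy as np

def intensity_cv(tiles):
    means = np.array([t.mean() for t in tiles])
    return means.std() / means.mean()

def flatfield_uniformity(flatfield, edge=16):
    h, w = flatfield.shape
    center = flatfield[h // 2 - edge:h // 2 + edge,
                       w // 2 - edge:w // 2 + edge].mean()
    border = np.concatenate([flatfield[:edge].ravel(),
                             flatfield[-edge:].ravel()]).mean()
    return flatfield.std() / flatfield.mean(), center / border
```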

Success Criteria

  • Pass: Cached mode within 5% of individual mode on all metrics

  • Fail: Any metric differs >10% or visible artifacts
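One way to encode these criteria: per-metric relative difference, pass if all are within 5%, fail if any exceeds 10%, and inconclusive in the gap between the two thresholds (how to treat that gap is a judgment call the criteria above leave open).

```python
# Compare Mode A vs Mode B metric dictionaries against the thresholds.
def judge(metrics_a, metrics_b, pass_tol=0.05, fail_tol=0.10):
    rel = {k: abs(metrics_b[k] - metrics_a[k]) / (abs(metrics_a[k]) or 1.0)
           for k in metrics_a}
    if all(v <= pass_tol for v in rel.values()):
        return "pass", rel
    if any(v > fail_tol for v in rel.values()):
        return "fail", rel
    return "inconclusive", rel
```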


Files to Modify

  File                                            Changes
  src/kintsugi/kcorrect_gpu.py                    Add caching, optimize DCT with CuPy
  notebooks/Kstitch/stitching.py                  GPU pipelining, batch FFT, memory pre-allocation
  notebooks/Kstitch/_translation_computation.py   GPU-optimized NCC
  notebooks/1_Single_Channel_Eval.ipynb           Use optimized BaSiC
  notebooks/2_Cycle_Processing.ipynb              Parallel processing, caching, memory mapping


Constraints

  • Must remain pure Python (CuPy for CUDA)

  • Must maintain transparent, tunable parameters

  • Must preserve current API and notebook structure

  • GPU acceleration optional (CPU fallback required)
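The optional-GPU constraint is usually met with a module-level alias: import CuPy if a CUDA device is actually usable, otherwise point the array namespace at NumPy so every code path still runs.

```python
# Standard CuPy-or-NumPy fallback pattern for optional GPU acceleration.
import numpy as np

try:
    import cupy as cp
    cp.cuda.runtime.getDeviceCount()   # raises if no usable CUDA device
    xp = cp
    GPU = True
except Exception:
    xp = np
    GPU = False

def to_numpy(a):
    """Return a NumPy array regardless of which backend produced `a`."""
    return cp.asnumpy(a) if GPU else np.asarray(a)
```

Probing `getDeviceCount()` at import time (rather than only catching `ImportError`) also covers the case where CuPy is installed but no GPU is present.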


Next Steps

  1. Create and run BaSiC caching validation test

  2. Based on results, decide whether to implement caching

  3. Implement other optimizations (GPU pipelining, batch FFT, etc.)

  4. Benchmark end-to-end runtime against the ~15 min baseline


References