KINTSUGI Processing Speed Optimization - Session Summary

Date: December 9, 2025
Context: Addressing a processing-time regression from ~15 min to ~60-175 min after recent changes


Background

Previous Work Completed

  1. Optimized GPU BaSiC SVD with power iteration (10x faster than full SVD)

  2. Restored ThreadPoolExecutor parallelism with GPU BaSiC (8 workers)

  3. Added per-stage timing and progress logging (ProgressCounter class)

  4. Changed output to TIFF for a fair speed comparison (excluding Zarr write overhead)
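The power-iteration SVD in item 1 can be sketched as follows. This is an illustrative block power iteration, not the actual kcorrect_gpu.py code (the function name and defaults are hypothetical); the CuPy import falls back to NumPy so the sketch runs CPU-only as well.

```python
# Minimal sketch of block power iteration for the k leading singular
# triplets -- far cheaper than a full SVD when k is small.
try:
    import cupy as xp          # GPU path when CUDA is available
except ImportError:
    import numpy as xp         # CPU fallback

def top_k_svd(A, k=1, n_iter=10):
    """Approximate the k largest singular triplets of A."""
    m, n = A.shape
    Q = xp.random.rand(n, k)
    for _ in range(n_iter):
        # Alternate multiplications by A and A.T; the QR step keeps
        # the iterated block orthonormal.
        Q, _ = xp.linalg.qr(A @ Q)      # shape (m, k)
        Q, _ = xp.linalg.qr(A.T @ Q)    # shape (n, k)
    # Rayleigh-Ritz step: solve the small (m, k) problem exactly.
    B = A @ Q
    U, s, Vt = xp.linalg.svd(B, full_matrices=False)
    return U, s, Vt @ Q.T
```

BaSiC only needs the leading low-rank components of the tile stack, which is why a truncated iteration like this can beat a full SVD by the quoted order of magnitude.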

Current Implementation Files

  • src/kintsugi/kcorrect_gpu.py - GPU BaSiC illumination correction

  • notebooks/Kstitch/stitching.py - Image stitching with GPU FFT

  • notebooks/2_Cycle_Processing.ipynb - Main processing pipeline

  • notebooks/1_Single_Channel_Eval.ipynb - Parameter tuning


Repository Review Findings

Repositories Analyzed

  1. cecelia (smith6jt-cop) - R/napari image analysis framework

  2. PyBaSiCCellprofilerPlugin (smith6jt-cop) - BaSiC for CellProfiler

  3. m2stitch (smith6jt-cop) - MIST-inspired stitching

  4. mcmicro (smith6jt-cop) - Nextflow multiplex pipeline

  5. ashlar (smith6jt76) - Stitching/registration tool

  6. cylinter (smith6jt) - QC for multiplex imaging

  7. RAPID (smith6jt-cop) - MATLAB processing pipeline

  8. BaSiCPy (peng-lab) - JAX-based BaSiC implementation

Key Optimization Techniques Found

From BaSiCPy (peng-lab)

  • JAX provides ~6x speedup over CuPy/Numba through JIT compilation

  • Device-agnostic arrays with jnp (JAX NumPy)

  • 3D DCT transforms with JaxDCT.dct3d()

  • Multi-worker parallelization parameter

  • User Decision: Stick with CuPy optimization only (no JAX dependency)

From MIST (NIST)

  • Hybrid CPU-GPU pipelining: 24x speedup (59x42 tiles in 26 seconds)

  • cuFFT with CUDA kernels for NCC computation

  • Pipelining overlaps CPU data loading with GPU computation (11.2x speedup)

  • Source: NIST MIST Paper
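The FFT-based correlation MIST runs through cuFFT can be illustrated with phase correlation, a close relative of its NCC computation. This is a stand-in sketch, not MIST's actual kernel; `xp.fft` dispatches to cuFFT under CuPy, and the NumPy fallback keeps it runnable on CPU.

```python
# Estimate a translation between overlapping tiles via phase correlation.
try:
    import cupy as xp
except ImportError:
    import numpy as xp

def phase_correlation_shift(ref, mov, eps=1e-9):
    """Return the integer (dy, dx) shift of `mov` relative to `ref`."""
    F1 = xp.fft.rfft2(ref)
    F2 = xp.fft.rfft2(mov)
    cross = xp.conj(F1) * F2
    cross /= xp.abs(cross) + eps        # keep only the phase
    corr = xp.fft.irfft2(cross, s=ref.shape)
    peak = xp.unravel_index(xp.argmax(corr), corr.shape)
    # Indices past the midpoint wrap around to negative shifts.
    return tuple(int(p) if p <= s // 2 else int(p) - s
                 for p, s in zip(peak, ref.shape))
```

The pipelining speedup then comes from overlapping the host-side tile loads with these GPU FFTs, as noted above.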

From PyBaSiCCellprofilerPlugin

  • Caching mechanism: Compute flatfield/darkfield once, reuse across cycles

  • Memory efficiency: Store baseline drift as per-image array
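The caching mechanism can be sketched as a small wrapper around the BaSiC fit: compute the fields once per key, reuse them on every later cycle. `compute_fields` here is a hypothetical callable standing in for the real BaSiC fit.

```python
# Cache flatfield/darkfield per (channel, ...) key and reuse across cycles.
import numpy as np

class BasicFieldCache:
    def __init__(self, compute_fields):
        self._compute = compute_fields   # callable: stack -> (flatfield, darkfield)
        self._cache = {}

    def get(self, key, stack):
        # Fit only on a cache miss; later cycles reuse the stored fields.
        if key not in self._cache:
            self._cache[key] = self._compute(stack)
        return self._cache[key]

    def correct(self, key, stack):
        flat, dark = self.get(key, stack)
        return (stack - dark) / flat
```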

From RAPID

  • Parallel Computing with configurable worker counts

  • GPU acceleration via CUDA for deconvolution

  • FFT-based convolution (ConvFFT3_S) for 3D operations

  • Batch processing of tiles, cycles, regions
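The idea behind RAPID's ConvFFT3_S can be sketched as follows: one forward/inverse FFT pair replaces a direct spatial convolution, which pays off for large 3D kernels. This is an assumed reimplementation of the technique, not RAPID's MATLAB code.

```python
# Circular 3D convolution via the FFT (the core of FFT-based deconvolution).
try:
    import cupy as xp
except ImportError:
    import numpy as xp

def fft_convolve3d(volume, psf):
    """Convolve `volume` with a small `psf`, circularly, in Fourier space."""
    shape = volume.shape
    kernel = xp.zeros(shape, dtype=volume.dtype)
    sz = psf.shape
    kernel[:sz[0], :sz[1], :sz[2]] = psf
    # Shift the kernel so its center sits at index (0, 0, 0).
    kernel = xp.roll(kernel, tuple(-(s // 2) for s in sz), axis=(0, 1, 2))
    return xp.fft.irfftn(xp.fft.rfftn(volume) * xp.fft.rfftn(kernel), s=shape)
```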


Proposed Implementation Plan

Phase 1: BaSiC Illumination Correction

  • Implement caching for flatfield/darkfield

  • Optimize DCT/IDCT with CuPy

  • Keep power iteration SVD (already implemented)
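The DCT path might look like the sketch below, assuming `cupyx.scipy.fft` provides `dctn`/`idctn` (present in recent CuPy releases; worth verifying against the installed version). The SciPy fallback keeps it runnable without a GPU; the smoothing function itself is illustrative.

```python
# Low-pass a 2D field in the DCT domain -- the smoothness prior BaSiC
# places on the flatfield.
try:
    import cupy as xp
    from cupyx.scipy import fft as xfft   # cuFFT-backed DCT on GPU
except ImportError:
    import numpy as xp
    from scipy import fft as xfft         # CPU fallback

def smooth_in_dct_domain(field, keep_fraction=0.1):
    """Zero out high-frequency DCT coefficients of a 2D field."""
    coeffs = xfft.dctn(field, norm="ortho")
    h = max(1, int(field.shape[0] * keep_fraction))
    w = max(1, int(field.shape[1] * keep_fraction))
    mask = xp.zeros_like(coeffs)
    mask[:h, :w] = 1.0
    return xfft.idctn(coeffs * mask, norm="ortho")
```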

Phase 2: Stitching Optimization

  • GPU pipelining (overlap CPU/GPU work)

  • Batch FFT with async memory transfers

  • Pre-allocated GPU memory buffers
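The pipelining in Phase 2 amounts to double buffering: load tile N+1 while tile N is being processed. The sketch below shows the thread-level overlap only; on a GPU the `process` step would additionally run inside a CUDA stream (e.g. `cupy.cuda.Stream`) with pinned-host staging buffers, which this CPU-runnable sketch omits.

```python
# Overlap I/O (load) with compute (process) using a one-worker prefetcher.
from concurrent.futures import ThreadPoolExecutor

def pipelined_map(load, process, keys):
    """Apply process(load(key)) over keys, prefetching the next load."""
    if not keys:
        return []
    results = []
    with ThreadPoolExecutor(max_workers=1) as io:
        future = io.submit(load, keys[0])
        for nxt in keys[1:]:
            data = future.result()
            future = io.submit(load, nxt)   # prefetch while we compute
            results.append(process(data))
        results.append(process(future.result()))
    return results
```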

Phase 3: Parallel Processing

  • Optimized ThreadPoolExecutor

  • GPU for heavy compute, CPU for I/O

  • Memory-mapped file I/O
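Phase 3's split of GPU compute from CPU I/O could take roughly this shape; the function, file layout, and `correct` callable are all illustrative, not the pipeline's actual API.

```python
# Worker threads read tiles through np.memmap (cheap, and the underlying
# read releases the GIL), then hand each tile to a compute callable.
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def process_tiles(paths, shape, dtype, correct, max_workers=8):
    """Memory-map each raw tile file and apply `correct` to it."""
    def one(path):
        tile = np.memmap(path, dtype=dtype, mode="r", shape=shape)
        return correct(np.asarray(tile))   # copy out of the map, then compute
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(one, paths))
```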

Phase 4: Output Format

  • TIFF as default for speed

  • OME-Zarr as optional post-processing

Expected Gains

  Optimization            Expected Speedup
  BaSiC caching           2-3x
  GPU pipelining          2-4x
  Batch FFT               1.5-2x
  Memory pre-allocation   1.2-1.5x
  Combined                5-10x potential


IMPORTANT: BaSiC Caching Validation Required

User Concern

The current per-z-plane BaSiC processing may produce better corrections than a cached fit, so caching needs objective validation before it is implemented.

Test Design

Test Cases

  1. DAPI (CH1) - High signal, present in all cycles

  2. Blank channels (Cycle 1 & 13, CH2-4) - Background/noise only

  3. Sparse marker (Cycle 2 CH3) - Fewer positive cells

  4. Dense marker (Cycle 3 CH3) - More positive cells

Processing Modes to Compare

  • Mode A (Current): Compute BaSiC per z-plane individually

  • Mode B (Cached): Compute BaSiC once from reference plane, apply to all z-planes
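A possible harness for the two modes is sketched below; `basic_fit` and `apply_correction` are placeholders for the real kcorrect_gpu functions, and the middle-plane reference for Mode B is an assumption.

```python
# Run both correction modes over a z-stack for side-by-side comparison.
import numpy as np

def run_modes(zstack, basic_fit, apply_correction, ref_index=None):
    """Return (mode_a, mode_b) corrected stacks.

    Mode A: fit BaSiC on every z-plane.
    Mode B: fit once on a reference plane, reuse those fields everywhere."""
    if ref_index is None:
        ref_index = zstack.shape[0] // 2       # assumed default
    mode_a = np.stack([apply_correction(z, *basic_fit(z)) for z in zstack])
    flat, dark = basic_fit(zstack[ref_index])
    mode_b = np.stack([apply_correction(z, flat, dark) for z in zstack])
    return mode_a, mode_b
```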

Metrics to Evaluate

  1. Intensity Statistics - Mean, std, min, max, CV across tiles

  2. Flatfield Quality - Uniformity (std/mean), center-to-edge ratio

  3. Darkfield Quality - Magnitude and pattern

  4. Corrected Image Quality - Tile boundary artifacts, inter-tile variation, SNR

  5. Biological Signal - Segmentation consistency, positive cell detection, blank residuals
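Metrics 1 and 2 could be computed as below (the remaining metrics follow the same pattern); the border width and exact definitions are assumptions to be pinned down in the test script.

```python
# Metric 1: coefficient of variation of per-tile mean intensity.
# Metric 2: flatfield uniformity (std/mean) and center-to-edge ratio.
import numpy as np

def intensity_cv(tiles):
    means = np.array([t.mean() for t in tiles])
    return means.std() / means.mean()

def flatfield_uniformity(flatfield, edge=16):
    h, w = flatfield.shape
    center = flatfield[h // 2 - edge:h // 2 + edge,
                       w // 2 - edge:w // 2 + edge].mean()
    border = np.concatenate([flatfield[:edge].ravel(),
                             flatfield[-edge:].ravel()]).mean()
    return flatfield.std() / flatfield.mean(), center / border
```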

Success Criteria

  • Pass: Cached mode within 5% of individual mode on all metrics

  • Fail: Any metric differs >10% or visible artifacts
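One way to encode these criteria: per-metric relative difference, pass if all are within 5%, fail if any exceeds 10%, and inconclusive in the gap between the two thresholds (how to treat that gap is a judgment call the criteria above leave open).

```python
# Compare Mode A vs Mode B metric dictionaries against the thresholds.
def judge(metrics_a, metrics_b, pass_tol=0.05, fail_tol=0.10):
    rel = {k: abs(metrics_b[k] - metrics_a[k]) / (abs(metrics_a[k]) or 1.0)
           for k in metrics_a}
    if all(v <= pass_tol for v in rel.values()):
        return "pass", rel
    if any(v > fail_tol for v in rel.values()):
        return "fail", rel
    return "inconclusive", rel
```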


Files to Modify

  File                                            Changes
  src/kintsugi/kcorrect_gpu.py                    Add caching, optimize DCT with CuPy
  notebooks/Kstitch/stitching.py                  GPU pipelining, batch FFT, memory pre-allocation
  notebooks/Kstitch/_translation_computation.py   GPU-optimized NCC
  notebooks/1_Single_Channel_Eval.ipynb           Use optimized BaSiC
  notebooks/2_Cycle_Processing.ipynb              Parallel processing, caching, memory mapping


Constraints

  • Must remain pure Python (CuPy for CUDA)

  • Must maintain transparent, tunable parameters

  • Must preserve current API and notebook structure

  • GPU acceleration optional (CPU fallback required)
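The optional-GPU constraint is usually met with a module-level alias: import CuPy if a CUDA device is actually usable, otherwise point the array namespace at NumPy so every code path still runs.

```python
# Standard CuPy-or-NumPy fallback pattern for optional GPU acceleration.
import numpy as np

try:
    import cupy as cp
    cp.cuda.runtime.getDeviceCount()   # raises if no usable CUDA device
    xp = cp
    GPU = True
except Exception:
    xp = np
    GPU = False

def to_numpy(a):
    """Return a NumPy array regardless of which backend produced `a`."""
    return cp.asnumpy(a) if GPU else np.asarray(a)
```

Probing `getDeviceCount()` at import time (rather than only catching `ImportError`) also covers the case where CuPy is installed but no GPU is present.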


Next Steps

  1. Create and run BaSiC caching validation test

  2. Based on results, decide whether to implement caching

  3. Implement other optimizations (GPU pipelining, batch FFT, etc.)

  4. Benchmark end-to-end runtime against the ~15 min baseline


References