KINTSUGI Performance Audit Report
Date: 2025-12-12 Scope: Complete codebase analysis for performance anti-patterns
Executive Summary
This audit identified 40+ performance issues across four categories:
- N+1 Query Patterns: 5 database-related issues causing 10-30 queries instead of 1-2
- Inefficient Algorithms: 12 O(n²) or worse patterns in image processing
- Memory Inefficiencies: 12 unnecessary array copies and allocations
- I/O Bottlenecks: 12 sequential/repeated file operations
Estimated Impact: 5-100x speedup potential in key image processing pipelines.
1. N+1 Query Patterns (Database)
1.1 Loop with DB Query Per Operation (CRITICAL)
File: src/kintsugi/mcp/tools/learning.py:268-341
```python
operations = ["blank_subtraction", "denoise", "clahe", "clean_background", "gaussian_subtract"]
for operation in operations:  # 5 iterations
    learned = engine.recommend_parameters(  # 2 DB queries per iteration
        tissue_type=tissue_type,
        marker_name=marker_name,
        operation=operation,
        ...
    )
```
Impact: 10 database queries (5 operations × 2 databases) instead of 2 batch queries.
Fix: Batch query all operations at once:
```python
all_learned = engine.recommend_parameters_batch(
    tissue_type=tissue_type,
    marker_name=marker_name,
    operations=operations,
)
```
1.2 Loop with DB Write Per Operation (CRITICAL)
File: src/kintsugi/mcp/tools/learning.py:521-539
```python
for operation, parameters in operations_params.items():
    result = await record_successful_parameters(...)  # DB write per operation
```
Impact: Up to 10 database writes (5 operations × 2 databases).
Fix: Use batch inserts with a single transaction.
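A minimal sketch of the batched write, assuming a sqlite3 backend and an illustrative `parameter_records(operation, parameters)` schema (the helper name and columns are hypothetical):

```python
import sqlite3

def record_parameters_batch(conn: sqlite3.Connection, operations_params: dict) -> None:
    """Insert all operation/parameter pairs in one transaction.

    Hypothetical schema: parameter_records(operation TEXT, parameters TEXT).
    """
    rows = [(op, repr(params)) for op, params in operations_params.items()]
    with conn:  # one transaction (one commit/fsync) instead of one per row
        conn.executemany(
            "INSERT INTO parameter_records (operation, parameters) VALUES (?, ?)",
            rows,
        )
```

`executemany` inside a single `with conn:` block is the key: the speedup comes from amortizing the commit, not from the insert statements themselves.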
1.3 Sequential Queries in Statistics Gathering
File: src/kintsugi/claude/parameter_learning.py:823-865
```python
cursor.execute("SELECT COUNT(*) FROM parameter_records")     # Query 1
cursor.execute("SELECT DISTINCT tissue_type_normalized...")  # Query 2
cursor.execute("SELECT DISTINCT marker_name_normalized...")  # Query 3
cursor.execute("SELECT operation, COUNT(*), AVG(...)...")    # Query 4
```
Impact: 8 total queries (4 per database).
Fix: Combine into single query with subqueries or CTEs.
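A hedged sketch of the combined statistics query, assuming a sqlite3 backend and the column names visible in the excerpt above (the per-operation aggregate, query 4, still needs its own GROUP BY statement):

```python
import sqlite3

# Scalar subqueries collapse three sequential round trips into one statement
STATS_QUERY = """
SELECT
    (SELECT COUNT(*) FROM parameter_records)                               AS n_records,
    (SELECT COUNT(DISTINCT tissue_type_normalized) FROM parameter_records) AS n_tissues,
    (SELECT COUNT(DISTINCT marker_name_normalized) FROM parameter_records) AS n_markers
"""

def gather_stats(conn: sqlite3.Connection) -> tuple:
    # One execute() replaces the first three queries shown above
    return conn.execute(STATS_QUERY).fetchone()
```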
1.4 Dual Database Queries in recommend_parameters()
File: src/kintsugi/claude/parameter_learning.py:477-504
Impact: Same query executed against project and global databases separately.
Fix: Use ATTACH DATABASE and UNION queries.
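A sketch of the ATTACH + UNION approach, assuming sqlite3; the table and column names are illustrative, not the actual schema:

```python
import sqlite3

def attach_global(conn: sqlite3.Connection, global_db_path: str) -> None:
    # Make the global database visible to the same connection
    conn.execute("ATTACH DATABASE ? AS global_db", (global_db_path,))

def recommend_from_both(conn: sqlite3.Connection, operation: str) -> list:
    # One statement scans both stores instead of two separate round trips
    return conn.execute(
        """
        SELECT operation, parameters, 'project' AS source
        FROM main.parameter_records WHERE operation = ?
        UNION ALL
        SELECT operation, parameters, 'global' AS source
        FROM global_db.parameter_records WHERE operation = ?
        """,
        (operation, operation),
    ).fetchall()
```

Tagging each branch with a `source` column preserves the project-vs-global provenance the current two-query design gives for free.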
2. Inefficient Algorithms
2.1 O(n²) Patch Similarity Comparison (CRITICAL)
File: src/kintsugi/denoise/patch_based.py:167-187
```python
for i, (ref_patch, (ref_y, ref_x)) in enumerate(zip(patches, positions)):
    for j, (_other_patch, (other_y, other_x)) in enumerate(zip(patches, positions)):
        if i == j:
            continue
        diff = np.sum((patch_dcts[j] - ref_dct) ** 2)
```
Impact: O(n²) for patch comparison; dominates BM3D-lite runtime.
Fix: Use KD-tree or ball tree for spatial indexing:
```python
from scipy.spatial import cKDTree

tree = cKDTree(positions)
neighbors = tree.query_ball_point([ref_y, ref_x], r=half_window)
```
2.2 O(n⁴) Pixel-Level NLM Denoising (CRITICAL)
File: src/kintsugi/denoise/patch_based.py:369-388
```python
for py in range(offset, h + offset):
    for px in range(offset, w + offset):  # O(n²) image pixels
        for dy in range(-search_radius, search_radius + 1):
            for dx in range(-search_radius, search_radius + 1):  # O(m²) search window
                dist = np.sum((ref_patch - comp_patch) ** 2)
```
Impact: O(n⁴) complexity makes this unusable for large images.
Fix: Use an integral-image formulation via scipy.ndimage.uniform_filter, or OpenCV's cv2.fastNlMeansDenoising.
2.3 Unvectorized Overlap Matrix Computation
File: src/kintsugi/segment/postprocess.py:410-415
```python
for i in range(labels1.shape[0]):
    for j in range(labels1.shape[1]):
        l1 = labels1[i, j]
        l2 = labels2[i, j]
        overlap[l1, l2] += 1
```
Impact: Python-level iteration over every pixel; a vectorized solution does the same linear work in compiled code, orders of magnitude faster.
Fix:
```python
overlap = np.zeros((max_label1 + 1, max_label2 + 1), dtype=np.int64)
np.add.at(overlap, (labels1.ravel(), labels2.ravel()), 1)
```
2.4 Suboptimal K-NN with Full Sort
File: src/kintsugi/qc/cell_qc.py:504
```python
neighbor_indices = np.argsort(distances[i])[1 : n_neighbors + 1]
```
Impact: O(n log n) sorting when only top-k needed.
Fix (note that np.argpartition returns an unordered partition, so the self-index must be dropped explicitly rather than sliced off):
```python
idx = np.argpartition(distances[i], n_neighbors + 1)[: n_neighbors + 1]
neighbor_indices = idx[idx != i][:n_neighbors]  # O(n) instead of O(n log n)
```
2.5 Repeated Morphology Per Label
File: src/kintsugi/segment/postprocess.py:237-245
```python
for label in unique_labels:
    mask = labels == label
    for _ in range(iterations):
        mask = morphology.binary_closing(mask, morphology.disk(1))
        mask = morphology.binary_opening(mask, morphology.disk(1))
```
Impact: N labels × M iterations × 2 morphology ops.
Fix: Restrict each label's morphology to its bounding box (scipy.ndimage.find_objects) instead of allocating and scanning a full-image mask per label.
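One way to cut the per-label cost, sketched with scipy.ndimage (the helper name, structuring element, and iteration handling are illustrative, not the module's actual code):

```python
import numpy as np
from scipy import ndimage

def smooth_labels(labels: np.ndarray, iterations: int = 1) -> np.ndarray:
    """Apply closing/opening per label, but only inside each label's
    bounding box, so cost scales with object size rather than image size.

    Hypothetical helper; pad the boxes if boundary effects matter.
    """
    out = np.zeros_like(labels)
    structure = ndimage.generate_binary_structure(2, 1)
    # find_objects returns one bounding-box slice per label, in label order
    for label, slc in enumerate(ndimage.find_objects(labels), start=1):
        if slc is None:  # label absent from the image
            continue
        mask = labels[slc] == label
        for _ in range(iterations):
            mask = ndimage.binary_closing(mask, structure=structure)
            mask = ndimage.binary_opening(mask, structure=structure)
        out[slc][mask] = label
    return out
```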
2.6 Unvectorized Peak Detection
File: src/kintsugi/qc/marker_qc.py:246-249
```python
peaks = []
for i in range(1, len(hist) - 1):
    if hist[i] > hist[i - 1] and hist[i] > hist[i + 1]:
        peaks.append(i)
```
Fix:
```python
peaks = np.where((hist[1:-1] > hist[:-2]) & (hist[1:-1] > hist[2:]))[0] + 1
```
2.7 Custom Otsu’s Threshold Instead of Library
File: src/kintsugi/qc/marker_qc.py:269-305
Impact: 256-iteration loop reimplementing optimized library code.
Fix:
```python
from skimage.filters import threshold_otsu

threshold = threshold_otsu(intensities)
```
3. Memory Inefficiencies
3.1 Unnecessary Dask Array Copies (CRITICAL)
File: src/kintsugi/mcp/tools/signal_isolation.py
- Line 472: forces computation
- Line 590: doubles memory
- Line 727: unnecessary copy
Impact: For 8000×8000 uint16 images, each copy uses ~128MB.
Fix: Remove explicit copies; dask handles immutability internally.
3.2 Full-Size Mask Allocation Per Tile
File: src/kintsugi/segment/sam_wrapper.py:314-318
```python
for mask in tile_masks:
    full_mask = np.zeros((h, w), dtype=bool)  # Full image size per mask!
    full_mask[y:y_end, x:x_end] = seg
```
Impact: 100 masks × 8000×8000 = ~6.4GB memory.
Fix: Store tile-relative coordinates or use sparse arrays.
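A minimal sketch of the tile-relative option (the helper names are illustrative; the point is to store one small patch plus an origin per mask, and materialize a full-size array only once, on demand):

```python
import numpy as np

def store_mask(masks: list, seg: np.ndarray, y: int, x: int) -> None:
    """Keep each mask as (top-left origin, boolean tile patch)."""
    masks.append(((y, x), seg.astype(bool)))

def paint_full(masks: list, h: int, w: int) -> np.ndarray:
    """Materialize a single full-size label image instead of one
    full-size array per mask."""
    full = np.zeros((h, w), dtype=np.int32)
    for idx, ((y, x), seg) in enumerate(masks, start=1):
        full[y : y + seg.shape[0], x : x + seg.shape[1]][seg] = idx
    return full
```

Memory then scales with total mask area rather than n_masks × image size.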
3.3 Inefficient Patch List Building
File: src/kintsugi/denoise/patch_based.py:33-49
```python
patches = []
for y in range(...):
    for x in range(...):
        patches.append(image[y:y+patch_size, x:x+patch_size])
return np.array(patches)  # Converts list to array
```
Fix: Pre-allocate array:
```python
n_patches = ((h - patch_size) // step + 1) * ((w - patch_size) // step + 1)
patches = np.zeros((n_patches, patch_size, patch_size), dtype=image.dtype)
```
3.4 Multiple Normalization Copies
File: src/kintsugi/denoise/filters.py:214-285
```python
img_norm = (img - img_min) / (img_max - img_min)  # Copy 1
# ... process ...
return result * (img_max - img_min) + img_min  # Copy 2
```
Fix: Use in-place operations where possible.
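A sketch of the in-place variant (the helper name is illustrative; it assumes a float image, since in-place division on integer dtypes would fail):

```python
import numpy as np

def normalize_inplace(img: np.ndarray) -> tuple:
    """Scale img to [0, 1] without allocating a second full-size array.
    Returns (img, img_min, img_max) so the caller can de-normalize later."""
    img_min = float(img.min())
    img_max = float(img.max())
    scale = img_max - img_min
    img -= img_min      # in place, no copy
    if scale > 0:       # guard against constant images
        img /= scale    # in place, no copy
    return img, img_min, img_max
```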
3.5 Multi-Pass Statistics Computation
File: src/kintsugi/mcp/tools/visualization.py:43-91
```python
"min": float(np.min(data_sample)),    # Pass 1
"max": float(np.max(data_sample)),    # Pass 2
"mean": float(np.mean(data_sample)),  # Pass 3
"std": float(np.std(data_sample)),    # Pass 4
```
Fix: Single pass with running statistics or scipy.stats.describe().
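A hedged sketch of a one-pass alternative for chunked data (names are illustrative; the merge step follows the standard parallel-variance formula, so it is safe for data too large to hold in memory at once):

```python
import numpy as np

def streaming_stats(chunks) -> dict:
    """Compute min, max, mean, std in a single pass over an iterable
    of array chunks, merging per-chunk moments as we go."""
    n, mean, m2 = 0, 0.0, 0.0
    lo, hi = np.inf, -np.inf
    for chunk in chunks:
        c = np.asarray(chunk, dtype=np.float64).ravel()
        lo = min(lo, float(c.min()))
        hi = max(hi, float(c.max()))
        cn, cmean, cvar = c.size, c.mean(), c.var()
        delta = cmean - mean
        total = n + cn
        # Chan et al. pairwise merge of sums of squared deviations
        m2 += cvar * cn + delta ** 2 * n * cn / total
        mean += delta * cn / total
        n = total
    return {"min": lo, "max": hi, "mean": mean, "std": (m2 / n) ** 0.5}
```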
3.6 Full Array Load for Line Profile
File: src/kintsugi/mcp/tools/visualization.py:294-295
```python
if hasattr(data, "compute"):
    data = data.compute()  # Loads entire image for single line!
```
Fix:
```python
profile = data[y, :].compute() if axis == "horizontal" else data[:, x].compute()
```
4. I/O Bottlenecks
4.1 Repeated File Metadata Checks (CRITICAL)
File: notebooks/Kreg/slide_tools.py:176-228
```python
def get_img_type(img_f):
    slide_io.check_is_ome(str(img_f))            # Opens file
    slide_io.check_to_use_vips(str(img_f))       # Opens file again
    slide_io.check_to_use_openslide(str(img_f))  # Opens file again
```
Impact: 3-5 file operations per file × 100 files = 300-500 syscalls.
Fix: Add an @lru_cache(maxsize=256) decorator:
```python
from functools import lru_cache

@lru_cache(maxsize=256)
def get_img_type(img_f):
    ...
```
4.2 Sequential Image Loading (CRITICAL)
File: notebooks/Kreg/serial_non_rigid.py:101-106
```python
img_list = [io.imread(os.path.join(src_dir, f)) for f in img_f_list]
```
Impact: 10-20x slower than parallel loading for 50+ images.
Fix:
```python
from concurrent.futures import ThreadPoolExecutor

img_paths = [os.path.join(src_dir, f) for f in img_f_list]
with ThreadPoolExecutor(max_workers=8) as executor:
    img_list = list(executor.map(io.imread, img_paths))
```
4.3 Multiple Glob Patterns per Directory
File: src/kintsugi/mcp/tools/workflow.py:59-92
```python
patterns = ["*.tif", "*.tiff", "*.zarr", "*.ome.tif", "*.ome.tiff"]
for pattern in patterns:
    for f in cycle_dir.glob(pattern):  # 5 filesystem scans!
```
Fix: Single directory scan with a combined suffix check (.ome.tif and .ome.tiff files already end in .tif/.tiff, so the suffix set covers all five patterns):
```python
all_files = list(cycle_dir.iterdir())
matches = [f for f in all_files if f.suffix.lower() in {'.tif', '.tiff', '.zarr'}]
```
4.4 Missing Caching on File Type Functions
File: notebooks/Kreg/slide_io.py:429-528
Functions check_is_ome(), check_to_use_openslide(), check_to_use_vips() have no caching.
Fix: Add @lru_cache decorators to all file-checking functions.
4.5 Inconsistent Parallel vs Sequential I/O
File: src/kintsugi/zarr_io.py
| Lines | Pattern | Performance |
|---|---|---|
| 765-773 | Sequential | Slow |
| 1179-1180 | Parallel | Fast |
Fix: Standardize on parallel pattern throughout.
4.6 Full Array Load for Thumbnails
File: src/kintsugi/mcp/tools/visualization.py:122-130
```python
data = data[::step, ::step].compute()  # Still loads full array first
```
Fix: Use dask’s native downsampling before compute.
Priority Matrix
| Priority | Issue | Location | Est. Speedup |
|---|---|---|---|
| P0 | O(n⁴) NLM denoising | patch_based.py:369 | 100x+ |
| P0 | O(n²) patch similarity | patch_based.py:167 | 10-50x |
| P0 | Sequential image loading | serial_non_rigid.py:101 | 10-20x |
| P1 | Repeated file metadata | slide_tools.py:176 | 3-5x |
| P1 | N+1 database queries | learning.py:268 | 5-10x |
| P1 | Full-size mask allocation | sam_wrapper.py:314 | 2-5x memory |
| P2 | Dask array copies | signal_isolation.py | 2x memory |
| P2 | Unvectorized overlap | postprocess.py:410 | 5-10x |
| P2 | Multiple glob patterns | workflow.py:59 | 2-3x |
| P3 | K-NN with full sort | cell_qc.py:504 | 3-5x |
| P3 | Multi-pass statistics | visualization.py:43 | 2x |
Recommendations
Immediate Actions (P0)
- Replace pixel-level NLM with cv2.fastNlMeansDenoising() or a scipy equivalent
- Add spatial indexing (KD-tree) for patch matching in BM3D-lite
- Parallelize image loading with ThreadPoolExecutor
Short-term (P1)
- Add @lru_cache to all file type checking functions
- Batch database operations in learning.py
- Use sparse arrays or tile-relative coordinates for SAM masks
Medium-term (P2)
- Remove unnecessary dask array copies
- Vectorize overlap matrix computation
- Consolidate glob patterns into a single directory scan
Long-term (P3)
- Replace np.argsort with np.argpartition for top-k selection
- Use single-pass statistics computation
- Standardize on async I/O for MCP tools
Testing Recommendations
After implementing fixes:
- Benchmark with representative image sizes (2K, 4K, 8K pixels)
- Profile memory usage with memory_profiler
- Test with various batch sizes (10, 50, 100 images)
- Verify numerical accuracy is preserved after optimizations