atris opened a new issue, #14997: URL: https://github.com/apache/lucene/issues/14997
### Description

**Summary**

This RFC proposes implementing SPANN (Space Partition Approximate Nearest Neighbor) as an additional vector format for Lucene. SPANN provides an alternative approach to vector search that excels in different scenarios than our current HNSW implementation, particularly memory-constrained environments and filtered search workloads. SPANN uses spatial partitioning instead of graph structures, enabling different performance characteristics and resource usage patterns. This gives users a choice of vector search implementation based on their specific requirements.

**Problem Statement**

Production vector search deployments often face constraints that current implementations struggle to address:

- Memory scalability: Large vector indices require substantial memory for optimal performance, limiting deployment options and increasing infrastructure costs.
- Filtered search efficiency: When combining vector similarity with document filters, current approaches examine vectors that may not match the filter criteria.
- Merge resource usage: Segment merge operations for vector data can require significant temporary memory allocation.
- Hardware flexibility: Different deployment environments have varying memory, CPU, and storage characteristics that could benefit from different algorithmic approaches.

**SPANN Solution Overview**

SPANN addresses these challenges through spatial partitioning. Vectors are clustered into spatial regions during indexing, with only lightweight partition metadata kept in memory. Search operations filter partitions based on spatial proximity before examining individual vectors.
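To make the two-phase idea concrete, here is a toy, self-contained sketch of partition-first search: rank partitions by centroid distance, then exhaustively scan only the closest few. All names here (`SpannToySearch`, `Partition`, the `probe` parameter) are illustrative and not part of the proposed Lucene API.

```java
import java.util.*;

// Toy illustration of SPANN-style two-phase search: Phase 1 ranks partitions
// by centroid proximity to the query; Phase 2 scans only the top partitions.
class SpannToySearch {

    static final class Partition {
        final float[] centroid;
        final List<float[]> vectors = new ArrayList<>();
        Partition(float[] centroid) { this.centroid = centroid; }
    }

    static float squaredDistance(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            float d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Phase 1: order partition ids by centroid distance to the query.
    static int[] rankPartitions(List<Partition> partitions, float[] query) {
        Integer[] ids = new Integer[partitions.size()];
        for (int i = 0; i < ids.length; i++) ids[i] = i;
        Arrays.sort(ids, Comparator.comparingDouble(
            i -> squaredDistance(query, partitions.get(i).centroid)));
        int[] out = new int[ids.length];
        for (int i = 0; i < out.length; i++) out[i] = ids[i];
        return out;
    }

    // Phase 2: exhaustively scan only the `probe` closest partitions and
    // return the nearest vector found there.
    static float[] search(List<Partition> partitions, float[] query, int probe) {
        float best = Float.MAX_VALUE;
        float[] bestVec = null;
        int[] ranked = rankPartitions(partitions, query);
        for (int p = 0; p < Math.min(probe, ranked.length); p++) {
            for (float[] v : partitions.get(ranked[p]).vectors) {
                float d = squaredDistance(query, v);
                if (d < best) { best = d; bestVec = v; }
            }
        }
        return bestVec;
    }
}
```

Only the centroids stay memory-resident; the vectors inside non-selected partitions are never touched, which is the source of SPANN's bounded memory and filtered-search advantages.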
This approach provides:

- Bounded memory usage independent of index size
- Efficient filtered search through partition pre-filtering
- Linear merge complexity with predictable resource usage
- Sequential I/O patterns optimized for modern storage

**Technical Architecture**

**Core Algorithm Details**

Partitioning Algorithm: SPANN uses k-means++ initialization followed by Lloyd's iteration for partition creation. The k-means++ initialization provides better initial centroid placement than random selection, leading to higher-quality partitions:

```java
// K-means++ initialization: seed the first centroid at random, then pick
// each subsequent centroid with probability proportional to its distance
// from the nearest centroid chosen so far.
float[][] selectInitialCentroids(float[][] vectors, int k) {
  float[][] centroids = new float[k][];
  centroids[0] = vectors[random.nextInt(vectors.length)];
  for (int i = 1; i < k; i++) {
    float[] distances = computeMinDistances(vectors, centroids, i);
    float[] probabilities = computeProbabilities(distances);
    centroids[i] = selectWithProbability(vectors, probabilities);
  }
  return centroids;
}

// Lloyd's iteration until convergence. Returns the refined centroids:
// reassigning the local `centroids` variable alone would be invisible
// to the caller.
float[][] refinePartitions(float[][] vectors, float[][] centroids, int maxIters) {
  for (int iter = 0; iter < maxIters; iter++) {
    int[] assignments = assignToNearestCentroid(vectors, centroids);
    float[][] newCentroids = recomputeCentroids(vectors, assignments);
    if (hasConverged(centroids, newCentroids, CONVERGENCE_THRESHOLD)) {
      break;
    }
    centroids = newCentroids;
  }
  return centroids;
}
```

Partition Count Selection: The number of partitions is determined dynamically from the vector count and dimensionality:

```java
int computeOptimalPartitionCount(int numVectors, int dimensions) {
  // Target ~10K vectors per partition for optimal I/O
  int basePartitions = (int) Math.ceil(numVectors / 10000.0);
  // High-dimensional spaces need more, smaller partitions
  double dimensionFactor = Math.sqrt(dimensions / 128.0);
  int adjustedPartitions = (int) (basePartitions * dimensionFactor);
  // Bounds to keep partition counts reasonable
  return Math.max(10, Math.min(10000, adjustedPartitions));
}
```

Partition Selection During Search: The number of partitions to examine is determined by the search k value and partition quality:

```java
int computePartitionsToExamine(int k, int totalPartitions, float[] queryVector, float[][] centroids) {
  // Base expansion factor for approximate search
  double expansionFactor = 1.5 + Math.log10(k);
  int baseExamine = (int) Math.ceil(k * expansionFactor);

  // Track the query's two nearest centroid distances
  float nearestDist = Float.MAX_VALUE;
  float secondDist = Float.MAX_VALUE;
  for (float[] centroid : centroids) {
    float dist = computeDistance(queryVector, centroid);
    if (dist < nearestDist) {
      secondDist = nearestDist;
      nearestDist = dist;
    } else if (dist < secondDist) {
      secondDist = dist;
    }
  }

  // If the query is very close to one centroid, examine fewer partitions
  float ratio = nearestDist / secondDist;
  if (ratio < 0.5) {
    baseExamine = (int) (baseExamine * 0.7);
  }

  // Never examine more than 10% of partitions (but always at least one)
  return Math.max(1, Math.min(baseExamine, totalPartitions / 10));
}
```

**File Format Specifications**

Metadata File (.spm):

```
Header:
- Magic: int32 (0x53504D30) // "SPM0"
- Version: int32
- CodecHeader: standard Lucene codec header
Partition Metadata:
- NumPartitions: vint
- Dimensions: vint
- For each partition:
  - PartitionId: vint
  - NumVectors: vint
  - ByteOffset: vlong (offset in .spd file)
  - ByteLength: vlong (compressed size if applicable)
  - Centroid: float[dimensions]
  - RadiusSquared: float (max distance from centroid)
Footer:
- CodecFooter: standard Lucene checksum footer
```

Vector Data File (.spd):

```
Header:
- Magic: int32 (0x53504430) // "SPD0"
- Version: int32
- CodecHeader: standard Lucene codec header
For each partition (in order):
- PartitionHeader:
  - NumVectors: vint
  - CompressionType: byte (0=none, 1=bulk)
- For each vector in partition:
  - DocId: vint (delta-encoded within partition)
  - Vector: float[dimensions] or byte[compressedSize]
Footer:
- CodecFooter: standard Lucene checksum footer
```

Index File (.spi):

```
Header:
- Magic: int32 (0x53504930) // "SPI0"
- Version: int32
- CodecHeader: standard Lucene codec header
Document Index:
- NumDocs: vint
- For each document (in docId order):
  - PartitionId: vint
  - IntraPartitionOrd: vint
Auxiliary Structures:
- DeletedDocs: RoaringBitmap (if any deletes)
- FieldInfo: standard Lucene field metadata
Footer:
- CodecFooter: standard Lucene checksum footer
```

**Memory Layout and Access Patterns**

In-Memory Structures:

```java
class SpannSearcher {
  // Primary memory-resident data
  final float[][] partitionCentroids; // k * dimensions * 4 bytes
  final float[] partitionRadii;       // k * 4 bytes
  final long[] partitionOffsets;      // k * 8 bytes
  final int[] partitionSizes;         // k * 4 bytes

  // Derived data for optimization
  final float[] centroidNorms;        // k * 4 bytes (for dot product)
  final PartitionStats[] stats;       // k * 32 bytes (min/max/mean)

  // Total memory: ~k * (dimensions * 4 + 52) bytes
  // For 1000 partitions, 768 dimensions: ~3.1MB
}
```

Sequential Access Optimization:

```java
void searchPartition(int partitionId, float[] query, KnnCollector collector) {
  long offset = partitionOffsets[partitionId];
  int size = partitionSizes[partitionId];

  // Bulk read the entire partition for sequential access
  ByteBuffer buffer = ByteBuffer.allocateDirect(size * vectorBytes);
  vectorData.seek(offset);
  vectorData.readBytes(buffer);

  // Process vectors with CPU cache-friendly access
  for (int i = 0; i < size; i++) {
    int docId = readVInt(buffer);
    float[] vector = readVector(buffer);
    // Skip vectors whose score cannot enter the current top-k
    float score = computeSimilarity(query, vector);
    if (score > collector.minCompetitiveSimilarity()) {
      collector.collect(docId, score);
    }
  }
}
```

**Merge Strategy Implementation Details**

Partition Quality Metrics:

```java
class PartitionQuality {
  float computeQuality(PartitionInfo partition) {
    // Cohesion: how close vectors are to the centroid
    float avgDistToCentroid = partition.sumDistances / partition.numVectors;
    float normalizedCohesion = 1.0f / (1.0f + avgDistToCentroid);

    // Balance: how evenly sized partitions are
    // (float division; clamped at 0 for badly oversized partitions)
    float targetSize = (float) totalVectors / numPartitions;
    float sizeRatio = partition.numVectors / targetSize;
    float balance = Math.max(0.0f, 1.0f - Math.abs(1.0f - sizeRatio));

    // Separation: how far this partition is from the others
    float minDistToOther = Float.MAX_VALUE;
    for (PartitionInfo other : partitions) {
      if (other != partition) {
        float dist = computeDistance(partition.centroid, other.centroid);
        minDistToOther = Math.min(minDistToOther, dist);
      }
    }
    float separation = minDistToOther / globalMaxDistance;

    // Weighted quality score
    return 0.4f * normalizedCohesion + 0.3f * balance + 0.3f * separation;
  }
}
```

Adaptive Merge Decision Logic:

```java
MergeStrategy selectMergeStrategy(MergeState mergeState) {
  float changeRatio = (float) mergeState.newVectorCount / mergeState.existingVectorCount;
  float avgQuality = computeAveragePartitionQuality();

  if (changeRatio < 0.05 && avgQuality > 0.8) {
    // Tiny merge with good partitions: preserve structure
    return MergeStrategy.PRESERVE_PARTITIONS;
  } else if (changeRatio < 0.2 && avgQuality > 0.6) {
    // Small merge: rebalance only the worst partitions
    return MergeStrategy.SELECTIVE_REBALANCE;
  } else if (changeRatio < 0.5) {
    // Medium merge: incremental repartitioning
    return MergeStrategy.INCREMENTAL_REPARTITION;
  } else {
    // Large merge: full repartitioning for optimal structure
    return MergeStrategy.FULL_REPARTITION;
  }
}
```

**Search Algorithm Implementation**

Multi-Level Partition Filtering:

```java
List<Integer> selectPartitionsToSearch(float[] query, int k, SearchParams params) {
  // Phase 1: compute all centroid distances
  PriorityQueue<PartitionScore> pq =
      new PriorityQueue<>(Comparator.comparingDouble(p -> p.distance));
  for (int i = 0; i < numPartitions; i++) {
    float distance = computeDistance(query, partitionCentroids[i]);
    pq.offer(new PartitionScore(i, distance));
  }

  // Phase 2: select partitions using an adaptive threshold
  List<Integer> selected = new ArrayList<>();
  float minDistance = pq.peek().distance;
  float threshold = minDistance * params.getExpansionFactor(k);
  while (!pq.isEmpty() && selected.size() < maxPartitionsToExamine) {
    PartitionScore ps = pq.poll();
    if (ps.distance > threshold && selected.size() >= k) {
      break; // Have enough partitions
    }
    selected.add(ps.partitionId);
  }

  // Phase 3: add neighbor partitions for boundary cases
  if (params.enableBoundarySearch) {
    addNeighborPartitions(selected, query);
  }
  return selected;
}
```

Optimized Within-Partition Search:

```java
void searchPartitionOptimized(int partitionId, float[] query,
    KnnCollector collector, Bits acceptDocs) throws IOException {
  // Memory-mapped access for large partitions
  if (partitionSizes[partitionId] > MMAP_THRESHOLD) {
    searchPartitionMapped(partitionId, query, collector, acceptDocs);
    return;
  }

  // Bulk load for smaller partitions
  VectorReader reader = loadPartition(partitionId);

  // SIMD-friendly processing loop
  int vecSize = reader.size();
  int simdWidth = SimdUtils.getWidth();

  // Process vectors in SIMD-width batches
  for (int i = 0; i < vecSize; i += simdWidth) {
    int batchSize = Math.min(simdWidth, vecSize - i);
    float[] scores = new float[batchSize];

    // Vectorized similarity computation
    SimdUtils.computeSimilarityBatch(query, reader, i, batchSize, scores);

    // Collect results
    for (int j = 0; j < batchSize; j++) {
      int docId = reader.getDocId(i + j);
      if (acceptDocs == null || acceptDocs.get(docId)) {
        collector.collect(docId, scores[j]);
      }
    }
  }
}
```

**Configuration and Tuning Parameters**

```java
public class SpannConfig {
  // Partitioning parameters
  int minPartitionSize = 1000;            // Minimum vectors per partition
  int maxPartitionSize = 50000;           // Maximum vectors per partition
  int targetVectorsPerPartition = 10000;  // Optimal partition size

  // Clustering parameters
  int kmeansMaxIterations = 50;       // Max iterations for convergence
  double kmeansConvergence = 0.001;   // Convergence threshold
  int kmeansSampleSize = 100000;      // Sample size for large datasets

  // Search parameters
  double baseExpansionFactor = 1.5;   // Base partition expansion
  double expansionGrowthRate = 0.1;   // Growth per log10(k)
  int maxPartitionFraction = 10;      // At most 1/10th of partitions examined

  // Performance parameters
  int mmapThreshold = 1048576;        // Use mmap for partitions > 1MB
  int bulkReadSize = 65536;           // Bulk I/O size in bytes
  boolean enableSimd = true;          // Use SIMD optimizations

  // Merge parameters
  float qualityThreshold = 0.6f;      // Min quality to preserve partitions
  float rebalanceThreshold = 0.8f;    // Quality threshold for rebalancing
}
```

**Design Considerations and Trade-offs**

**Algorithmic Design Decisions**

K-means++ vs. Random Initialization: K-means++ requires more computation during initialization but produces significantly more stable partitions. The additional O(n * k * d) seeding cost is worthwhile given that poor initial centroids can lead to highly imbalanced partitions that degrade search performance.

Partition Count Formula: The target of ~10K vectors per partition balances several factors: smaller partitions enable more selective filtering but increase metadata overhead, while larger partitions reduce overhead but force examination of more irrelevant vectors. The dimension-based adjustment accounts for the "curse of dimensionality": high-dimensional spaces require more partitions to maintain effective separation.

Quality Metric Weighting: The partition quality metric weights cohesion (0.4), balance (0.3), and separation (0.3). These weights reflect that cohesion is most critical for search accuracy, while balance and separation prevent pathological cases such as single-vector partitions or overlapping clusters.

**Memory vs. Accuracy Trade-offs**

Partition Examination Strategy: The expansion factor (1.5 + log₁₀(k)) grows slowly with k to balance recall quality against computational cost. Linear growth would examine too many partitions for large k values, while a constant expansion would miss relevant results for small k queries.

Boundary Handling: The optional neighbor-partition addition addresses the boundary problem where optimal results might lie near partition edges.
This adds computational overhead but can significantly improve recall in edge cases.

**Implementation Complexity Considerations**

File Format Evolution: The three-file format enables independent optimization of different data access patterns while maintaining backward compatibility. Metadata and index files are typically small and can be memory-resident, while vector data can be streamed or memory-mapped depending on access patterns.

Merge Strategy Selection: The adaptive merge logic reflects different cost/benefit ratios at different scales. Small merges benefit from structure preservation to avoid repartitioning overhead, while large merges benefit from global optimization despite the computational cost.

**Performance Characteristics**

**Complexity Analysis**

Indexing Complexity:
- Vector assignment: O(n * k * d), where n = vectors, k = partitions, d = dimensions
- K-means clustering: O(k * n * d * iterations)
- Total flush complexity: O(n * k * d * log(n/k)) including sorting

Search Complexity:
- Partition selection: O(k * d) for centroid distance computation
- Within-partition search: O(p * n_p * d), where p = partitions examined, n_p = vectors per partition
- Total search: O(k * d + p * n_p * d), typically much less than HNSW's O(log(n) * d * M)

Merge Complexity:
- Preserve partitions: O(n) simple copying
- Selective rebalance: O(n_affected * k * d) for affected partitions only
- Full repartition: O(n * k * d * iterations), same as initial indexing

**Memory Usage Patterns**

Indexing memory during flush:
- Vector buffer: n * d * 4 bytes (temporary)
- Clustering workspace: k * d * 4 + k * 8 bytes
- Total: O(n * d) temporary, O(k * d) permanent

Search memory, permanent:
- Centroids: k * d * 4 bytes
- Metadata: k * 24 bytes
- Total: ~3MB for a typical configuration

Search memory, per query:
- Priority queue: k * 12 bytes
- Partition buffer: p * n_p * d * 4 bytes (can be streamed)

**Detailed Implementation Phases**

Phase 1: Foundation Implementation

Core Format Infrastructure:
- Implement the Lucene95SpannVectorsFormat class hierarchy
- Basic three-file format with standard Lucene headers/footers
- Simple k-means partitioning algorithm for initial clustering
- Integration with SegmentWriteState and SegmentReadState

Basic Writer Implementation:
- Lucene95SpannVectorsWriter with vector accumulation and flush-time partitioning
- Partition assignment algorithms and storage layout optimization
- Integration with the KnnFieldVectorsWriter interface
- Memory usage tracking via the Accountable interface

Basic Reader Implementation:
- Lucene95SpannVectorsReader with partition metadata loading
- Simple sequential search within selected partitions
- Integration with the existing KnnCollector framework
- Standard integrity checking and validation

Testing Infrastructure:
- Unit tests for partitioning algorithms and the file format
- Integration tests with the existing Lucene test framework
- Basic performance validation against simple datasets
- Compatibility testing with the compound file format

Deliverables: Working SPANN implementation with basic functionality, passing all compatibility tests.
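The `.spd` layout above stores doc ids as delta-encoded vints within each partition. The following stdlib-only sketch shows that encoding; it is illustrative, since in the real implementation Lucene's `DataOutput`/`DataInput` would supply `writeVInt`/`readVInt`.

```java
import java.io.*;

// Sketch of delta + variable-length-int doc id encoding, as described for
// the .spd per-partition vector records. Doc ids within a partition are
// sorted, so deltas stay small and usually fit in one vint byte.
class VIntDeltaDocIds {

    // 7 bits per byte, high bit set on all but the final byte.
    static void writeVInt(ByteArrayOutputStream out, int v) {
        while ((v & ~0x7F) != 0) {
            out.write((v & 0x7F) | 0x80);
            v >>>= 7;
        }
        out.write(v);
    }

    static int readVInt(ByteArrayInputStream in) {
        int b = in.read();
        int v = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in.read();
            v |= (b & 0x7F) << shift;
        }
        return v;
    }

    static byte[] encode(int[] sortedDocIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int docId : sortedDocIds) {
            writeVInt(out, docId - prev); // store the gap, not the id
            prev = docId;
        }
        return out.toByteArray();
    }

    static int[] decode(byte[] bytes, int count) {
        ByteArrayInputStream in = new ByteArrayInputStream(bytes);
        int[] out = new int[count];
        int prev = 0;
        for (int i = 0; i < count; i++) {
            prev += readVInt(in); // re-accumulate gaps into absolute ids
            out[i] = prev;
        }
        return out;
    }
}
```

For the sorted ids {3, 10, 1000, 100003}, the gaps {3, 7, 990, 99003} encode to 1 + 1 + 2 + 3 = 7 bytes instead of 16 bytes of fixed-width ints.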
Phase 2: Search Optimization

Partition Selection Algorithms:
- Sophisticated partition ranking based on centroid distances and sizes
- Early termination logic to avoid examining distant partitions
- Integration with query-time parameters (k, accuracy requirements)
- Support for different distance metrics and similarity functions

Access Pattern Optimization:
- Random-access implementation for sparse partition selection
- Sequential streaming for dense partition examination
- Automatic selection between access patterns based on query characteristics
- Memory-mapped I/O integration for large partitions

Within-Partition Optimizations:
- Vectorized distance computations where supported
- Cache-friendly data layout and processing patterns
- SIMD utilization hints for modern CPU architectures
- Prefetching strategies for predictable access patterns

Search API Integration:
- Full compatibility with existing search APIs and parameters
- Integration with document filtering (the Bits interface)
- Support for both float and byte vector encodings
- Proper handling of deleted documents and live docs

Deliverables: Optimized search implementation competitive with existing formats on standard benchmarks.

Phase 3: Advanced Merge Implementation

Adaptive Merge Strategies:
- Algorithms for deciding when to preserve vs. rebuild partition structures
- Partition quality metrics for rebalancing decisions
- Integration with segment size predictions and merge policy hints
- Handling of edge cases (empty segments, single-vector segments)

Merge Performance Optimization:
- Bulk copying optimizations for preserved partitions
- Efficient repartitioning algorithms for structure updates
- Memory usage minimization during merge operations
- Progress reporting and cancellation support

Advanced Partitioning Algorithms:
- Beyond basic k-means: hierarchical clustering, density-based approaches
- Partition size balancing and split/merge logic
- Handling of outlier vectors and sparse regions
- Incremental partition updates for small merges

Integration with the Merge Framework:
- Proper handling of MergeState and document mapping
- Integration with merge schedulers and policies
- Support for merge-time document sorting and field updates
- Compatibility with merge abortion and error recovery

Deliverables: Production-ready merge implementation with superior performance characteristics.
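One possible realization of the "incremental partition updates for small merges" item above: assign incoming vectors to the nearest surviving centroid, and flag only partitions that grow past a size bound for a local re-split. Names and thresholds here are illustrative, not part of the proposal's API.

```java
import java.util.*;

// Toy incremental-repartition step for a small merge: new vectors join
// existing partitions; only oversized partitions are later re-split locally,
// avoiding a full k-means pass over the merged segment.
class IncrementalRepartition {

    static float squaredDistance(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            float d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    static int nearestCentroid(float[][] centroids, float[] v) {
        int best = 0;
        float bestDist = Float.MAX_VALUE;
        for (int i = 0; i < centroids.length; i++) {
            float d = squaredDistance(centroids[i], v);
            if (d < bestDist) { bestDist = d; best = i; }
        }
        return best;
    }

    // Returns updated per-partition sizes after assigning the new vectors.
    static int[] assignNewVectors(float[][] centroids, int[] currentSizes, List<float[]> newVectors) {
        int[] sizes = currentSizes.clone();
        for (float[] v : newVectors) {
            sizes[nearestCentroid(centroids, v)]++;
        }
        return sizes;
    }

    // Partitions above the bound (cf. SpannConfig.maxPartitionSize) are
    // candidates for a local split; everything else is left untouched.
    static List<Integer> partitionsToSplit(int[] sizes, int maxPartitionSize) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < sizes.length; i++) {
            if (sizes[i] > maxPartitionSize) out.add(i);
        }
        return out;
    }
}
```

This keeps the cost of a small merge proportional to the number of new vectors plus the affected partitions, matching the O(n_affected * k * d) selective-rebalance bound given in the complexity analysis.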
Phase 4: Production Hardening

Memory Management Refinement:
- Precise memory accounting and reporting
- Integration with Lucene's memory pressure monitoring
- Configurable memory limits and adaptive behavior
- Garbage collection optimization and reduced allocation

I/O Performance Optimization:
- Advanced memory-mapped I/O strategies
- Async I/O support where beneficial
- Integration with Lucene's directory abstraction optimizations
- Proper handling of different storage types (SSD, HDD, network)

Configuration and Tuning:
- Exposed parameters for partition count, size targets, and algorithms
- Auto-tuning based on vector characteristics and hardware
- Performance monitoring and diagnostic capabilities
- Integration with existing Lucene monitoring infrastructure

Error Handling and Recovery:
- Comprehensive error handling for corruption scenarios
- Recovery procedures for partially written segments
- Integration with Lucene's checksum and integrity verification
- Proper resource cleanup and exception safety

Deliverables: Robust, production-ready implementation with comprehensive error handling and monitoring.

Phase 5: Integration and Validation

Comprehensive Testing:
- Large-scale testing with realistic datasets and query patterns
- Performance benchmarking across different hardware configurations
- Memory usage validation under various load patterns
- Compatibility testing with all Lucene features (compound files, checksums, etc.)
Migration and Compatibility:
- Tools for migrating from existing vector formats
- Mixed-format index support during transition periods
- Backward compatibility testing with existing applications
- Documentation for migration procedures and best practices

Performance Analysis:
- Detailed benchmarking against existing implementations
- Analysis of performance characteristics across different scenarios
- Memory usage profiling and optimization validation
- I/O pattern analysis and storage optimization verification

Documentation and Tooling:
- Complete API documentation and usage examples
- Performance tuning guides and best practices
- Administrative tools for index analysis and optimization
- Integration with the existing Lucene tooling ecosystem

Deliverables: Fully validated implementation ready for production deployment with complete documentation.

Phase 6: Community Integration

Code Review and Refinement:
- Community review of implementation and design decisions
- Code quality improvements and consistency with Lucene standards
- Performance validation by community members
- Integration feedback and final adjustments

Documentation Finalization:
- User documentation and migration guides
- Developer documentation for extensibility
- Performance characteristics and tuning recommendations
- Troubleshooting guides and common issues

Release Preparation:
- Final testing and validation
- Release notes and changelog preparation
- Coordination with the Lucene release process
- Community communication and adoption planning

Deliverables: Implementation ready for inclusion in a Lucene release with full community validation.
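The validation and benchmarking work in the phases above ultimately reduces to measuring recall against brute-force ground truth. A minimal recall@k check can be sketched as follows; it is purely illustrative, and real benchmarks would use a harness such as luceneutil rather than this hypothetical `RecallAtK` helper.

```java
import java.util.*;

// Minimal recall@k measurement: compare an approximate result list against
// exact ground truth computed by brute force.
class RecallAtK {

    static float squaredDistance(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            float d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Exact top-k by scanning every vector: the ground truth baseline.
    static List<Integer> bruteForceTopK(float[][] vectors, float[] query, int k) {
        Integer[] ids = new Integer[vectors.length];
        for (int i = 0; i < ids.length; i++) ids[i] = i;
        Arrays.sort(ids, Comparator.comparingDouble(i -> squaredDistance(query, vectors[i])));
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < Math.min(k, ids.length); i++) out.add(ids[i]);
        return out;
    }

    // Fraction of the true neighbors that the approximate search recovered.
    static double recallAtK(Set<Integer> groundTruth, List<Integer> approximate, int k) {
        int hits = 0;
        for (int i = 0; i < Math.min(k, approximate.size()); i++) {
            if (groundTruth.contains(approximate.get(i))) hits++;
        }
        return (double) hits / groundTruth.size();
    }
}
```

Sweeping the number of probed partitions and plotting recall@k against query latency gives the recall/cost curve needed for a fair comparison with HNSW.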
**Open Technical Questions**

Algorithm Variations:
- Alternative clustering algorithms (mini-batch k-means, hierarchical clustering) may offer different trade-offs between partitioning quality and computational cost
- Dynamic partition count adjustment based on distribution characteristics could optimize for different vector datasets
- Learned partition boundaries using ML techniques might improve upon traditional clustering methods

Optimization Opportunities:
- Product quantization within partitions could provide compression benefits while maintaining search quality
- Hierarchical partitioning for very large indices might scale better than flat partitioning

Integration Considerations:
- Interaction with Lucene's emerging vector quantization work needs evaluation for compatibility and optimization opportunities
- Support for incremental indexing patterns beyond the current segment-based approach
- Integration with distributed search scenarios and cross-segment optimization opportunities

Validation Requirements:
- Comprehensive benchmarking across diverse vector datasets to validate algorithmic assumptions
- Memory usage validation under realistic production workloads and various JVM configurations
- Development of a performance comparison methodology for fair evaluation against existing implementations

**Conclusion**

SPANN provides a production-ready alternative to current vector search implementations, with clear technical specifications for core algorithms and data structures. The implementation leverages Lucene's architectural strengths while providing predictable performance characteristics and resource usage patterns.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.