Re: [PR] Support load per-iteration replacement of NamedSPI [lucene]

via GitHub Mon, 03 Mar 2025 13:47:58 -0800


uschindler commented on PR #14275:
URL: https://github.com/apache/lucene/pull/14275#issuecomment-2695637219


   Hi,
   
   > I will throw in a real usecase that gives us a bit of headache: completion 
fields. All the existing codecs load them on heap, and we want to make a switch 
to load them off heap in certain situations. We could do so with a new codec, 
but that would only affect newly indexed data, and that is somehow weird as in 
this specific case it's only a matter of how data gets loaded, nothing to do 
with how it gets stored. Looks like loading off heap for existing data would 
only be possible via horrible hacks, if at all?
   
   I fully agree that this is a problem. But in general that should NOT be 
linked to SPI resolution. SPI is only there to load the correct codec to 
actually decode the index format. The decision whether data is loaded on heap 
or off heap is orthogonal.
   
   Maybe it is an XY problem: If Elasticsearch wants to exchange a codec and 
replace it with its own implementation that loads stuff on-heap or off-heap, or 
uses native code, this could be a requirement we have to decide on. With SPI 
alone, this won't work (unless you write some extra codec version into the 
index files). ES could do this, but it would make those indexes harder to 
maintain.
   
   I think we should differentiate:
   - The codec is there to actually understand the index format. If the index 
format is different, invent a new codec.
   - If the implementation used for some algorithmic decision needs to be 
configured (and is not tied to the index format, like on-heap or off-heap), 
this should be made configurable on the codec itself, without replacing the 
codec. If some decoding requires a specific hardware instruction set or 
similar, another SPI is needed (like VectorizationProvider in Lucene).
   
   How to make those configuration settings?
   
   One solution previously used in Lucene was to have codec parameters while 
writing (stored fields used this to enable/disable compression). To read, the 
only way to do this was system properties or static setters. I don't like that.
   
   My proposal would be: Let's add some key-value pairs of "codec options", as 
is done for Analyzers, that can be passed as part of the IndexWriterConfig 
(while writing) or passed to DirectoryReader (as an IndexReaderConfig => just a 
plain map). Maybe the keys should be standardized to take the codec name, 
followed by "." and then a key name.
   
   This map is taken by DirectoryReader.open(), saved as an instance field in 
DirectoryReader, and passed to each codec when requesting instances of fields, 
vectors, stored fields,... The codec can then look up a property by its codec 
name and key and apply it to its configuration. This could be used right away 
to configure encoding or compression of stored fields (while writing) or 
whether the FST should be loaded on heap or off-heap. Neither needs a different 
index format; it just selects options for how to handle encoding or decoding.
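
   To make the idea a bit more concrete, here is a rough sketch of how such an 
options map and the "codecName.key" lookup could look. Everything in it is an 
assumption for illustration only: `CodecOptions`, `SketchCompletionReader`, the 
codec name "SketchCompletion" and the key "fst.offheap" are hypothetical and do 
not exist in Lucene, and the sketch does not touch DirectoryReader or 
IndexWriterConfig at all.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: an immutable view over the proposed "codec options" map.
// None of these types are part of Lucene.
final class CodecOptions {
  private final Map<String, String> options;

  CodecOptions(Map<String, String> options) {
    // Defensive, immutable copy so readers cannot see later changes.
    this.options = Map.copyOf(options);
  }

  /** Looks up the value stored under "<codecName>.<key>", if any. */
  Optional<String> get(String codecName, String key) {
    return Optional.ofNullable(options.get(codecName + "." + key));
  }

  /** Boolean convenience lookup that falls back to a default when the key is absent. */
  boolean getBoolean(String codecName, String key, boolean defaultValue) {
    return get(codecName, key).map(Boolean::parseBoolean).orElse(defaultValue);
  }
}

// Stand-in for the read-side part of a codec that receives the options.
// The index format is unchanged; only the loading strategy is selected.
final class SketchCompletionReader {
  static final String CODEC_NAME = "SketchCompletion"; // hypothetical codec name

  final boolean loadFstOffHeap;

  SketchCompletionReader(CodecOptions options) {
    // Default to on-heap loading when the option is not set.
    this.loadFstOffHeap = options.getBoolean(CODEC_NAME, "fst.offheap", false);
  }
}
```

   With this, passing `Map.of("SketchCompletion.fst.offheap", "true")` would 
switch the reader to off-heap loading, while keys for other codec names are 
simply ignored by this codec.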


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

