uschindler commented on PR #14275: URL: https://github.com/apache/lucene/pull/14275#issuecomment-2695637219
Hi,

> I will throw in a real use case that gives us a bit of headache: completion fields. All the existing codecs load them on heap, and we want to make a switch to load them off heap in certain situations. We could do so with a new codec, but that would only affect newly indexed data, and that is somehow weird as in this specific case it's only a matter of how data gets loaded, nothing to do with how it gets stored. Looks like loading off heap for existing data would only be possible via horrible hacks, if at all?

I fully agree that this is a problem. But in general that should NOT be linked to the SPI resolving. SPI is only there to load the correct codec to actually decode the index format. The decision whether that should be done on heap or off heap is orthogonal.

Maybe it is an XY problem: If Elasticsearch wants to exchange a codec and replace it with its own implementation that loads stuff on-heap or off-heap or uses native code, this could be a requirement we have to decide on. With SPI alone, this won't work (unless you write some extra codec version into the index files). ES could do this, but it would make those indexes harder to maintain.

I think we should differentiate:

- The codec is there to actually understand the index format. If the index format is different, invent a new codec.
- If the implementation used for some algorithmic decision needs to be configured (which is not tied to the index format, like on-heap or off-heap), this should be made configurable on the codec itself, but not replace the codec. If some decoding requires a specific hardware instruction set or similar, another SPI is needed (like VectorizationProvider in Lucene).

How to make those configuration settings? One solution previously used in Lucene was to have codec parameters while writing (stored fields used this to enable/disable compression). To read, the only way to do this was system properties or static setters. I don't like that.
My proposal would be: Let's add some key-value pairs of "codec options", as is done in Analyzers, that can be passed as part of the IndexWriterConfig (while writing) or passed to DirectoryReader (as an IndexReaderConfig => just a plain map). Maybe the keys should be standardized to take the codec name, followed by "." and then a key name. This map is taken by DirectoryReader.open(), saved as an instance field in DirectoryReader, and passed to each codec when requesting instances of fields, vectors, stored fields, ... The codec can then look up the property by its codec name and key and apply it to its configuration.

This could be used right away to configure encoding or compression of stored fields (while writing) or whether the FST should be loaded on heap or off-heap. Neither needs a different index format; it just selects options for how to handle encoding or decoding.
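As a rough sketch of the idea only (this is not an existing Lucene API; `CodecOptions`, the codec name `MyCodec` and the key `fst.offheap` are made-up names for illustration), the "codec name, then `.`, then key" lookup on a plain map could look like this:

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: this class does not exist in Lucene. It only
// illustrates the "<codecName>.<key>" naming convention on a plain map.
final class CodecOptions {
    private final Map<String, String> options;

    CodecOptions(Map<String, String> options) {
        // Defensive, immutable copy so that readers and writers cannot
        // change the options after the map has been handed over.
        this.options = Map.copyOf(options);
    }

    // Returns the raw value stored under "<codecName>.<key>", if any.
    Optional<String> get(String codecName, String key) {
        return Optional.ofNullable(options.get(codecName + "." + key));
    }

    // Convenience lookup for on/off switches such as on-heap vs. off-heap.
    boolean getBoolean(String codecName, String key, boolean defaultValue) {
        return get(codecName, key).map(Boolean::parseBoolean).orElse(defaultValue);
    }
}
```

A codec that receives such an object when it is asked for its readers could then call something like `options.getBoolean("MyCodec", "fst.offheap", false)` while loading the FST; keys belonging to other codecs are simply ignored, and the index format stays untouched.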