uschindler commented on PR #868:
URL: https://github.com/apache/lucene/pull/868#issuecomment-1118585168

> Also - I don't really buy the idea that we can't support binary file formats - the entire index is filled with binary files. In this case we provide tools for generating these files, so users are free to regenerate them from source when the Lucene version changes. There's no need to support old formats backwards-compatibly.

This is still odd, because we have very little error handling in those file formats: the code was written to load them from the JAR file, so they are basically more or less a dump of the FST and ConnectionCosts. Sure, you can regenerate them, but what is the issue in then also calling `gradlew jar`? I think that's the main issue here: you need Lucene's source code anyway to build the dictionaries, and you have to put the source files somewhere, so you are actually forking Lucene at that point.

If we really want to support external dictionaries, we should refactor the API so you can load just one combined (CFS/ZIP-like) file that you can easily drop anywhere. This file would encode a version number, and if you load a file that does not match the current version, loading bails out.

What I would propose:

- Add a Gradle task that builds a dictionary package; it should be the same for Nori and Kuromoji, just with different input files.
- Have the same factory class and the exact same implementation for both dictionaries (I think @mocobeta is working on this). A user should then be able to load a single (ZIP-like) file and pass it to the analyzer/tokenizer, and it will automatically be Nori or Kuromoji, no matter what. The API is then very simple: `MorphologicalModel#load(aSingleFileNameOrURLOrInputStream)` (a rough sketch follows below).
- The default tokenizers shipped in Lucene have no custom ctors, so `JapaneseTokenizer` behind the scenes loads a single Japanese dictionary file from the classpath. Anybody wanting to load any other file will use a generic tokenizer impl. The Japanese one shipped with Lucene uses its default dictionary.

Maybe we could also put the tokenizer in its own separate JAR file (for both Japanese and Korean) and ship the default dictionaries as separate JAR files on Maven Central. The main disaster is the number of files, which also makes it very error-prone.
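For illustration, here is a minimal Java sketch of what such a versioned single-file load API could look like. Everything in it is hypothetical: the class name `MorphologicalModel`, the header layout, and the magic/version constants are made up for this comment, not existing Lucene code.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch of a combined-dictionary loader; names and file layout
// are invented for illustration only.
public final class MorphologicalModel {

  // Made-up header constants for the combined dictionary file.
  private static final int MAGIC = 0x4D4F5250; // "MORP"
  private static final int VERSION_CURRENT = 1;

  private MorphologicalModel() {}

  /**
   * Loads a combined (CFS/ZIP-like) dictionary file. The header encodes a
   * version number; loading bails out early if the file was built for a
   * different version, instead of failing later with an obscure decode error.
   */
  public static MorphologicalModel load(InputStream in) throws IOException {
    DataInputStream data = new DataInputStream(in);
    if (data.readInt() != MAGIC) {
      throw new IOException("not a morphological dictionary file");
    }
    int version = data.readInt();
    if (version != VERSION_CURRENT) {
      throw new IOException(
          "dictionary file version " + version
              + " does not match expected version " + VERSION_CURRENT
              + "; please regenerate the dictionary from source");
    }
    // ... decode the FST, ConnectionCosts, etc. from the rest of the stream,
    // whether the data is Japanese or Korean ...
    return new MorphologicalModel();
  }
}
```

With such an API, a default `JapaneseTokenizer` could simply call something like `MorphologicalModel.load(JapaneseTokenizer.class.getResourceAsStream("japanese.dict"))` behind the scenes (resource name invented here), while the generic tokenizer impl accepts whatever model the user loaded from a file, URL, or stream.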
> Also - I don't really buy the idea that we can't support binary file formats - the entire index is filled with binary files. In this case we provide tools for generating these files, so users are free to regenerate them from source when Lucene version changes. There's no need to backwards-compatibly support old formats. This is still odd, because we have not much error handling in those file formats, because the code was written to load it from the JAR file, so it is basically more or less a dump of the FST and ConnectionCosts. Sure you can regenerate them, but what is the issue in then also call `gradlew jar`? I think that's the main issue here: You need Lucene's source code anyways to build the dictionaries, you have to put the source files somewhere, so you actually forking lucene at that point. If we really want to support external dictionaries we should refactor the API so you can load just one combined (CFS/ZIP like file) that you can easily drop anywhere. This file would encode some version number in it and if you load a file thats not using actual version it bails out. What I would propose: - Add a gradle task that builds a dictionary package and that should be the same for Nori and Kuromoji, just different input files - Have the same factory class and exact same implementation for both dictionaries (I think @mocobeta is working on this). So a user should be able to load a single (zip-like) file and pass it to analyzer/tokenizer and it will automatically be Nori or Kuromojo, no matter what. The API is then very simple: `MorphologicalModel#load(aSingleFileNameOrURLOrInputStream)` - The default Tokenizers shipped in Lucene have no custom ctors, so JapaneseTokenizer behind the scenes loads a single japanese dictionary file from classpath. Anybody wanting to load any other file will use a generic tokenizer impl. The Japanese one shipped with lucene uses its default dictionary. Maybe we could also put the tokenizer in its separate JAR file (for both Japan and Korea) and ship the defacult dictionaries as separate JAR files on Maven central. The main desaster is the number of files which also makes it very error-prone. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org