uschindler commented on PR #868:
URL: https://github.com/apache/lucene/pull/868#issuecomment-1118585168

   > Also - I don't really buy the idea that we can't support binary file formats - the entire index is filled with binary files. In this case we provide tools for generating these files, so users are free to regenerate them from source when Lucene version changes. There's no need to backwards-compatibly support old formats.
   
   This is still odd, because we have very little error handling in those file formats: the code was written to load them from the JAR file, so they are basically a raw dump of the FST and the ConnectionCosts. Sure, you can regenerate them, but then what is the problem with also calling `gradlew jar`? I think that's the main issue here: you need Lucene's source code anyway to build the dictionaries, and you have to put the source files somewhere, so you are effectively forking Lucene at that point.
   
   If we really want to support external dictionaries, we should refactor the API so you can load just one combined file (CFS/ZIP-like) that you can easily drop anywhere. This file would encode a version number, and loading a file that does not match the current version would bail out.
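
   A minimal sketch of what that version check could look like (the class name, magic value, and constants here are all hypothetical, not existing Lucene APIs):

   ```java
   import java.io.DataInputStream;
   import java.io.IOException;
   import java.io.InputStream;

   // Hypothetical header check for the proposed combined dictionary file.
   // The magic value and version constant are made up for this sketch.
   public final class CombinedDictionaryHeader {
     private static final int MAGIC = 0x4D4F5244; // arbitrary placeholder
     private static final int CURRENT_VERSION = 1;

     public static DataInputStream openAndCheck(InputStream in) throws IOException {
       DataInputStream data = new DataInputStream(in);
       if (data.readInt() != MAGIC) {
         throw new IOException("not a combined dictionary file");
       }
       int version = data.readInt();
       if (version != CURRENT_VERSION) {
         // bail out instead of silently misreading an incompatible dump
         throw new IOException("dictionary file has version " + version
             + ", but this Lucene version expects " + CURRENT_VERSION);
       }
       return data;
     }
   }
   ```

   Failing fast on the version field is the whole point: the loader never tries to be backwards-compatible, it just tells the user to regenerate the file.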
   
   What I would propose:
   - Add a Gradle task that builds a dictionary package; it should be the same for Nori and Kuromoji, just with different input files.
   - Have the same factory class and the exact same implementation for both dictionaries (I think @mocobeta is working on this). A user should be able to load a single (ZIP-like) file and pass it to the analyzer/tokenizer, and it will automatically be Nori or Kuromoji, no matter what. The API is then very simple: `MorphologicalModel#load(aSingleFileNameOrURLOrInputStream)` (see the sketch after this list).
   - The default tokenizers shipped in Lucene have no custom ctors, so JapaneseTokenizer behind the scenes loads a single Japanese dictionary file from the classpath. Anybody wanting to load any other file will use a generic tokenizer impl; the Japanese one shipped with Lucene uses its default dictionary. Maybe we could also put the tokenizers into their own JAR files (for both Japanese and Korean) and ship the default dictionaries as separate JAR files on Maven Central.
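
   To make the second bullet concrete, here is a minimal sketch of how that single entry point could look. `MorphologicalModel` and every method on it are hypothetical; nothing below exists in Lucene today:

   ```java
   import java.io.IOException;
   import java.io.InputStream;
   import java.net.URL;
   import java.nio.file.Files;
   import java.nio.file.Path;

   // Hypothetical unified loader illustrating the proposed API shape.
   public interface MorphologicalModel {

     // Convenience overloads, all funneling into the stream variant.
     static MorphologicalModel load(Path file) throws IOException {
       try (InputStream in = Files.newInputStream(file)) {
         return load(in);
       }
     }

     static MorphologicalModel load(URL url) throws IOException {
       try (InputStream in = url.openStream()) {
         return load(in);
       }
     }

     static MorphologicalModel load(InputStream in) throws IOException {
       // Sketch: validate the magic/version header as above, read a language
       // tag from it, and dispatch to the Japanese (Kuromoji) or Korean (Nori)
       // implementation. Those implementations are exactly the refactoring
       // this proposal asks for, so the body is elided here.
       throw new UnsupportedOperationException("sketch only");
     }
   }
   ```

   The design point is that user code never names Nori or Kuromoji: the header of the single file decides which model comes back.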
   
   The main disaster is the number of files, which also makes this very error-prone.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

