Hello everyone,

I am writing an application that is supposed to do multilingual word/term lookup in text documents. I have a wordlist (WL) that consists of a set of English terms and the corresponding translations of these terms into other languages. We receive documents that are most likely not in English. I don't have any automatic tools that would detect the language or languages of an incoming document, or segment the doc into language blocks if several languages are present. I have to search for all terms from the WL in each doc. If there is a hit and it's not in English, I need to provide the user with an English translation of it, which I can do by using the links between terms in different languages provided in the WL.
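
To make the data layout concrete, here is a minimal sketch of the WL and of the lookup I need. It is plain Python rather than Lucene, the sample entries and the names WORDLIST, build_reverse_lookup and find_hits are made up for illustration, and the matching is a naive substring scan with no language analysis at all.

```python
from typing import Dict, List, Tuple

# Hypothetical sample wordlist: each entry links an English term
# to its translations, keyed by language code.
WORDLIST: List[Dict[str, str]] = [
    {"en": "contract", "fr": "contrat", "de": "Vertrag"},
    {"en": "invoice", "fr": "facture", "de": "Rechnung"},
]


def build_reverse_lookup(wordlist: List[Dict[str, str]]) -> Dict[Tuple[str, str], str]:
    """Map (language, lowercased term) to the linked English term."""
    lookup: Dict[Tuple[str, str], str] = {}
    for entry in wordlist:
        english = entry["en"]
        for lang, term in entry.items():
            lookup[(lang, term.lower())] = english
    return lookup


def find_hits(text: str, lookup: Dict[Tuple[str, str], str]) -> List[Tuple[str, str, str]]:
    """Naive case-insensitive substring scan over the whole document.

    Returns sorted (language, matched term, English term) tuples.
    """
    lowered = text.lower()
    hits = []
    for (lang, term), english in lookup.items():
        if term in lowered:
            hits.append((lang, term, english))
    return sorted(hits)
```

Calling find_hits on a French sentence that contains "facture" and "contrat" would report both terms together with their English counterparts. A substring scan like this also matches inside longer words, which is one reason I want proper analyzers instead.
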
The question is which side to index: the docs or the WL.

The first option that comes to mind is to index the docs. That is, for each incoming doc that I need to search, I would create N indices, where N is the number of languages appearing in the WL. When creating these indices, I would use the different language analyzers that are available with Lucene. Then I would use each block of WL terms in a given language to search against the corresponding index (i.e. French terms against the French index, Chinese terms against the Chinese index, etc.). I would have to index each incoming document N times, once per language, but I would not have to apply any language analysis tools to the WL entries to segment them, since they are already given as separate terms. At the same time, I would have to make sure I delete the indices once a document has been processed and removed from the system.

But would it make sense to do it the other way around: index the WL and then search each doc against this index (well, these indices, to be precise)? Would I need to perform some sort of language-dependent segmentation of a document before I can do a meaningful search of it against the WL indices? Any other caveats? What do I have to take into account when deciding which direction to go?

Thank you in advance for any info.

Ilya Zavorin
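
P.S. To illustrate the second option (index the WL, then run each doc against it), here is a rough sketch, again in plain Python rather than Lucene. The names tokenize, build_term_index and search_document are hypothetical, and the regex-based tokenizer is only an assumed stand-in for a real language analyzer. The WL terms are stored as token tuples, and every word n-gram of the document is looked up in that table.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

# Assumption: runs of word characters stand in for a real analyzer.
TOKEN = re.compile(r"\w+")


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into runs of word characters."""
    return [t.lower() for t in TOKEN.findall(text)]


def build_term_index(
    entries: List[Tuple[str, str, str]]
) -> Dict[Tuple[str, ...], List[Tuple[str, str]]]:
    """Index WL entries given as (language, term, English term).

    Keys are token tuples, so multi-word terms are supported.
    """
    index: Dict[Tuple[str, ...], List[Tuple[str, str]]] = defaultdict(list)
    for lang, term, english in entries:
        index[tuple(tokenize(term))].append((lang, english))
    return index


def search_document(text, index):
    """Look up every word n-gram of the document in the WL index.

    Returns (token position, matched term, language, English term) tuples.
    """
    tokens = tokenize(text)
    max_len = max((len(key) for key in index), default=0)
    hits = []
    for start in range(len(tokens)):
        for n in range(1, max_len + 1):
            key = tuple(tokens[start:start + n])
            if len(key) < n:
                break  # ran past the end of the document
            for lang, english in index.get(key, []):
                hits.append((start, " ".join(key), lang, english))
    return hits
```

With this tokenizer, a French document splits into words and multi-word terms are found, but a Chinese sentence comes out as one long token, so a two-character Chinese term inside it is never matched. That is the segmentation concern I describe above.
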
