Hello everyone,

I am writing an application that is supposed to do multilingual word/term 
lookup in text documents. Basically, I have a wordlist (WL) that has terms in 
different languages. Actually, the WL consists of a set of English terms and 
then corresponding translations of these terms into other languages. We receive 
documents that are most likely not in English, although I don't have any 
automatic tools that would detect the language or languages of an incoming 
document and/or segment the doc into language blocks if there are several langs 
present. I have to search for all terms from WL in each doc. If there is a hit 
and it's not in English, I need to provide the user with an English translation 
of it - which I can do by using links between terms in different languages 
provided in the WL.

The question is which side to index, the docs or the WL. The 1st option that 
comes to mind is to index the docs. That is, for each incoming doc that I would 
need to search, I would create N indices where N is the number of languages 
appearing in the WL. When creating these indices, I would use the different 
language analyzers that are available with Lucene. Then I would use each block 
of terms from WL in a given language to search against the corresponding index 
(i.e. French terms against the French index, Chinese terms against the Chinese 
index etc). I would have to index each incoming document once per language, 
but I wouldn't have to apply any language analysis tools to the WL entries to 
segment them, since they are already given as separate terms. At the 
same time, I'd have to make sure I delete the indices when a document is 
processed and removed from the system.
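
To make this first option concrete, here is a rough sketch in plain Java. It
does not use Lucene: analyze() is a hypothetical stand-in for a
language-specific analyzer (it only lowercases and splits on non-letters), and
the class and method names are made up for illustration. It also assumes
single-word WL terms, and it lowercases each term, which is the minimum
normalization needed for the terms to match the tokens in the index.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class PerLanguageDocIndex {

    // Hypothetical stand-in for a language-specific analyzer:
    // lowercases with the given locale and splits on non-letter/non-digit runs.
    static List<String> analyze(String text, Locale locale) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(locale).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Throwaway inverted index for one document: token -> token positions.
    static Map<String, List<Integer>> indexDocument(String text, Locale locale) {
        Map<String, List<Integer>> index = new HashMap<>();
        List<String> tokens = analyze(text, locale);
        for (int i = 0; i < tokens.size(); i++) {
            index.computeIfAbsent(tokens.get(i), k -> new ArrayList<>()).add(i);
        }
        return index;
    }

    // wordlist: language -> (term in that language -> English translation).
    // Returns every matched term together with its English translation.
    static Map<String, String> lookup(String doc,
                                      Map<Locale, Map<String, String>> wordlist) {
        Map<String, String> hits = new TreeMap<>();
        for (Map.Entry<Locale, Map<String, String>> lang : wordlist.entrySet()) {
            // One index per language; it is discarded after this iteration.
            Map<String, List<Integer>> index = indexDocument(doc, lang.getKey());
            for (Map.Entry<String, String> term : lang.getValue().entrySet()) {
                String key = term.getKey().toLowerCase(lang.getKey());
                if (index.containsKey(key)) {
                    hits.put(term.getKey(), term.getValue());
                }
            }
        }
        return hits;
    }
}
```

In this sketch the per-document indices live only inside lookup(), so there is
nothing to delete afterwards. The cost is N analysis passes over every
incoming document.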

But would it make sense to do it the other way around and index the WL and then 
search each doc against this index (well, these indices, to be precise)? Would 
I need to perform some sort of language-dependent segmentation of a document 
before I can do a meaningful search of it against the WL indices? Any other 
caveats? What do I have to take into account when deciding which direction to 
go?
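
For comparison, here is an equally rough sketch of the reverse direction, again
in plain Java and not Lucene. The WL is loaded once into a map from term to
English translation, and each document is scanned left to right with a greedy
longest-match lookup. I picked that technique for illustration because it does
not need any language-dependent segmentation of the document; the class name
and methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class WordlistIndex {

    // Lowercased WL term (any language) -> English translation.
    private final Map<String, String> toEnglish = new HashMap<>();
    private int maxTermLength = 0;

    void add(String term, String english) {
        String key = term.toLowerCase(Locale.ROOT);
        toEnglish.put(key, english);
        maxTermLength = Math.max(maxTermLength, key.length());
    }

    // Scan the document character by character. At each offset, try the
    // longest possible candidate first and fall back to shorter ones.
    List<String> scan(String doc) {
        String text = doc.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int matched = 0;
            int longest = Math.min(maxTermLength, text.length() - i);
            for (int len = longest; len > 0; len--) {
                String candidate = text.substring(i, i + len);
                String english = toEnglish.get(candidate);
                if (english != null) {
                    hits.add(candidate + " -> " + english);
                    matched = len;
                    break;
                }
            }
            // Skip past a match, otherwise advance by one character.
            i += (matched > 0) ? matched : 1;
        }
        return hits;
    }
}
```

The caveat with this kind of matching is that it ignores word boundaries, so in
whitespace-delimited languages a term will also match inside a longer word.
Avoiding that would bring back some form of tokenization for those languages.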

Thank you in advance for any info

Ilya Zavorin
