Given the limited needs, I would probably do something like this:

1) Put a language identifier in the UpdateRequestProcessor chain during
indexing and route at least the known problematic languages, such as
Chinese, Japanese, and Arabic, into individual per-language fields.
2) Put everything else together into one field analyzed with
ICUTokenizer, and probably ICUFoldingFilter as well.
3) At the very end of that shared analyzer chain, add a LengthFilter
with some generous maximum, e.g. 25 characters. This ensures that
super-long tokens from languages without word separators, and other
edge conditions, do not break the rest of your system. A rough
configuration sketch follows below.
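For concreteness, here is roughly what that could look like. This is a
sketch, not drop-in config: the source field "text", the whitelist, and
the "text_icu" type name are illustrative, and both the langid processor
(langid contrib) and the ICU classes (analysis-extras contrib) need
their jars on the classpath.

In solrconfig.xml:

<updateRequestProcessorChain name="langid">
  <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <lst name="defaults">
      <!-- Detect language from the "text" field and store the code. -->
      <str name="langid.fl">text</str>
      <str name="langid.langField">language</str>
      <!-- Map "text" to a per-language field, e.g. text_zh, text_ja,
           text_ar; anything not on the whitelist falls back to
           "general", i.e. the shared text_general field. -->
      <bool name="langid.map">true</bool>
      <str name="langid.whitelist">zh,ja,ar</str>
      <str name="langid.fallback">general</str>
    </lst>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

And in schema.xml, the shared type for everything that is not routed to
a per-language field (steps 2 and 3):

<fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <!-- Drop anything longer than 25 chars so runaway tokens from
         non-space languages cannot blow up the index. -->
    <filter class="solr.LengthFilterFactory" min="1" max="25"/>
  </analyzer>
</fieldType>

<field name="text_general" type="text_icu" indexed="true" stored="true"/>

You would still declare the mapped fields (text_zh, text_ja, text_ar)
with language-specific analyzers; only the catch-all text_general uses
the ICU type above.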
Regards,
   Alex.
----
Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 23 February 2015 at 23:14, Walter Underwood <wun...@wunderwood.org> wrote:
> I understand relevancy, stemming etc becomes extremely complicated with
> multilingual support, but our first goal is to be able to tokenize and
> provide basic search capability for any language. Ex: When the document
> contains hello or здравствуйте, the analyzer creates tokens and provides
> exact match search results.