It isn't just complicated; it can be impossible. Do you have content in Chinese or Japanese? Those languages (and some others) do not separate words with spaces. You cannot even do word search without a language-specific, dictionary-based parser.
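To make "dictionary-based" concrete, this is roughly what such a parser looks like in Solr terms. A minimal sketch (the field type name is invented for illustration), using Kuromoji, the dictionary-backed Japanese tokenizer that ships with Solr:

<!-- Hypothetical field type: Kuromoji segments unspaced Japanese text
     using its bundled dictionary; a language-neutral tokenizer cannot. -->
<fieldType name="text_ja_demo" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>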
German is space-separated, except that many noun compounds are written solid, with no spaces. Do you have Finnish content? Entire prepositional phrases turn into word endings. Do you have Arabic content? That is even harder.

If all your content is in space-separated languages that are not heavily inflected, you can do OK with a language-insensitive approach, but it hits the wall pretty fast.

One thing that does work pretty well is trademarked names (LaserJet, Coke, etc.). Those are spelled the same in all languages and usually not inflected.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

On Feb 23, 2015, at 8:00 PM, Rishi Easwaran <rishi.easwa...@aol.com> wrote:

> Hi Alex,
>
> There is no specific language list.
> For example: the documents that need to be indexed are emails or any
> messages for a global customer base. The messages back and forth could be
> in any language, or a mix of languages.
>
> I understand that relevancy, stemming, etc. become extremely complicated
> with multilingual support, but our first goal is to be able to tokenize and
> provide basic search capability for any language. For example: when a
> document contains hello or здравствуйте, the analyzer creates tokens and
> provides exact-match search results.
>
> It would also be great if it could tokenize email addresses
> (e.g. he...@aol.com; I think StandardTokenizer already does this) and
> filenames (здравствуйте.pdf), but maybe we can use filters to accomplish
> that.
>
> Thanks,
> Rishi.
>
> -----Original Message-----
> From: Alexandre Rafalovitch <arafa...@gmail.com>
> To: solr-user <solr-user@lucene.apache.org>
> Sent: Mon, Feb 23, 2015 5:49 pm
> Subject: Re: Basic Multilingual search capability
>
> Which languages are you expecting to deal with? Multilingual support
> is a complex issue. Even if you think you don't need much, it is
> usually a lot more complex than expected, especially around relevancy.
>
> Regards,
>    Alex.
> ----
> Sign up for my Solr resources newsletter at http://www.solr-start.com/
>
> On 23 February 2015 at 16:19, Rishi Easwaran <rishi.easwa...@aol.com> wrote:
>> Hi All,
>>
>> For our use case we don't really need to do a lot of manipulation of
>> incoming text during index time: at most, removing common stop words and
>> tokenizing emails/filenames if possible. We get text documents from our
>> end users, which can be in any language (sometimes a combination), and we
>> cannot determine the language of the incoming text. Language detection at
>> index time is not necessary.
>>
>> Which analyzer is recommended to achieve basic multilingual search
>> capability for a use case like this?
>> I have read a bunch of posts about using a combination of StandardTokenizer
>> or ICUTokenizer, LowerCaseFilter, and ReversedWildcardFilterFactory, but I
>> am looking for ideas, suggestions, and best practices.
>>
>> http://lucene.472066.n3.nabble.com/ICUTokenizer-or-StandardTokenizer-or-for-quot-text-all-quot-type-field-that-might-include-non-whitess-td4142727.html#a4144236
>> http://lucene.472066.n3.nabble.com/How-to-implement-multilingual-word-components-fields-schema-td4157140.html#a4158923
>> https://issues.apache.org/jira/browse/SOLR-6492
>>
>> Thanks,
>> Rishi.
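For reference, here is a minimal schema sketch of the combination discussed in the linked threads: ICUTokenizer, lowercasing, and ReversedWildcardFilterFactory on the index side (so leading-wildcard queries such as *.pdf stay cheap). The field type name is invented, the filter attribute values follow the Solr example schema, and the ICU factories require the analysis-extras/ICU jars on the classpath:

<!-- Hypothetical language-insensitive catch-all field type -->
<fieldType name="text_multi" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Unicode-aware segmentation, including dictionary-based CJK/Thai -->
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Also index reversed tokens so leading wildcards can be rewritten -->
    <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
            maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

One caveat on the email-address question: since Lucene 3.1, StandardTokenizer (and ICUTokenizer) split he...@aol.com at the @; the old grammar that kept emails whole lives on in ClassicTokenizer. If email addresses must survive as single tokens, solr.UAX29URLEmailTokenizerFactory, which keeps URLs and email addresses intact, is the stock alternative.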