On 4/12/06, Bertrand Delacretaz <[EMAIL PROTECTED]> wrote:
> Hi Solr users,
>
> I'm investigating indexers for a project, played a bit with both Solr
> and Nutch recently, and the Solr "RESTful indexing component" concept
> fits our needs quite well.
>
> Before I dig too deep, are there any known limitations w.r.t indexing
> of non-english text?

Nothing inherent. Most external interfaces with Solr are UTF-8...
stopword-lists, synonym lists, query responses, etc.

> The project that I'm looking at is currently single-language
> (French), which I assume can be handled by static configuration of
> the appropriate analyzers.

Yes, with a little bit of work (making a Solr Filter Factory or
Tokenizer factory) you can use any Lucene filter, tokenizer, or
analyzer. There is currently a SnowballPorterFilterFactory, but it's
hard-coded to "English". That should be changed.

> But we might have to make sure we can handle multiple languages
> cleanly in a single index before making a final decision on which
> indexer to use, as here in Switzerland we very often have to handle
> multiple languages.

Would you need to index multiple languages in the same field? That
could be trickier, and it seems like you would need an analyzer that
supported that.

-Yonik
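
To illustrate the point above about the stemmer factory being hard-coded to "English", here is a rough sketch of what a configurable factory could look like. This is not Solr's API: `StemFilterFactory`, its `language` argument, and the toy suffix lists are all hypothetical, written in plain Python so the idea can be run without Lucene. The suffix stripping is a crude stand-in for a real Snowball stemmer.

```python
from typing import Callable, Dict, Iterable, List

# Toy suffix lists standing in for real Snowball stemmers.
# Assumption: these are illustrative only and far cruder than Snowball's rules.
_SUFFIXES: Dict[str, List[str]] = {
    "English": ["ing", "ed", "s"],
    "French": ["ement", "es", "s"],
}


def _make_stemmer(language: str) -> Callable[[str], str]:
    """Return a stemming function for the given language name."""
    if language not in _SUFFIXES:
        raise ValueError(f"no stemmer configured for language {language!r}")
    suffixes = _SUFFIXES[language]

    def stem(token: str) -> str:
        # Strip the first matching suffix, keeping at least three characters.
        for suffix in suffixes:
            if token.endswith(suffix) and len(token) - len(suffix) >= 3:
                return token[: -len(suffix)]
        return token

    return stem


class StemFilterFactory:
    """Hypothetical factory: the language comes from configuration, not code."""

    def __init__(self, args: Dict[str, str]) -> None:
        # Fall back to "English" so existing configurations keep working.
        self.language = args.get("language", "English")
        self._stem = _make_stemmer(self.language)

    def create(self, tokens: Iterable[str]) -> List[str]:
        """Lowercase and stem each token from the upstream token stream."""
        return [self._stem(token.lower()) for token in tokens]


if __name__ == "__main__":
    french = StemFilterFactory({"language": "French"})
    print(french.create(["Rapidement", "langues"]))
```

With something along these lines, a French-only index would only need a different `language` value in the field type's configuration, while an unconfigured factory would still behave as the English one does today.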