Re: Indexing of non-english text with Solr, any known limitations?

Yonik Seeley Wed, 12 Apr 2006 07:46:13 -0700

On 4/12/06, Bertrand Delacretaz <[EMAIL PROTECTED]> wrote:
> Hi Solr users,
>
> I'm investigating indexers for a project, played a bit with both Solr
> and Nutch recently, and the Solr "RESTful indexing component" concept
> fits our needs quite well.
>
> Before I dig too deep, are there any known limitations w.r.t indexing
> of non-english text?


Nothing inherent.  Most of Solr's external interfaces use UTF-8:
stopword lists, synonym lists, query responses, etc.
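
As a rough illustration of the UTF-8 point, here is a minimal Python
sketch that builds an XML add message containing French text and
serializes it as UTF-8 bytes.  The field names ("id", "text") and the
helper name build_add_message are assumptions for the example; the
fields must match whatever your schema defines, and the update URL is
left out because it depends on your deployment.

```python
import xml.etree.ElementTree as ET


def build_add_message(doc_fields):
    """Serialize one document as an <add><doc>...</doc></add> message.

    doc_fields maps field names to string values. The field names are
    illustrative and must exist in the target schema.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in doc_fields.items():
        field = ET.SubElement(doc, "field", {"name": name})
        field.text = value
    # encoding="utf-8" makes tostring() return bytes with non-ASCII
    # characters written as raw UTF-8 rather than character references.
    return ET.tostring(add, encoding="utf-8")


french_text = "Les élèves étudient à Genève"
payload = build_add_message({"id": "doc-1", "text": french_text})
```

The resulting bytes can be sent as the body of an HTTP POST with a
Content-Type that declares charset=utf-8.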

> The project that I'm looking at is currently single-language
> (French), which I assume can be handled by static configuration of
> the appropriate analyzers.

Yes.  With a little work (writing a Solr filter factory or tokenizer
factory), you can use any Lucene filter, tokenizer, or analyzer.

There is currently a SnowballPorterFilterFactory, but it's hard-coded
to "English".  That should be changed.

> But we might have to make sure we can handle multiple languages
> cleanly in a single index before making a final decision on which
> indexer to use, as here in Switzerland we very often have to handle
> multiple languages.

Would you need to index multiple languages in the same field?  That
could be trickier, and it seems like you would need an analyzer that
supported that.
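
If the language of each document is known at index time, one way to
avoid mixing languages in a single field is to route the text to a
per-language field, each with its own analyzer.  The sketch below only
shows the routing on the client side; the field names (text_fr,
text_de, text_it, text_general) and the helper names are hypothetical
and would have to be defined in the schema.

```python
# Hypothetical mapping from language code to field name. Each field is
# assumed to be configured with a language-appropriate analyzer.
LANGUAGE_FIELDS = {
    "fr": "text_fr",
    "de": "text_de",
    "it": "text_it",
}

# Fallback field for languages without a dedicated analyzer (assumed).
DEFAULT_FIELD = "text_general"


def field_for(language):
    """Return the field name to use for the given language code."""
    return LANGUAGE_FIELDS.get(language.lower(), DEFAULT_FIELD)


def to_solr_fields(doc_id, text, language):
    """Build the field dictionary for one single-language document."""
    return {"id": doc_id, field_for(language): text}


french_doc = to_solr_fields("1", "Bonjour tout le monde", "fr")
german_doc = to_solr_fields("2", "Guten Tag zusammen", "de")
```

Queries would then need to target the field for the language being
searched, or several fields at once.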

-Yonik
