Peter,

Thanks for your help. I will prototype your solution and see if it
makes sense for me.

Eli

On Mon, May 5, 2008 at 5:38 PM, Binkley, Peter
<[EMAIL PROTECTED]> wrote:
> It won't make much difference to the index size, since you'll only be
>  populating one of the language fields for each document, and empty
>  fields cost nothing. Performance may suffer a bit, but Lucene may
>  surprise you with how good it is with that kind of boolean query.
>
>  I agree that as the number of fields and languages increases, this is
>  going to become a lot to manage. But you're up against some basic
>  problems when you try to model this in Solr: for each token, you care
>  not just about its value (which is all Lucene cares about) but also
>  about its language and its stem; the stem for a given token depends on
>  the language (different stemming rules); and at query time you may not
>  know the language. I don't think you're going to get a solution without
>  some redundancy, but solving problems by adding redundant fields is a
>  common technique in Solr.
>
>
>  Peter
>
>
>  -----Original Message-----
>  From: Eli K [mailto:[EMAIL PROTECTED]
>  Sent: Monday, May 05, 2008 2:28 PM
>  To: solr-user@lucene.apache.org
>  Subject: Re: multi-language searching with Solr
>
>  Wouldn't this impact both indexing and search performance, as well as
>  the size of the index?
>  It is also probable that I will have more than one free-text field
>  later on, and with at least 20 languages this approach does not seem
>  very manageable.  Are there other options for making this work with
>  stemming?
>
>  Thanks,
>
>  Eli
>
>
>  On Mon, May 5, 2008 at 3:41 PM, Binkley, Peter
>  <[EMAIL PROTECTED]> wrote:
>  > I think you would have to declare a separate field for each language
>  > (freetext_en, freetext_fr, etc.), each with its own appropriate
>  > stemming. Your ingestion process would have to assign the free text
>  > content for each document to the appropriate field; so, for each
>  > document, only one of the freetext fields would be populated. At
>  > search time, you would either search against the appropriate field if
>  > you know the search language, or search across them with
>  > "freetext_fr:query OR freetext_en:query OR ...". That way your query
>  > will be interpreted by each language field using that language's
>  > stemming rules.
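>  >
>  >  In schema.xml that would look something like this (the exact filter
>  >  class names may differ in your Solr version, and the Snowball filter
>  >  is just one way to get per-language stemming):
>  >
>  >  <!-- one field type per language: same analysis chain except the stemmer -->
>  >  <fieldtype name="text_en" class="solr.TextField" positionIncrementGap="100">
>  >    <analyzer>
>  >      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>  >      <filter class="solr.LowerCaseFilterFactory"/>
>  >      <filter class="solr.SnowballPorterFilterFactory" language="English"/>
>  >    </analyzer>
>  >  </fieldtype>
>  >  <fieldtype name="text_fr" class="solr.TextField" positionIncrementGap="100">
>  >    <analyzer>
>  >      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>  >      <filter class="solr.LowerCaseFilterFactory"/>
>  >      <filter class="solr.SnowballPorterFilterFactory" language="French"/>
>  >    </analyzer>
>  >  </fieldtype>
>  >
>  >  <!-- one field per language; only one is populated for any given document -->
>  >  <field name="freetext_en" type="text_en" indexed="true" stored="true"/>
>  >  <field name="freetext_fr" type="text_fr" indexed="true" stored="true"/>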
>  >
>  >  Other options for combining indexes, such as copyfield or dynamic
>  >  fields (see http://wiki.apache.org/solr/SchemaXml), would lead to a
>  >  single field type and therefore a single type of stemming. You could
>  >  always use copyfield to create an unstemmed common index, if you
>  >  don't care about stemming when you search across languages (since
>  >  you're likely to get odd results when a query in one language is
>  >  stemmed according to the rules of another language).
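>  >
>  >  A sketch of that, assuming a catch-all field called freetext_all
>  >  (the name and the unstemmed analysis chain are just illustrative):
>  >
>  >  <fieldtype name="text_unstemmed" class="solr.TextField" positionIncrementGap="100">
>  >    <analyzer>
>  >      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>  >      <filter class="solr.LowerCaseFilterFactory"/>
>  >    </analyzer>
>  >  </fieldtype>
>  >
>  >  <!-- common unstemmed index for searching across all languages -->
>  >  <field name="freetext_all" type="text_unstemmed" indexed="true"
>  >         stored="false" multiValued="true"/>
>  >
>  >  <copyField source="freetext_en" dest="freetext_all"/>
>  >  <copyField source="freetext_fr" dest="freetext_all"/>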
>  >
>  >  Peter
>  >
>  >
>  >
>  >  -----Original Message-----
>  >  From: Eli K [mailto:[EMAIL PROTECTED]
>  >  Sent: Monday, May 05, 2008 8:27 AM
>  >  To: solr-user@lucene.apache.org
>  >  Subject: multi-language searching with Solr
>  >
>  >  Hello folks,
>  >
>  >  Let me start by saying that I am new to Lucene and Solr.
>  >
>  >  I am in the process of designing a search back-end for a system that
>  >  receives 20k documents a day and needs to keep them available for 30
>  >  days.  The documents should be searchable on a free text field and on
>  >  about 8 other fields.
>  >
>  >  One of my requirements is to index and search documents in multiple
>  >  languages.  I would like to have the ability to stem and provide the
>  >  advanced search features that are based on it.  This will only affect
>  >  the free text field because the rest of the fields are in English.
>  >
>  >  I can find out the language of the document before indexing and I
>  >  might be able to provide the language to search on.  I also need to
>  >  have the ability to search across all indexed languages (there will
>  >  be 20 in total).
>  >
>  >  Given these requirements, do you think this is doable with Solr?  A
>  >  major limiting factor is that I need to stick to the 1.2 GA version,
>  >  and I cannot utilize the multi-core features in the 1.3 trunk.
>  >
>  >  I considered writing my own analyzer that would call the appropriate
>  >  Lucene analyzer for the given language, but I did not see any way for
>  >  it to access the field that specifies the language of the document.
>  >
>  >  Thanks,
>  >
>  >  Eli
>  >
>  >  p.s. I am looking for an experienced Lucene/Solr consultant to help
>  >  with the design of this system.
>  >
>  >
>
>
