Peter,

Thanks for your help. I will prototype your solution and see if it makes sense for me.

Eli

On Mon, May 5, 2008 at 5:38 PM, Binkley, Peter <[EMAIL PROTECTED]> wrote:
> It won't make much difference to the index size, since you'll only be
> populating one of the language fields for each document, and empty
> fields cost nothing. The performance may suffer a bit but Lucene may
> surprise you with how good it is with that kind of boolean query.
>
> I agree that as the number of fields and languages increases, this is
> going to become a lot to manage. But you're up against some basic
> problems when you try to model this in Solr: for each token, you care
> about not just its value (which is all Lucene cares about) but also its
> language and its stem; and the stem for a given token depends on the
> language (different stemming rules); and at query time you may not know
> the language. I don't think you're going to get a solution without some
> redundancy; but solving problems by adding redundant fields is a common
> method in Solr.
>
> Peter
>
> -----Original Message-----
> From: Eli K [mailto:[EMAIL PROTECTED]
> Sent: Monday, May 05, 2008 2:28 PM
> To: solr-user@lucene.apache.org
> Subject: Re: multi-language searching with Solr
>
> Wouldn't this impact both indexing and search performance and the size
> of the index?
> It is also probable that I will have more than one free text field
> later on and with at least 20 languages this approach does not seem very
> manageable. Are there other options for making this work with stemming?
>
> Thanks,
>
> Eli
>
> On Mon, May 5, 2008 at 3:41 PM, Binkley, Peter <[EMAIL PROTECTED]> wrote:
> > I think you would have to declare a separate field for each language
> > (freetext_en, freetext_fr, etc.), each with its own appropriate
> > stemming. Your ingestion process would have to assign the free text
> > content for each document to the appropriate field; so, for each
> > document, only one of the freetext fields would be populated. At
> > search time, you would either search against the appropriate field if
> > you know the search language, or search across them with
> > "freetext_fr:query OR freetext_en:query OR ...". That way your query
> > will be interpreted by each language field using that language's
> > stemming rules.
> >
> > Other options for combining indexes, such as copyfield or dynamic
> > fields (see http://wiki.apache.org/solr/SchemaXml), would lead to a
> > single field type and therefore a single type of stemming. You could
> > always use copyfield to create an unstemmed common index, if you
> > don't care about stemming when you search across languages (since
> > you're likely to get odd results when a query in one language is
> > stemmed according to the rules of another language).
> >
> > Peter
> >
> > -----Original Message-----
> > From: Eli K [mailto:[EMAIL PROTECTED]
> > Sent: Monday, May 05, 2008 8:27 AM
> > To: solr-user@lucene.apache.org
> > Subject: multi-language searching with Solr
> >
> > Hello folks,
> >
> > Let me start by saying that I am new to Lucene and Solr.
> >
> > I am in the process of designing a search back-end for a system that
> > receives 20k documents a day and needs to keep them available for 30
> > days. The documents should be searchable on a free text field and on
> > about 8 other fields.
> >
> > One of my requirements is to index and search documents in multiple
> > languages. I would like to have the ability to stem and provide the
> > advanced search features that are based on it. This will only affect
> > the free text field because the rest of the fields are in English.
> >
> > I can find out the language of the document before indexing and I
> > might be able to provide the language to search on. I also need to
> > have the ability to search across all indexed languages (there will
> > be 20 in total).
> >
> > Given these requirements do you think this is doable with Solr? A
> > major limiting factor is that I need to stick to the 1.2 GA version
> > and I cannot utilize the multi-core features in the 1.3 trunk.
> >
> > I considered writing my own analyzer that will call the appropriate
> > Lucene analyzer for the given language but I did not see any way for
> > it to access the field that specifies the language of the document.
> >
> > Thanks,
> >
> > Eli
> >
> > p.s. I am looking for an experienced Lucene/Solr consultant to help
> > with the design of this system.
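
For anyone prototyping the approach discussed in this thread (one freetext_<lang> field per language, with the fields OR-ed together when the query language is unknown), here is a minimal client-side sketch in Python. It only builds field names and a query string and does not talk to Solr. The function names (field_for, route_document, cross_language_query), the two-letter language codes, and the set of characters being escaped are all assumptions made for illustration; none of them come from the thread or from a Solr API.

```python
from typing import Dict, Iterable

# Characters treated as special by the classic Lucene query syntax.
# Assumption: escaping them one character at a time is enough for a prototype.
SPECIAL_CHARS = set('+-!(){}[]^"~*?:\\&|')


def field_for(language: str, base: str = "freetext") -> str:
    """Return the per-language field name, e.g. 'freetext_en' (hypothetical naming)."""
    return f"{base}_{language}"


def route_document(doc: Dict[str, str], language: str, text: str) -> Dict[str, str]:
    """Return a copy of doc with the free text stored in exactly one language field."""
    routed = dict(doc)
    routed[field_for(language)] = text
    return routed


def escape_query(text: str) -> str:
    """Backslash-escape query-syntax characters so user input is taken literally."""
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in text)


def cross_language_query(query: str, languages: Iterable[str]) -> str:
    """Build 'freetext_fr:(q) OR freetext_en:(q) OR ...' across the given languages."""
    escaped = escape_query(query)
    return " OR ".join(f"{field_for(lang)}:({escaped})" for lang in languages)


if __name__ == "__main__":
    print(route_document({"id": "1"}, "fr", "bonjour le monde"))
    print(cross_language_query("running shoes", ["fr", "en"]))
```

When the search language is known, the same helper can be called with a single-element list, which yields a query against just that one field. Each freetext_<lang> field would still need its own field type with the matching stemmer declared in the schema, as described above.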