I searched the Solr list, but not the Lucene list as thoroughly. I will look again to see if there is something there that might work with Solr. I would rather leverage Solr, but if I have no choice I will do this using Lucene only.

Thanks,
Eli

On Mon, May 5, 2008 at 4:58 PM, Erick Erickson <[EMAIL PROTECTED]> wrote:
> You might want to bounce over to the Lucene user's list and search
> for language. This topic has arisen many times and there's some good
> discussion. And have you searched the solr users list for "language"? I
> know it's turned up here as well.
>
> Best
> Erick
>
> On Mon, May 5, 2008 at 4:28 PM, Eli K <[EMAIL PROTECTED]> wrote:
>
> > Wouldn't this impact both indexing and search performance and the size
> > of the index?
> > It is also probable that I will have more than one free text field
> > later on and with at least 20 languages this approach does not seem
> > very manageable. Are there other options for making this work with
> > stemming?
> >
> > Thanks,
> >
> > Eli
> >
> > On Mon, May 5, 2008 at 3:41 PM, Binkley, Peter
> > <[EMAIL PROTECTED]> wrote:
> > > I think you would have to declare a separate field for each language
> > > (freetext_en, freetext_fr, etc.), each with its own appropriate
> > > stemming. Your ingestion process would have to assign the free text
> > > content for each document to the appropriate field; so, for each
> > > document, only one of the freetext fields would be populated. At search
> > > time, you would either search against the appropriate field if you know
> > > the search language, or search across them with "freetext_fr:query OR
> > > freetext_en:query OR ...". That way your query will be interpreted by
> > > each language field using that language's stemming rules.
> > >
> > > Other options for combining indexes, such as copyfield or dynamic fields
> > > (see http://wiki.apache.org/solr/SchemaXml), would lead to a single
> > > field type and therefore a single type of stemming. You could always use
> > > copyfield to create an unstemmed common index, if you don't care about
> > > stemming when you search across languages (since you're likely to get
> > > odd results when a query in one language is stemmed according to the
> > > rules of another language).
> > >
> > > Peter
> > >
> > > -----Original Message-----
> > > From: Eli K [mailto:[EMAIL PROTECTED]
> > > Sent: Monday, May 05, 2008 8:27 AM
> > > To: solr-user@lucene.apache.org
> > > Subject: multi-language searching with Solr
> > >
> > > Hello folks,
> > >
> > > Let me start by saying that I am new to Lucene and Solr.
> > >
> > > I am in the process of designing a search back-end for a system that
> > > receives 20k documents a day and needs to keep them available for 30
> > > days. The documents should be searchable on a free text field and on
> > > about 8 other fields.
> > >
> > > One of my requirements is to index and search documents in multiple
> > > languages. I would like to have the ability to stem and provide the
> > > advanced search features that are based on it. This will only affect
> > > the free text field because the rest of the fields are in English.
> > >
> > > I can find out the language of the document before indexing and I might
> > > be able to provide the language to search on. I also need to have the
> > > ability to search across all indexed languages (there will be 20 in
> > > total).
> > >
> > > Given these requirements do you think this is doable with Solr? A major
> > > limiting factor is that I need to stick to the 1.2 GA version and I
> > > cannot utilize the multi-core features in the 1.3 trunk.
> > >
> > > I considered writing my own analyzer that will call the appropriate
> > > Lucene analyzer for the given language but I did not see any way for it
> > > to access the field that specifies the language of the document.
> > >
> > > Thanks,
> > >
> > > Eli
> > >
> > > p.s. I am looking for an experienced Lucene/Solr consultant to help with
> > > the design of this system.
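
For reference, the per-language field approach Peter describes in the quoted message could be sketched roughly as follows. This is only an illustration under stated assumptions: the freetext_<lang> field naming comes from his message, but the helper names (field_for, build_doc, build_query), the "id" key, and the three-language list are hypothetical, and the sketch only builds the document dictionary and the query string. It does not talk to Solr and does not escape special query characters, so it assumes a single plain search term.

```python
# Illustrative subset; the thread mentions 20 languages in total.
LANGUAGES = ["en", "fr", "de"]


def field_for(lang):
    """Name of the stemmed free text field for one language, e.g. freetext_fr."""
    return f"freetext_{lang}"


def build_doc(doc_id, lang, text):
    """Build a document in which only the field for its own language is populated."""
    if lang not in LANGUAGES:
        raise ValueError(f"unsupported language: {lang}")
    return {"id": doc_id, field_for(lang): text}


def build_query(term, lang=None):
    """Query one language field if the language is known, else OR across all of them."""
    if lang is not None:
        return f"{field_for(lang)}:{term}"
    return " OR ".join(f"{field_for(code)}:{term}" for code in LANGUAGES)


if __name__ == "__main__":
    print(build_doc("doc-1", "fr", "bonjour le monde"))
    print(build_query("maison", "fr"))
    print(build_query("maison"))
```

With 20 languages the cross-language query grows to 20 OR clauses per free text field, which is the manageability concern raised earlier in the thread; the sketch does nothing to address that.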