multi-language searching with Solr
Hello folks,

Let me start by saying that I am new to Lucene and Solr.

I am in the process of designing a search back-end for a system that
receives 20k documents a day and needs to keep them available for 30
days. The documents should be searchable on a free text field and on
about 8 other fields.

One of my requirements is to index and search documents in multiple
languages. I would like to have the ability to stem and to provide the
advanced search features that are based on it. This only affects the
free text field, because the rest of the fields are in English.

I can find out the language of the document before indexing, and I might
be able to provide the language to search on. I also need the ability to
search across all indexed languages (there will be 20 in total).

Given these requirements, do you think this is doable with Solr? A major
limiting factor is that I need to stick to the 1.2 GA version and cannot
utilize the multi-core features in the 1.3 trunk.

I considered writing my own analyzer that would call the appropriate
Lucene analyzer for the given language, but I did not see any way for it
to access the field that specifies the language of the document.

Thanks,

Eli

p.s. I am looking for an experienced Lucene/Solr consultant to help with
the design of this system.
Re: multi-language searching with Solr
Wouldn't this impact both indexing and search performance, and the size
of the index? It is also probable that I will have more than one free
text field later on, and with at least 20 languages this approach does
not seem very manageable. Are there other options for making this work
with stemming?

Thanks,

Eli

On Mon, May 5, 2008 at 3:41 PM, Binkley, Peter <[EMAIL PROTECTED]> wrote:
> I think you would have to declare a separate field for each language
> (freetext_en, freetext_fr, etc.), each with its own appropriate
> stemming. Your ingestion process would have to assign the free text
> content for each document to the appropriate field; so, for each
> document, only one of the freetext fields would be populated. At search
> time, you would either search against the appropriate field if you know
> the search language, or search across them with "freetext_fr:query OR
> freetext_en:query OR ...". That way your query will be interpreted by
> each language field using that language's stemming rules.
>
> Other options for combining indexes, such as copyfield or dynamic fields
> (see http://wiki.apache.org/solr/SchemaXml), would lead to a single
> field type and therefore a single type of stemming. You could always use
> copyfield to create an unstemmed common index, if you don't care about
> stemming when you search across languages (since you're likely to get
> odd results when a query in one language is stemmed according to the
> rules of another language).
>
> Peter
>
> -----Original Message-----
> From: Eli K [mailto:[EMAIL PROTECTED]
> Sent: Monday, May 05, 2008 8:27 AM
> To: solr-user@lucene.apache.org
> Subject: multi-language searching with Solr
>
> [...]
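Peter's per-language-field approach could be sketched as a schema.xml
fragment along the following lines. This is illustrative only: the type
and field names are invented for this sketch, and the exact filter
factory names and attributes should be checked against the Solr 1.2
documentation before use.

```xml
<!-- Sketch: one stemmed field type per language -->
<fieldType name="text_en" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldType>
<fieldType name="text_fr" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="French"/>
  </analyzer>
</fieldType>

<!-- One field per supported language; the ingester populates exactly
     one of these per document, based on the detected language. -->
<field name="freetext_en" type="text_en" indexed="true" stored="true"/>
<field name="freetext_fr" type="text_fr" indexed="true" stored="true"/>
<!-- ... and so on for the remaining languages ... -->
```

With 20 languages this means 20 field declarations (and 20 field types),
which is verbose but mechanical to generate.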
Re: multi-language searching with Solr
I searched the Solr list but not as much the Lucene list. I will look
again to see if there is something there that might work with Solr. I'd
rather leverage Solr, but if I have no choice I will do this using
Lucene only.

Thanks,

Eli

On Mon, May 5, 2008 at 4:58 PM, Erick Erickson <[EMAIL PROTECTED]> wrote:
> You might want to bounce over to the Lucene users' list and search
> for "language". This topic has arisen many times and there's some good
> discussion. And have you searched the Solr users list for "language"? I
> know it's turned up here as well.
>
> Best
> Erick
>
> On Mon, May 5, 2008 at 4:28 PM, Eli K <[EMAIL PROTECTED]> wrote:
> > Wouldn't this impact both indexing and search performance and the size
> > of the index?
> > It is also probable that I will have more than one free text field
> > later on and with at least 20 languages this approach does not seem
> > very manageable. Are there other options for making this work with
> > stemming?
> >
> > [...]
Re: multi-language searching with Solr
Peter,

Thanks for your help, I will prototype your solution and see if it makes
sense for me.

Eli

On Mon, May 5, 2008 at 5:38 PM, Binkley, Peter <[EMAIL PROTECTED]> wrote:
> It won't make much difference to the index size, since you'll only be
> populating one of the language fields for each document, and empty
> fields cost nothing. The performance may suffer a bit, but Lucene may
> surprise you with how good it is with that kind of boolean query.
>
> I agree that as the number of fields and languages increases, this is
> going to become a lot to manage. But you're up against some basic
> problems when you try to model this in Solr: for each token, you care
> about not just its value (which is all Lucene cares about) but also its
> language and its stem; and the stem for a given token depends on the
> language (different stemming rules); and at query time you may not know
> the language. I don't think you're going to get a solution without some
> redundancy; but solving problems by adding redundant fields is a common
> method in Solr.
>
> Peter
>
> -----Original Message-----
> From: Eli K [mailto:[EMAIL PROTECTED]
> Sent: Monday, May 05, 2008 2:28 PM
> To: solr-user@lucene.apache.org
> Subject: Re: multi-language searching with Solr
>
> [...]
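The cross-language boolean query Peter describes ("freetext_fr:query OR
freetext_en:query OR ...") is easy to build programmatically. A minimal
sketch, assuming the `freetext_` field-naming convention from his
example (the class and method names here are invented, not any Solr
API):

```java
import java.util.List;
import java.util.stream.Collectors;

public class CrossLanguageQuery {

    /**
     * Builds a Solr query string that ORs the user's query across one
     * stemmed field per language, so each field applies its own
     * language's stemming rules at query time.
     */
    static String build(String userQuery, List<String> languages) {
        return languages.stream()
                .map(lang -> "freetext_" + lang + ":(" + userQuery + ")")
                .collect(Collectors.joining(" OR "));
    }

    public static void main(String[] args) {
        System.out.println(build("renard", List.of("en", "fr", "de")));
        // freetext_en:(renard) OR freetext_fr:(renard) OR freetext_de:(renard)
    }
}
```

With 20 languages this produces a 20-clause boolean query; as Peter
notes, Lucene often handles such queries better than one might expect,
but it is worth benchmarking.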
Re: multi-language searching with Solr
Gereon,

I think that you must have the same schema on each shard, but I am not
sure if it must also have the same analyzers. These are shards of one
index, not multiple indexes. There is probably a way to get each shard
to contain one language, but then you end up with x servers for x
languages, and some will be underutilized while others will be
overutilized. Add to that fail-over and fault tolerance and you end up
with a maintenance nightmare. Also, how would you scale this? Of course
I am still pretty new to search and Solr/Lucene, so I might be wrong :)

The solutions suggested by Peter and Mike (different fields per
language, or prefixing the language string to every term) are starting
to look better and better. Is it possible to write an analyzer wrapper
that is also aware of the locale field in the document and delegates
processing to the appropriate analyzer?

Thanks,

Eli

On Wed, May 7, 2008 at 3:46 PM, Gereon Steffens <[EMAIL PROTECTED]> wrote:
> I have the same requirement, and from what I understand the distributed
> search feature will help implement this, by having one shard per
> language. Am I right?
>
> Gereon
>
> Mike Klaas wrote:
> > On 5-May-08, at 1:28 PM, Eli K wrote:
> > > Wouldn't this impact both indexing and search performance and the size
> > > of the index?
> > > [...]
> >
> > If you want stemming, then you have to execute one query per language
> > anyway, since the stemming will be different in every language.
> >
> > This is a fundamental requirement: you somehow need to track the language
> > of every token if you want correct multi-language stemming. The easiest way
> > to do this would be to split each language into its own field. But there
> > are other options: you could prefix every indexed token with the language:
> >
> > en:The en:quick en:brown en:fox en:jumped ...
> > fr:Le fr:brun fr:renard fr:vite fr:a fr:sauté ...
> >
> > Separate fields seems easier to me, though.
> >
> > -Mike
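Mike's token-prefixing idea would, in a real deployment, be implemented
as a Lucene TokenFilter placed after the tokenizer and stemmer. The
sketch below only illustrates the transformation itself in plain Java,
with invented names and naive whitespace tokenization; it is not the
Lucene analysis API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class LanguagePrefixer {

    /**
     * Prefixes each whitespace-separated token with "lang:" so that a
     * single indexed field can hold tokens from several languages side
     * by side, as Mike suggests. Lowercasing stands in for the rest of
     * the analysis chain (stemming etc.) that would normally run first.
     */
    static List<String> prefix(String text, String lang) {
        return Arrays.stream(text.toLowerCase().split("\\s+"))
                .filter(token -> !token.isEmpty())
                .map(token -> lang + ":" + token)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(prefix("The quick brown fox", "en"));
        // [en:the, en:quick, en:brown, en:fox]
    }
}
```

Query terms would need the same prefixing at search time, using either
the known query language or, for cross-language search, one prefixed
variant per language ORed together.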
Re: multi-language searching with Solr
If you only have 2 languages this approach might work for you. This is
not something I would consider with the number of languages I need to
support.

Eli

On Thu, May 8, 2008 at 5:51 AM, Gereon Steffens <[EMAIL PROTECTED]> wrote:
> > These are shards of one index and not multiple indexes. There is
> > probably a way to get each shard to contain one language but then you
> > end up with x servers for x languages, and some will be underutilized
> > while others will be overutilized.
>
> Schemas will be identical, except for analysers. The language distribution
> I'm dealing with is about 60% German, 40% English. For availability reasons,
> each shard needs to run on at least two instances anyway, with a load
> balancer in front, so I think I'll be able to adjust utilization that way.
>
> Gereon