Hi Ted Dunning,

Thank you for your reply. I understand your point about putting a "language_s" field on each document and keeping all the files together, which speeds up searching.
But this raises a problem with the choice of analyzer during indexing. I assume files written in different languages should be handled by different analyzers, i.e. different tokenizers and filters. Could you elaborate a little on the design you propose, especially on how files in different languages can be handled during indexing?

Thank you.

Best Regards
Ni, Bing

> From: ted.dunn...@gmail.com
> Date: Fri, 20 Jan 2012 03:55:48 +0000
> Subject: Re: Tika0.10 language identifier in Solr3.5.0
> To: solr-user@lucene.apache.org
>
> Normally this is done by putting a field on each document rather than
> separating the documents into separate corpora. Keeping them together
> makes the final search faster.
>
> At query time, you can add all of the language keys that you think are
> relevant based on your language id applied to the query and then group the
> results on those keys so that users can inspect different language results.
> If you require the correct language key, you should get pretty good
> retrieval speed.
>
> On Fri, Jan 20, 2012 at 3:35 AM, nibing <nibing_...@hotmail.com> wrote:
>
> > Hi, Jan Høydahl
> >
> > You are right. I am hoping to detect the language of a query, so that
> > the searching can be done according to the language detected. Since
> > people often type only a few words, which is too few to detect, it is
> > hard to do that.
> >
> > Let me describe a little bit about the Solr server in my design. It
> > consists of several cores, corresponding to the several languages,
> > which are built during indexing. Since language detection in indexing
> > can be done with the Tika identifier, we are currently OK. But the
> > problem is about searching. I want to do language detection first,
> > before searching in the individual cores. In the case that the
> > detection result is ambiguous and several languages are returned, we
> > probably return a set of results, and let the user decide which
> > language set of results they want to look into. In general, it is just
> > the same with the language supported by Google.
> >
> > Do you have some suggestions if I want to achieve multilingual search
> > as described above?
> >
> > Thank you.
> >
> > Best Regards
> > Ni, Bing
> >
> > > Subject: Re: Tika0.10 language identifier in Solr3.5.0
> > > From: jan....@cominvent.com
> > > Date: Thu, 19 Jan 2012 12:31:01 +0100
> > > To: solr-user@lucene.apache.org
> > >
> > > Hi,
> > >
> > > You may use the string as you choose, for instance filtering
> > > (fq=language_s:en) or for faceting (facet.field=language_s). What are
> > > you looking to do?
> > >
> > > What would you like to detect on the query side? The language of the
> > > search string? That is very hard since people type very few words
> > > into the search box.
> > >
> > > --
> > > Jan Høydahl, search solution architect
> > > Cominvent AS - www.cominvent.com
> > > Solr Training - www.solrtraining.com
> > >
> > > On 19. jan. 2012, at 09:22, nibing wrote:
> > >
> > > > Hi, all,
> > > >
> > > > I am using Solr3.5.0 which applies Tika0.10 to do language
> > > > detection, and I have a couple of questions about this function.
> > > >
> > > > 1. I can see the outcome of the language detection in a field
> > > > "language_s". But what action will be taken according to the
> > > > different language code? How to configure?
> > > >
> > > > 2. Currently the language detection only happens in indexing. Is it
> > > > possible to use the function in searching as well? How to configure?
> > > >
> > > > Many thanks.
> > > >
> > > > Best Regards
> > > > Ni, Bing
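
For reference, the query-side suggestions quoted in this thread (filter on "language_s" for the candidate languages, then group and facet on the same field) could be sketched roughly as below. This is only an illustration, written in Python for brevity: the helper names build_language_params and to_query_string are hypothetical, the list of candidate languages is assumed to come from whatever language detection is applied to the query, and the request parameters (fq, facet.field, group, group.field) should be checked against the Solr version in use.

```python
from urllib.parse import urlencode


def build_language_params(user_query, candidate_langs, lang_field="language_s"):
    """Build Solr request parameters for a single shared index.

    candidate_langs is assumed to be the (possibly ambiguous) output of a
    language detector applied to the query string, e.g. ["en", "de"].
    """
    params = [("q", user_query)]

    # Restrict the search to the candidate languages, if any were detected.
    if candidate_langs:
        langs = " OR ".join(sorted(set(candidate_langs)))
        params.append(("fq", f"{lang_field}:({langs})"))

    # Group results per language so the user can pick a language set.
    params.append(("group", "true"))
    params.append(("group.field", lang_field))

    # Facet counts per language, as in facet.field=language_s.
    params.append(("facet", "true"))
    params.append(("facet.field", lang_field))
    return params


def to_query_string(params):
    """URL-encode the parameters for use in a /select request."""
    return urlencode(params)


if __name__ == "__main__":
    print(to_query_string(build_language_params("solr tika", ["en", "de"])))
```

If detection returns nothing usable for a short query, the filter is simply left out and the grouping on "language_s" still lets the user choose between language sets.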