Ni, Bing,

I believe you will need to pre-define a field for each language you want to handle and specify an appropriate language-specific analyzer for each of those fields. This also means that when you encounter a new language, you will need to adjust your schema to support it. Of course, your language identifier will also need to be trained to recognize that language.
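
To make the per-language-field idea concrete, here is a minimal sketch in Python of how indexing code might route text into a language-specific field. The field names (text_en, text_general), the set of supported languages, and both helper functions are assumptions made up for illustration; only language_s comes from this thread. Each text_* field would still need a matching entry in your schema with its own analyzer.

```python
# Hypothetical naming convention: one text field per language, e.g. "text_en".
# These names are assumptions for illustration, not Solr defaults.
SUPPORTED_LANGUAGES = {"en", "fr", "de", "zh-cn"}
FALLBACK_FIELD = "text_general"


def field_for_language(lang_code):
    """Return the schema field that should hold text in the given language."""
    code = lang_code.strip().lower()
    if code in SUPPORTED_LANGUAGES:
        # Field names cannot rely on "-" being safe everywhere, so use "_".
        return "text_" + code.replace("-", "_")
    # Unknown or unsupported language: fall back to a generic field.
    return FALLBACK_FIELD


def build_document(doc_id, text, lang_code):
    """Build a document dict that stores the text in its language-specific field."""
    return {
        "id": doc_id,
        "language_s": lang_code,
        field_for_language(lang_code): text,
    }
```

For example, build_document("42", "Bonjour", "fr") would place the text in text_fr, while an unrecognized language code falls back to text_general.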
I know ElasticSearch allows you to specify a field analyzer at index time, which is very handy. I don't think you can do that with Solr today, but maybe I'm forgetting something! :)

Otis
----
Performance Monitoring SaaS for Solr - http://sematext.com/spm/solr-performance-monitoring/index.html

----- Original Message -----
> From: nibing <nibing_...@hotmail.com>
> To: solr-user@lucene.apache.org
> Cc:
> Sent: Friday, January 20, 2012 1:51 AM
> Subject: RE: Tika0.10 language identifier in Solr3.5.0
>
> Hi, Ted Dunning,
>
> Thank you for your reply. I can understand your point on putting a
> "language_s" field and then keeping all the files together, which
> speed-up searching.
>
> But then there occurs a problem of using analyzer in indexing. I assume
> files encoded in different language should be handled using different
> analyzers, i.e. different tokenizers and filters. Can you elaborate a
> little bit on the design that you propose, especially in how files
> encoded in different languages can be handled during the indexing?
> Thank you.
>
> Best Regards
>
> Ni, Bing
>
>> From: ted.dunn...@gmail.com
>> Date: Fri, 20 Jan 2012 03:55:48 +0000
>> Subject: Re: Tika0.10 language identifier in Solr3.5.0
>> To: solr-user@lucene.apache.org
>>
>> Normally this is done by putting a field on each document rather than
>> separating the documents into separate corpora. Keeping them together
>> makes the final search faster.
>>
>> At query time, you can add all of the language keys that you think are
>> relevant based on your language id applied to the query and then group the
>> results on those keys so that users can inspect different language results.
>> If you require the correct language key, you should get pretty good
>> retrieval speed.
>>
>> On Fri, Jan 20, 2012 at 3:35 AM, nibing <nibing_...@hotmail.com> wrote:
>>
>> >
>> > Hi, Jan Høydahl
>> > You are right. I am hoping to detect the language of a query, so that
>> > the searching can be done according to the language detected. Since
>> > people often type a few words, which is too few to detect, then it is
>> > hard to do that. Let me describe a little bit about the solr server in
>> > my design. It consists of several cores, corresponding to the several
>> > languages, which is built during indexing. Since language detection in
>> > indexing can be done with Tika identifier, then we are currently OK.
>> > But the problem is about searching. I want to do language detection
>> > first before do searching in the individual cores. In the case that
>> > detection result is ambiguous and several languages are returned, we
>> > probably returns a set of results, and let user to decide which
>> > language set of results they want to look into. In general, it is just
>> > the same with the language supported by google. Do you have some
>> > suggestions if I want to achieve multilingual search described as
>> > above? Thank you.
>> > Best Regards
>> > Ni, Bing
>> >
>> > > Subject: Re: Tika0.10 language identifier in Solr3.5.0
>> > > From: jan....@cominvent.com
>> > > Date: Thu, 19 Jan 2012 12:31:01 +0100
>> > > To: solr-user@lucene.apache.org
>> > >
>> > > Hi,
>> > >
>> > > You may use the string as you choose, for instance filtering
>> > > (fq=language_s:en) or for faceting (facet.field=language_s). What
>> > > are you looking to do?
>> > >
>> > > What would you like to detect on the query side? The language of the
>> > > search string? That is very hard since people type very few words
>> > > into the search box.
>> > >
>> > > --
>> > > Jan Høydahl, search solution architect
>> > > Cominvent AS - www.cominvent.com
>> > > Solr Training - www.solrtraining.com
>> > >
>> > > On 19. jan. 2012, at 09:22, nibing wrote:
>> > >
>> > > > Hi, all,
>> > > >
>> > > > I am using Solr3.5.0 which applies Tika0.10 to do language
>> > > > detection, and I have a couple of questions about this function.
>> > > >
>> > > > 1. I can see the outcome of the language detection in a field
>> > > > "language_s". But what action will be taken according to the
>> > > > different language code? How to configure?
>> > > >
>> > > > 2. Currently the language detection only happens in indexing. Is
>> > > > it possible to use the function in searching as well? How to
>> > > > configure?
>> > > >
>> > > > Many thanks.
>> > > >
>> > > > Best Regards
>> > > >
>> > > > Ni, Bing