Otis,

Can you say why there needs to be a field per language? Why not have a polyglot analyzer?

On Fri, Jan 20, 2012 at 7:29 AM, Otis Gospodnetic <otis_gospodne...@yahoo.com> wrote:

> Ni, Bing
>
> I believe you will need to pre-define fields for all languages you want to
> handle and specify an appropriate language-specific analyzer for each of
> those fields.
> This also means that if you encounter a new language, you will need to
> adjust your schema to support a new language. Of course, your Lang
> Identifier will also need to be trained to recognize that language.
>
> I know ElasticSearch allows you to specify a field analyzer at index time,
> which is very handy. I don't think you can do that with Solr today, but
> maybe I'm forgetting something! :)
>
> Otis
> ----
> Performance Monitoring SaaS for Solr -
> http://sematext.com/spm/solr-performance-monitoring/index.html
>
>
> ----- Original Message -----
> > From: nibing <nibing_...@hotmail.com>
> > To: solr-user@lucene.apache.org
> > Cc:
> > Sent: Friday, January 20, 2012 1:51 AM
> > Subject: RE: Tika0.10 language identifier in Solr3.5.0
> >
> > Hi, Ted Dunning,
> >
> > Thank you for your reply. I can understand your point on putting a
> > "language_s" field and then keeping all the files together, which
> > speeds up searching.
> >
> > But then there is a problem with choosing the analyzer during indexing.
> > I assume files in different languages should be handled using different
> > analyzers, i.e. different tokenizers and filters. Can you elaborate a
> > little bit on the design that you propose, especially on how files in
> > different languages can be handled during indexing?
> > Thank you.
> >
> > Best Regards
> >
> > Ni, Bing
> >
> >> From: ted.dunn...@gmail.com
> >> Date: Fri, 20 Jan 2012 03:55:48 +0000
> >> Subject: Re: Tika0.10 language identifier in Solr3.5.0
> >> To: solr-user@lucene.apache.org
> >>
> >> Normally this is done by putting a field on each document rather than
> >> separating the documents into separate corpora. Keeping them together
> >> makes the final search faster.
> >>
> >> At query time, you can add all of the language keys that you think are
> >> relevant based on your language id applied to the query and then group
> >> the results on those keys so that users can inspect different language
> >> results. If you require the correct language key, you should get pretty
> >> good retrieval speed.
> >>
> >> On Fri, Jan 20, 2012 at 3:35 AM, nibing <nibing_...@hotmail.com> wrote:
> >>
> >> > Hi, Jan Høydahl
> >> >
> >> > You are right. I am hoping to detect the language of a query, so
> >> > that the searching can be done according to the language detected.
> >> > Since people often type only a few words, which is too few to
> >> > detect, it is hard to do that.
> >> >
> >> > Let me describe a little bit about the Solr server in my design. It
> >> > consists of several cores, corresponding to the several languages,
> >> > which are built during indexing. Since language detection in
> >> > indexing can be done with the Tika identifier, we are currently OK.
> >> > But the problem is about searching. I want to do language detection
> >> > first, before searching in the individual cores. In the case that
> >> > the detection result is ambiguous and several languages are
> >> > returned, we would probably return a set of results and let the user
> >> > decide which language's result set they want to look into. In
> >> > general, it is just the same as the language support in Google.
> >> >
> >> > Do you have some suggestions if I want to achieve multilingual
> >> > search as described above? Thank you.
> >> >
> >> > Best Regards
> >> > Ni, Bing
> >> >
> >> > > Subject: Re: Tika0.10 language identifier in Solr3.5.0
> >> > > From: jan....@cominvent.com
> >> > > Date: Thu, 19 Jan 2012 12:31:01 +0100
> >> > > To: solr-user@lucene.apache.org
> >> > >
> >> > > Hi,
> >> > >
> >> > > You may use the string as you choose, for instance filtering
> >> > > (fq=language_s:en) or for faceting (facet.field=language_s). What
> >> > > are you looking to do?
> >> > >
> >> > > What would you like to detect on the query side? The language of
> >> > > the search string? That is very hard since people type very few
> >> > > words into the search box.
> >> > >
> >> > > --
> >> > > Jan Høydahl, search solution architect
> >> > > Cominvent AS - www.cominvent.com
> >> > > Solr Training - www.solrtraining.com
> >> > >
> >> > > On 19. jan. 2012, at 09:22, nibing wrote:
> >> > >
> >> > > > Hi, all,
> >> > > >
> >> > > > I am using Solr3.5.0 which applies Tika0.10 to do language
> >> > > > detection, and I have a couple of questions about this function.
> >> > > >
> >> > > > 1. I can see the outcome of the language detection in a field
> >> > > > "language_s". But what action will be taken according to the
> >> > > > different language code? How to configure?
> >> > > >
> >> > > > 2. Currently the language detection only happens in indexing.
> >> > > > Is it possible to use the function in searching as well? How to
> >> > > > configure?
> >> > > >
> >> > > > Many thanks.
> >> > > >
> >> > > > Best Regards
> >> > > >
> >> > > > Ni, Bing
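
For readers following the thread, the query-side approach discussed above (one "language_s" field per document, filtered with fq and grouped at query time) might be sketched as below. This is only an illustration under stated assumptions: the function name language_query_params is hypothetical, the candidate language codes are assumed to come from some query-side language detector that is not shown, and group/group.field are Solr's result-grouping request parameters rather than anything prescribed in the messages.

```python
from urllib.parse import urlencode


def language_query_params(user_query, candidate_langs, field="language_s"):
    """Build Solr select parameters limited to the candidate languages.

    Hypothetical helper: candidate_langs is assumed to be the output of a
    query-side language detector (possibly several codes when ambiguous).
    """
    params = [("q", user_query)]
    if candidate_langs:
        # One filter query that matches any of the candidate language codes.
        clause = " OR ".join(sorted(candidate_langs))
        params.append(("fq", f"{field}:({clause})"))
    # Group hits by language so each language's results can be shown separately.
    params.append(("group", "true"))
    params.append(("group.field", field))
    return params


# Example: the detector could not decide between English and German.
query_string = urlencode(language_query_params("solr tika", ["en", "de"]))
```

The resulting query_string would be appended to a select URL of a single core that holds documents in all languages; when the detector returns nothing, no fq is added and all languages are searched.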