Re: Stemming and other tokenizers

2011-09-20 Thread Pranav Prakash
I have a similar use case, but slightly more flexible and straightforward. In my case, I have a field "language" which stores 'en', 'es' or whatever the language of the document is. Then the field 'transcript' stores the actual content, which is in the language given by the "language" field. Follo

Re: Stemming and other tokenizers

2011-09-12 Thread Jan Høydahl
Hi, Do they? Can you explain the layout of the documents? There are two ways to handle multilingual docs. If all your docs have both an English and a Norwegian version, you may either split these into two separate documents, each with the "language" field filled by LangId - which then also l

Re: Stemming and other tokenizers

2011-09-12 Thread Manish Bafna
What if a single document has multiple languages? On Mon, Sep 12, 2011 at 2:23 PM, Jan Høydahl wrote: > Hi > > Everybody else use dedicated field per language, so why can't you? > Please explain your use case, and perhaps we can better help understand > what you're trying to do. > Do you always kn

Re: Stemming and other tokenizers

2011-09-12 Thread Jan Høydahl
Hi, Everybody else uses a dedicated field per language, so why can't you? Please explain your use case, and perhaps we can better help understand what you're trying to do. Do you always know the query language in advance? -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Sol

Re: Stemming and other tokenizers

2011-09-11 Thread Patrick Sauts
I can't create one field per language; that is the problem. But I'll dig into it following your suggestions, and I'll let you know what I come up with. Patrick. 2011/9/11 Jan Høydahl > Hi, > > You'll not be able to detect language and change stemmer on the same field > in one go. You need to cre

Re: Stemming and other tokenizers

2011-09-11 Thread Jan Høydahl
Hi, You'll not be able to detect language and change stemmer on the same field in one go. You need to create one fieldType in your schema per language you want to use, and then use LanguageIdentification (SOLR-1979) to do the magic of detecting language and renaming the field. If you set langid
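
The "detect language, then rename the field" idea described in this message can be illustrated outside Solr with a small Python sketch. This is not the SOLR-1979 implementation or its configuration. The function names, the `<field>_<lang>` naming scheme, and the toy detector are all assumptions made for illustration, and a real setup would use a proper language identifier and one fieldType per language in the schema.

```python
def toy_detect(text):
    """Stand-in detector (assumption): guesses 'no' if a few common
    Norwegian words appear, otherwise 'en'. Not a real language identifier."""
    norwegian_hints = {"og", "ikke", "jeg", "det"}
    words = set(text.lower().split())
    return "no" if words & norwegian_hints else "en"


def rename_field_by_language(doc, field, detect):
    """Return a copy of doc with `field` renamed to `<field>_<lang>`.

    The suffixed name is hypothetical; the point is that each language
    ends up in its own field, which can then carry its own stemmer.
    """
    renamed = dict(doc)
    if field in renamed:
        lang = detect(renamed[field])
        renamed[f"{field}_{lang}"] = renamed.pop(field)
    return renamed


doc = {"id": "1", "body": "jeg liker det"}
print(rename_field_by_language(doc, "body", toy_detect))
```

Passing the detector in as an argument keeps the renaming step independent of how the language is actually determined.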

Stemming and other tokenizers

2011-09-09 Thread Patrick Sauts
Hello, I want to implement some kind of AutoStemming that will detect the language of a field based on a tag at the start of the field, like #en#. My field is stored on disc, but I don't want this tag to be stored. Is there a way to avoid storing this field? To me all the filters and the t
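
One way to approach the tag described in this post is to strip it on the client side before the document is sent for indexing, so that the stored value never contains it. The sketch below is only an illustration under assumptions: the two-letter `#xx#` format is inferred from the `#en#` example, and the function name and the default language are hypothetical.

```python
import re

# Assumed tag format, inferred from the "#en#" example in the post:
# a two-letter lowercase code between '#' characters at the very start.
TAG_PATTERN = re.compile(r"^#([a-z]{2})#\s*")


def split_language_tag(value, default_lang="en"):
    """Return (lang, text) with the leading #xx# tag removed, if present.

    Values without a tag are returned unchanged with the default language
    (the default itself is an assumption).
    """
    match = TAG_PATTERN.match(value)
    if match is None:
        return default_lang, value
    return match.group(1), value[match.end():]


lang, text = split_language_tag("#en# my field")
print(lang, repr(text))
```

The returned language code can then be used to choose the target field, while only the tag-free text is stored.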