Re: Implementing custom analyzer for multi-language stemming

2014-09-18 Thread roman-v1
Is there a way for the tokenizer to set an attribute on the document, so that I can search by both a word and this attribute?

Re: Implementing custom analyzer for multi-language stemming

2014-09-18 Thread atawfik
Hi, The author of Solr in Action has produced something similar to what you want. I have even used it in one of my projects where I needed to analyze languages automatically. Here is the link to the code: https://github.com/treygrainger/solr-in-action/tree/master/src/main/java/sia/ch14

Re: Implementing custom analyzer for multi-language stemming

2014-09-17 Thread roman-v1
If each token has a LanguageAttribute on it, then when I search by word and language with highlighting switched on, every word of the sentence will be highlighted. Because of this, the solution does not fit.

Re: Implementing custom analyzer for multi-language stemming

2014-08-06 Thread Rich Cariens
Yes, each token could have a LanguageAttribute on it, just like ScriptAttributes. I didn't *think* a span would be necessary. I would also add a multivalued "lang" field to the document. Searching English documents for "die" might look like: "q=die&lang=eng". The "lang" param could tell the Reques
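
A minimal Python sketch of how the "q=die&lang=eng" request above might be assembled. The "lang" parameter is the one proposed in the message, not a built-in Solr parameter, so a custom request handler is assumed to interpret it. The second helper is an assumption added for comparison: it uses Solr's standard "fq" filter parameter against the multivalued "lang" field. The function names are hypothetical.

```python
from urllib.parse import urlencode


def build_lang_query(term, lang):
    """Query string in the shape shown in the message: q=<term>&lang=<lang>.

    "lang" is the custom parameter proposed in the thread; something on the
    server side would have to read it (assumption).
    """
    return urlencode({"q": term, "lang": lang})


def build_filter_query(term, lang):
    """Alternative sketch: restrict results with a standard filter query
    on the multivalued "lang" document field."""
    return urlencode({"q": term, "fq": f"lang:{lang}"})


if __name__ == "__main__":
    print(build_lang_query("die", "eng"))
    print(build_filter_query("die", "eng"))
```

The first form leaves it to the handler to pick the right stemmer for the query term, while the second only narrows the result set to documents tagged with that language.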

Re: Implementing custom analyzer for multi-language stemming

2014-08-05 Thread TK
On 8/5/14, 8:36 AM, Rich Cariens wrote: Of course this is extremely primitive and basic, but I think it would be possible to write a CharFilter or TokenFilter that inspects the entire TokenStream to guess the language(s), perhaps even noting where languages change. Language and position informat
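
To make the quoted idea concrete, here is a toy Python sketch of a pass over a token sequence that guesses a language per token and notes where the language changes. It is not Lucene code and not the filter from the thread: the stopword lists, the function names, and the "inherit the most recently seen language" rule are all assumptions chosen for illustration.

```python
# Tiny, hypothetical stopword lists; a real detector would use far more evidence.
STOPWORDS = {
    "eng": {"the", "and", "is", "of"},
    "deu": {"der", "die", "das", "und", "ist"},
}


def tag_languages(tokens, default="und"):
    """Return (token, language) pairs.

    A token found in a stopword list switches the current language; other
    tokens inherit the most recently seen one. Tokens before the first hit
    take the first detected language, or `default` if nothing was detected.
    """
    tagged = []
    current = None
    for tok in tokens:
        for lang, words in STOPWORDS.items():
            if tok.lower() in words:
                current = lang
                break
        tagged.append((tok, current))
    first = next((lang for _, lang in tagged if lang is not None), default)
    return [(tok, lang if lang is not None else first) for tok, lang in tagged]


def change_points(tagged):
    """Positions at which the guessed language differs from the previous token."""
    return [i for i in range(1, len(tagged)) if tagged[i][1] != tagged[i - 1][1]]


if __name__ == "__main__":
    tagged = tag_languages("the cat is here und der Hund ist da".split())
    print(tagged)
    print(change_points(tagged))
```

In an analysis chain, the per-token language and the change positions would be the kind of information a later stemming step could consult.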

Re: Implementing custom analyzer for multi-language stemming

2014-08-05 Thread Rich Cariens
I've started a GitHub project to try out some cross-lingual analysis ideas (https://github.com/whateverdood/cross-lingual-search). I haven't played over there for about 3 months, but plan on restarting work there shortly. In a nutshell, the interesting component ("SimplePolyGlotStemmingTokenFilter

Re: Implementing custom analyzer for multi-language stemming

2014-08-04 Thread TK
On 7/30/14, 10:47 AM, Eugene wrote: Hello, fellow Solr and Lucene users and developers! In our project we receive text from users in different languages. We detect language automatically and use Google Translate APIs a lot (so having arbitrary number of languages in our system doesn't

Re: Implementing custom analyzer for multi-language stemming

2014-08-02 Thread Umesh Prasad
Also, take a look at the Lucid revolution talk "Typed Index": https://www.youtube.com/watch?v=X93DaRfi790 (published on 25 Nov 2013; presented by Christoph Goller, Chief Scientist, IntraFind Software AG). If you want to search in a multilingual environment with high-quality language-specific word-no

Re: Implementing custom analyzer for multi-language stemming

2014-07-30 Thread Sujit Pal
Hi Eugene, In a system we built a couple of years ago, we had a mixed corpus of English and French (with Spanish on the way, but that was implemented by the client after we handed off). We had different fields for each language. So (title, body) for English docs was (title_en, body_en), for French (title_
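
The per-language field layout described above could be sketched roughly as follows in Python. This is an illustration, not the system from the message: the function name, the set of supported languages, and the "fr" suffix for French are assumptions (the message is cut off before the French field names).

```python
TEXT_FIELDS = ("title", "body")


def localized_fields(doc, lang, supported=("en", "fr")):
    """Return a copy of `doc` with its text fields renamed to <field>_<lang>.

    Non-text fields such as "id" are copied unchanged. The language code is
    assumed to have been detected before indexing.
    """
    if lang not in supported:
        raise ValueError(f"no language-specific fields configured for {lang!r}")
    out = {}
    for key, value in doc.items():
        if key in TEXT_FIELDS:
            out[f"{key}_{lang}"] = value  # e.g. title -> title_en
        else:
            out[key] = value
    return out


if __name__ == "__main__":
    print(localized_fields({"id": "1", "title": "Hello", "body": "Some text"}, "en"))
```

Each suffixed field can then be bound to its own analyzer in the schema, so that English and French text are stemmed by different chains.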

re: Implementing custom analyzer for multi-language stemming

2014-07-30 Thread Chris Morley
I know BasisTech.com has a plugin for Elasticsearch that extends stemming/lemmatization to work across 40 natural languages. I'm not sure what they have for Solr, but I think something like that may exist as well. Cheers, -Chris.