Re: Problem with Russian stemmer in Solr 1.2

Daniel Alheiros Tue, 17 Jul 2007 04:07:35 -0700

Hi Andrew.

Here is an example of one filter factory:


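import java.util.Map;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ru.RussianCharsets;
import org.apache.lucene.analysis.ru.RussianStemFilter;
import org.apache.solr.analysis.BaseTokenFilterFactory;

public class RussianStemFilterFactory extends BaseTokenFilterFactory {
    private char[] charset;

    /** @see org.apache.solr.analysis.BaseTokenFilterFactory#init(java.util.Map) */
    @Override
    public void init(Map<String, String> args) {
        super.init(args);
        // "charsetName" is the attribute set on the <filter> element.
        String charsetName = args.get("charsetName");
        // Assumption: the name selects a table in Lucene's RussianCharsets;
        // only UnicodeRussian is handled here.
        if (!"UnicodeRussian".equals(charsetName)) {
            throw new IllegalArgumentException("Unsupported charsetName: " + charsetName);
        }
        this.charset = RussianCharsets.UnicodeRussian;
    }

    /** @see org.apache.solr.analysis.TokenFilterFactory#create(org.apache.lucene.analysis.TokenStream) */
    public TokenStream create(TokenStream tokenStream) {
        return new RussianStemFilter(tokenStream, charset);
    }
}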


Calling args.get(String) returns the value of an attribute defined on the
matching element in your schema.xml, like this:

    <filter class="myCompany.RussianStemFilterFactory" charsetName="UnicodeRussian"/>
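
The attribute lookup can be tried out without Solr on the classpath. The class below is a hypothetical stand-in, not Solr's BaseTokenFilterFactory: it only mimics how the attributes of a <filter> element arrive in init(Map) as strings. The check for a missing charsetName is an addition for illustration, so a typo in schema.xml fails at startup instead of at analysis time.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a Solr factory; it has no Solr dependency.
class AttributeLookupSketch {
    // Attributes copied from the schema.xml element: attribute name -> value.
    private Map<String, String> args;
    private String charsetName;

    // Mimics init(Map): keep the attributes and read the one we need.
    void init(Map<String, String> attributes) {
        this.args = attributes;
        String name = args.get("charsetName");
        if (name == null) {
            // Fail early rather than with a NullPointerException later on.
            throw new IllegalArgumentException("charsetName attribute is required");
        }
        this.charsetName = name;
    }

    String getCharsetName() {
        return charsetName;
    }

    public static void main(String[] argv) {
        // Equivalent of: <filter class="..." charsetName="UnicodeRussian"/>
        Map<String, String> attributes = new HashMap<String, String>();
        attributes.put("charsetName", "UnicodeRussian");

        AttributeLookupSketch factory = new AttributeLookupSketch();
        factory.init(attributes);
        System.out.println(factory.getCharsetName());
    }
}
```

Every attribute value is a plain String, so any conversion (to a charset table, a number, a file path) is up to the factory itself.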


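And for a tokenizer that prepares the input for your filters:

import java.io.Reader;
import java.util.Map;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ru.RussianCharsets;
import org.apache.lucene.analysis.ru.RussianLetterTokenizer;
import org.apache.solr.analysis.BaseTokenizerFactory;
import org.apache.solr.analysis.HTMLStripReader;

public class HTMLStripRussianLetterTokenizerFactory extends BaseTokenizerFactory {
    private char[] charset;

    /** @see org.apache.solr.analysis.BaseTokenizerFactory#init(java.util.Map) */
    @Override
    public void init(Map<String, String> args) {
        super.init(args);
        String charsetName = args.get("charsetName");
        // Assumption: the name selects a table in Lucene's RussianCharsets;
        // only UnicodeRussian is handled here.
        if (!"UnicodeRussian".equals(charsetName)) {
            throw new IllegalArgumentException("Unsupported charsetName: " + charsetName);
        }
        this.charset = RussianCharsets.UnicodeRussian;
    }

    /** @see org.apache.solr.analysis.TokenizerFactory#create(Reader) */
    public TokenStream create(Reader reader) {
        // Strip the HTML markup first, then tokenize on Russian letters.
        return new RussianLetterTokenizer(new HTMLStripReader(reader), this.charset);
    }
}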

    <tokenizer class="myCompany.HTMLStripRussianLetterTokenizerFactory" charsetName="UnicodeRussian"/>
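
To tie the two factories together, a field type in schema.xml could chain them roughly as follows. This is only a sketch: the field type name text_ru and the file name stopwords_ru.txt are made up for illustration, and solr.StopFilterFactory is the stock Solr stop filter, shown as one way to point at your own stopword file.

```xml
<fieldType name="text_ru" class="solr.TextField">
  <analyzer>
    <!-- strips HTML, then splits the text on Russian letters -->
    <tokenizer class="myCompany.HTMLStripRussianLetterTokenizerFactory"
               charsetName="UnicodeRussian"/>
    <!-- hypothetical stopword file placed in the Solr conf directory -->
    <filter class="solr.StopFilterFactory" words="stopwords_ru.txt" ignoreCase="true"/>
    <filter class="myCompany.RussianStemFilterFactory" charsetName="UnicodeRussian"/>
  </analyzer>
</fieldType>
```

A lower-casing step would normally sit before the stop filter and the stemmer, as it does in RussianAnalyzer. It is left out here because it would need its own factory along the same lines.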

I hope it helps.

Regards,
Daniel

On 17/7/07 11:34, "Andrew Stromnov" <[EMAIL PROTECTED]> wrote:

> 
> Hi Daniel
> 
> How do I implement a custom Russian factory with various Tokenizers and Filters?
> 
> Can you provide some code examples?
> 
> Regards,
> Andrew
> 
> 
> Daniel Alheiros wrote:
>> 
>> Hi Andrew
>> 
>> Yes, I saw that. As I don't know Russian, I had to assume it was
>> adequate. But as you have much more to add to it, it would be good if
>> you could contribute that.
>> 
>> The problem is that the Russian analyzer and its filters are all final
>> classes, which rules out an elegant extension. But you can create an
>> analyzer that reuses what is useful to you (in this case, the stemmer)
>> and customizes the other filters. I would suggest doing that by creating
>> Solr factories, so you can point to the files containing your stopwords.
>> Any chance you could contribute this stopword list?
>> 
>> One of my reasons for not using RussianAnalyzer directly was that I need
>> to use a WhitespaceTokenizer that removes HTML code... So I created my
>> own factories.
>> 
>> Regards,
>> Daniel 
>> 


