Re: How to set up RussianAnalyzer?

Chris Hostetter Mon, 27 Nov 2006 11:29:40 -0800

: I downloaded lucene-analyzers-2.0.0.jar and placed it in the lib
: directory (tried both solr/lib and solr/example/lib) but I keep getting
: the same error (ClassNotFoundException, see stack trace below) when
: starting the server. Have I missed a step here??


"solr/lib" is the lib dir used when compiling solr, "solr/example/lib" is
the lib dir used by Jetty ... what you want is "${solr.home}/lib" ...
which in the example install of Jetty would be "solr/example/solr/lib"

...it's a bit confusing i know.

: definite answer to my question above. I have also investigated the next
: step which is the ability to add stop words, synonyms etc. It seems that
: I will have to create my own Factory(ies?) but didn't find a detailed
: explanation on how to do this. I found this useful:

Adding new Factories requires some knowledge of writing/compiling Java
classes ... if you have a basic understanding of Java development, there
isn't really a lot of specific Solr/Lucene information you need, you just
subclass one of the existing Base Factory classes and define your own
"create" method ... things only get tricky if you need to do anything
complicated with configuration arguments.

If you wanted to write new factories for the RussianLetterTokenizer and
Russian*Filter classes, they would probably look very similar to the
example in the URL you mentioned, but with some simple argument processing
to decide which charset to use, maybe something like this...

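        import org.apache.lucene.analysis.TokenStream;
        import org.apache.lucene.analysis.ru.RussianCharsets;
        import org.apache.lucene.analysis.ru.RussianStemFilter;
        import org.apache.solr.analysis.BaseTokenFilterFactory;

        public class RussianStemFilterFactory extends BaseTokenFilterFactory {
          public TokenStream create(TokenStream input) {
            // "charset" is optional; fall back to Unicode when it is absent
            String charsetName = getArgs().get("charset");
            char[] charset = RussianCharsets.UnicodeRussian;
            if ("KOI8".equals(charsetName)) charset = RussianCharsets.KOI8;
            if ("CP1251".equals(charsetName)) charset = RussianCharsets.CP1251;
            return new RussianStemFilter(input, charset);
          }
        }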

Then you could use them like this...

    <fieldtype name="textRU" class="solr.TextField">
      <analyzer>
        <tokenizer class="yourpackage.RussianLetterTokenizerFactory"
                   charset="CP1251"/>
        <filter class="yourpackage.RussianLowerCaseFilterFactory"
                charset="CP1251"/>
        <filter class="solr.StopFilterFactory"
                words="yourwords.txt"/>
        <filter class="yourpackage.RussianStemFilterFactory"
                charset="CP1251"/>
      </analyzer>
    </fieldtype>
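
If the same charset argument ends up in three factories, the lookup could
live in one small helper ... what follows is only a rough sketch using
nothing but the JDK. "CharsetArg" and "resolve" are made-up names (not
part of Solr or Lucene), and the three accepted values just mirror the
RussianCharsets constants used above. Each factory would then map the
returned name to the matching RussianCharsets constant.

```java
import java.util.Map;

// Hypothetical helper, not part of Solr or Lucene: turns the optional
// "charset" attribute of a factory's init args into one of three names.
final class CharsetArg {
  static final String DEFAULT = "UnicodeRussian";

  private CharsetArg() {}

  static String resolve(Map<String, String> args) {
    String name = args.get("charset");
    if (name == null) {
      return DEFAULT; // attribute omitted in schema.xml
    }
    if (name.equals("UnicodeRussian") || name.equals("KOI8") || name.equals("CP1251")) {
      return name;
    }
    // Fail loudly on a typo instead of silently using the default
    throw new IllegalArgumentException("Unknown charset: " + name);
  }
}
```

Rejecting unknown values is a design choice, not a requirement ... it means
a misspelled charset in schema.xml shows up at startup instead of as
oddly tokenized text later.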




-Hoss
