Re: Does Nutch make any use of solr.WhitespaceTokenizerFactory defined in schema.xml?

Gabriele Kahlout Tue, 05 Jul 2011 05:26:38 -0700

I suspect the following should do for (1). I'm just not sure how file
references such as stopInit.put("words", "stopwords.txt") get resolved;
seeing the code for (2) should clarify that.


1)
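// JDK imports. The Lucene/Solr classes used below are assumed to come from
// org.apache.lucene.analysis and org.apache.solr.analysis (Solr 3.x layout);
// adjust the packages to your version.
import java.io.Reader;
import java.util.HashMap;
import java.util.Map;

class SchemaAnalyzer extends Analyzer {

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Stop filter. "words" names a resource file; init() only stores the
        // argument, and file-backed arguments are normally loaded in
        // inform(ResourceLoader), which this code never calls. That is the
        // open question about file references.
        Map<String, String> stopInit = new HashMap<String, String>();
        stopInit.put("words", "stopwords.txt");
        stopInit.put("ignoreCase", Boolean.TRUE.toString());
        StopFilterFactory stopFilterFactory = new StopFilterFactory();
        stopFilterFactory.init(stopInit);

        // Word delimiter filter, configured as in schema.xml.
        Map<String, String> wordDelimInit = new HashMap<String, String>();
        wordDelimInit.put("generateWordParts", "1");
        wordDelimInit.put("generateNumberParts", "1");
        wordDelimInit.put("catenateWords", "1");
        wordDelimInit.put("catenateNumbers", "1");
        wordDelimInit.put("catenateAll", "0");
        wordDelimInit.put("splitOnCaseChange", "1");
        WordDelimiterFilterFactory wordDelimiterFilterFactory =
                new WordDelimiterFilterFactory();
        wordDelimiterFilterFactory.init(wordDelimInit);

        // Porter stemmer; "protected" is another file reference (see above).
        Map<String, String> porterInit = new HashMap<String, String>();
        porterInit.put("protected", "protwords.txt");
        EnglishPorterFilterFactory englishPorterFilterFactory =
                new EnglishPorterFilterFactory();
        englishPorterFilterFactory.init(porterInit);

        // Build the chain from the tokenizer outwards:
        // whitespace -> stop -> word delimiter -> lower case -> stem -> dedupe.
        TokenStream stream = new WhitespaceTokenizer(reader);
        stream = stopFilterFactory.create(stream);
        stream = wordDelimiterFilterFactory.create(stream);
        stream = new LowerCaseFilter(stream);
        stream = englishPorterFilterFactory.create(stream);
        return new RemoveDuplicatesTokenFilter(stream);
    }
}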
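
For (2), here is a rough, JDK-only sketch of reading an <analyzer> element
and listing its tokenizer and filters together with their init arguments.
It is not Solr's own schema-loading code: AnalyzerChainReader and readChain
are made-up names for illustration, and the sketch only shows the shape of
the argument map that each factory's init() would receive.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper for illustration; not part of Nutch or Solr. */
class AnalyzerChainReader {

    /**
     * Parses an <analyzer> XML fragment and returns one entry per
     * <tokenizer>/<filter> child, in document order, formatted as
     * "tag className {arg=value, ...}" with arguments sorted by name.
     */
    static List<String> readChain(String analyzerXml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(analyzerXml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse analyzer XML", e);
        }

        List<String> chain = new ArrayList<String>();
        NodeList children = doc.getDocumentElement().getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node node = children.item(i);
            if (node.getNodeType() != Node.ELEMENT_NODE) {
                continue; // skip whitespace text nodes and comments
            }
            Element element = (Element) node;
            String tag = element.getTagName();
            if (!tag.equals("tokenizer") && !tag.equals("filter")) {
                continue;
            }
            // Every attribute except "class" becomes an init argument.
            Map<String, String> args = new TreeMap<String, String>();
            NamedNodeMap attributes = element.getAttributes();
            for (int j = 0; j < attributes.getLength(); j++) {
                Node attribute = attributes.item(j);
                if (!"class".equals(attribute.getNodeName())) {
                    args.put(attribute.getNodeName(), attribute.getNodeValue());
                }
            }
            chain.add(tag + " " + element.getAttribute("class") + " " + args);
        }
        return chain;
    }
}
```

Feeding it the <analyzer> snippet quoted below should give three entries,
with the WordDelimiterFilterFactory entry carrying generateWordParts and
generateNumberParts as its arguments.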

On Tue, Jul 5, 2011 at 1:00 PM, Gabriele Kahlout <gabri...@mysimpatico.com> wrote:

> nice...where?
>
> I'm trying to figure out 2 things:
> 1) How to create an analyzer that corresponds to the one in the schema.xml.
>
>
>  <analyzer>
>    <tokenizer class="solr.StandardTokenizerFactory"/>
>    <filter class="solr.LowerCaseFilterFactory"/>
>    <filter class="solr.WordDelimiterFilterFactory"
>            generateWordParts="1" generateNumberParts="1"/>
>  </analyzer>
>
> 2) I'd like to see the code that creates it by reading schema.xml.
>
>
> On Tue, Jul 5, 2011 at 12:33 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
>
>> No. SolrJ only builds input docs from NutchDocument objects. Solr will do
>> analysis. The integration is analogous to XML post of Solr documents.
>>
>> On Tuesday 05 July 2011 12:28:21 Gabriele Kahlout wrote:
>> > Hello,
>> >
>> > I'm trying to better understand Nutch and Solr integration. My
>> > understanding is that Documents are added to the Solr index from
>> > SolrWriter's write(NutchDocument doc) method. But does it make any
>> > use of the WhitespaceTokenizerFactory?
>>
>> --
>> Markus Jelsma - CTO - Openindex
>> http://www.linkedin.com/in/markus17
>> 050-8536620 / 06-50258350
>>
>
>
>
> --
> Regards,
> K. Gabriele
>
> --- unchanged since 20/9/10 ---
> P.S. If the subject contains "[LON]" or the addressee acknowledges the
> receipt within 48 hours then I don't resend the email.
> subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧
> time(x) < Now + 48h) ⇒ ¬resend(I, this).
>
> If an email is sent by a sender that is not a trusted contact or the email
> does not contain a valid code then the email is not received. A valid code
> starts with a hyphen and ends with "X".
> ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
> L(-[a-z]+[0-9]X)).
>
>


-- 
Regards,
K. Gabriele

