Re: Does Nutch make any use of solr.WhitespaceTokenizerFactory defined in schema.xml?

Gabriele Kahlout Tue, 05 Jul 2011 07:15:29 -0700

the answer to 2) is new IndexSchema(solrConf, schema).getAnalyzer();
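
For anyone who wants to see the shape of that loading step without pulling in the Solr jars, here is a rough sketch that mimics the IndexSchema code quoted below using only the JDK's DOM and XPath classes. It is not Solr's implementation: AnalyzerSpecReader, Component and AnalyzerSpec are hypothetical names made up for illustration, and the sketch only collects class names and parameters instead of instantiating factories. What it keeps from the quoted code is the XPath lookups ("./tokenizer", "./filter"), the "every attribute except class" parameter map, and the checks for zero or multiple tokenizers.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper for illustration; not part of Solr or Nutch. */
class AnalyzerSpecReader {

    /** One <tokenizer> or <filter> declaration: class name plus its other attributes. */
    static final class Component {
        final String className;
        final Map<String, String> params;

        Component(String className, Map<String, String> params) {
            this.className = className;
            this.params = params;
        }
    }

    /** The parsed <analyzer>: exactly one tokenizer and zero or more filters, in order. */
    static final class AnalyzerSpec {
        final Component tokenizer;
        final List<Component> filters;

        AnalyzerSpec(Component tokenizer, List<Component> filters) {
            this.tokenizer = tokenizer;
            this.filters = filters;
        }
    }

    /** Same idea as the quoted DOMUtil.toMapExcept(attrs, "class") call. */
    static Component toComponent(Node node) {
        NamedNodeMap attrs = node.getAttributes();
        String className = null;
        Map<String, String> params = new LinkedHashMap<>();
        for (int i = 0; i < attrs.getLength(); i++) {
            Node attr = attrs.item(i);
            if ("class".equals(attr.getNodeName())) {
                className = attr.getNodeValue();
            } else {
                params.put(attr.getNodeName(), attr.getNodeValue());
            }
        }
        return new Component(className, params);
    }

    /** Parses a single <analyzer> element given as a string. */
    static AnalyzerSpec read(String analyzerXml) {
        NodeList tokenizerNodes;
        NodeList filterNodes;
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(analyzerXml)));
            Node analyzer = doc.getDocumentElement();
            XPath xpath = XPathFactory.newInstance().newXPath();
            tokenizerNodes = (NodeList) xpath.evaluate("./tokenizer", analyzer, XPathConstants.NODESET);
            filterNodes = (NodeList) xpath.evaluate("./filter", analyzer, XPathConstants.NODESET);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse analyzer XML", e);
        }

        // An analyzer allows exactly one tokenizer.
        if (tokenizerNodes.getLength() == 0) {
            throw new IllegalStateException("analyzer without tokenizer");
        }
        if (tokenizerNodes.getLength() > 1) {
            throw new IllegalStateException("the schema defines multiple tokenizers");
        }

        // Filters are kept in document order, which is the order they are applied in.
        List<Component> filters = new ArrayList<>();
        for (int i = 0; i < filterNodes.getLength(); i++) {
            filters.add(toComponent(filterNodes.item(i)));
        }
        return new AnalyzerSpec(toComponent(tokenizerNodes.item(0)), filters);
    }

    public static void main(String[] args) {
        AnalyzerSpec spec = read(
                "<analyzer>"
                + "<tokenizer class=\"solr.StandardTokenizerFactory\"/>"
                + "<filter class=\"solr.LowerCaseFilterFactory\"/>"
                + "<filter class=\"solr.WordDelimiterFilterFactory\""
                + " generateWordParts=\"1\" generateNumberParts=\"1\"/>"
                + "</analyzer>");
        System.out.println(spec.tokenizer.className + " + " + spec.filters.size() + " filters");
    }
}
```

In real Solr the collected class names would be resolved to factory instances and wired into a TokenizerChain, as the quoted code shows; this sketch stops at the configuration step.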

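On Markus's original point, quoted at the bottom of this thread (Nutch only builds input documents and Solr does the analysis), a small illustration may help. The sketch below is not SolrJ and not Nutch code: SolrAddXmlSketch is a hypothetical class that hand-builds the kind of <add><doc> XML message one would post to Solr, and the field names "id" and "content" are just examples. The point is that field values leave the client as raw, untokenized strings; whatever tokenizer the schema declares for a field is applied on the Solr side.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration; not part of SolrJ or Nutch. */
class SolrAddXmlSketch {

    /** Minimal XML escaping for text and attribute values. */
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /** Builds an <add> message holding one document; values are sent as-is, with no analysis. */
    static String toAddXml(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("<add><doc>");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            sb.append("<field name=\"").append(escape(e.getKey())).append("\">")
              .append(escape(e.getValue()))
              .append("</field>");
        }
        return sb.append("</doc></add>").toString();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("id", "http://example.com/");
        fields.put("content", "Hello World & more");
        System.out.println(toAddXml(fields));
    }
}
```

Nothing in this message mentions a tokenizer, which is why a client-side Analyzer is not needed just to index documents.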

On Tue, Jul 5, 2011 at 2:48 PM, Gabriele Kahlout
<gabri...@mysimpatico.com> wrote:

> Not yet an answer to 2), but this is where and how Solr initializes the
> Analyzer defined in schema.xml:
>
> // org.apache.solr.schema.IndexSchema
>     // Load the Tokenizer
>     // Although an analyzer only allows a single Tokenizer, we load a list to make sure
>     // the configuration is ok
>     // --------------------------------------------------------------------------------
>     final ArrayList<TokenizerFactory> tokenizers = new ArrayList<TokenizerFactory>(1);
>     AbstractPluginLoader<TokenizerFactory> tokenizerLoader =
>       new AbstractPluginLoader<TokenizerFactory>( "[schema.xml] analyzer/tokenizer", false, false )
>     {
>       @Override
>       protected void init(TokenizerFactory plugin, Node node) throws Exception {
>         if( !tokenizers.isEmpty() ) {
>           throw new SolrException( SolrException.ErrorCode.SERVER_ERROR,
>               "The schema defines multiple tokenizers for: "+node );
>         }
>         final Map<String,String> params = DOMUtil.toMapExcept(node.getAttributes(),"class");
>         // copy the luceneMatchVersion from config, if not set
>         if (!params.containsKey(LUCENE_MATCH_VERSION_PARAM))
>           params.put(LUCENE_MATCH_VERSION_PARAM, solrConfig.luceneMatchVersion.toString());
>         plugin.init( params );
>         tokenizers.add( plugin );
>       }
>
>       @Override
>       protected TokenizerFactory register(String name, TokenizerFactory plugin) throws Exception {
>         return null; // used for map registration
>       }
>     };
>     tokenizerLoader.load( loader, (NodeList)xpath.evaluate("./tokenizer", node, XPathConstants.NODESET) );
>
>     // Make sure something was loaded
>     if( tokenizers.isEmpty() ) {
>       throw new SolrException(SolrException.ErrorCode.SERVER_ERROR,
>           "analyzer without class or tokenizer & filter list");
>     }
>
>
>     // Load the Filters
>     // --------------------------------------------------------------------------------
>     final ArrayList<TokenFilterFactory> filters = new ArrayList<TokenFilterFactory>();
>     AbstractPluginLoader<TokenFilterFactory> filterLoader =
>       new AbstractPluginLoader<TokenFilterFactory>( "[schema.xml] analyzer/filter", false, false )
>     {
>       @Override
>       protected void init(TokenFilterFactory plugin, Node node) throws Exception {
>         if( plugin != null ) {
>           final Map<String,String> params = DOMUtil.toMapExcept(node.getAttributes(),"class");
>           // copy the luceneMatchVersion from config, if not set
>           if (!params.containsKey(LUCENE_MATCH_VERSION_PARAM))
>             params.put(LUCENE_MATCH_VERSION_PARAM, solrConfig.luceneMatchVersion.toString());
>           plugin.init( params );
>           filters.add( plugin );
>         }
>       }
>
>       @Override
>       protected TokenFilterFactory register(String name, TokenFilterFactory plugin) throws Exception {
>         return null; // used for map registration
>       }
>     };
>     filterLoader.load( loader, (NodeList)xpath.evaluate("./filter", node, XPathConstants.NODESET) );
>
>     return new TokenizerChain(charFilters.toArray(new CharFilterFactory[charFilters.size()]),
>         tokenizers.get(0), filters.toArray(new TokenFilterFactory[filters.size()]));
>   };
>
>
>
> On Tue, Jul 5, 2011 at 2:26 PM, Gabriele Kahlout
> <gabri...@mysimpatico.com> wrote:
>
>> I suspect the following should do for (1). I'm just not sure how file
>> references such as stopInit.put("words", "stopwords.txt") get resolved;
>> (2) should clarify that.
>>
>> 1)
>> class SchemaAnalyzer extends Analyzer {
>>
>>     @Override
>>     public TokenStream tokenStream(String fieldName, Reader reader) {
>>         // Stop-word filter settings
>>         HashMap<String, String> stopInit = new HashMap<String, String>();
>>         stopInit.put("words", "stopwords.txt");
>>         stopInit.put("ignoreCase", Boolean.TRUE.toString());
>>         StopFilterFactory stopFilterFactory = new StopFilterFactory();
>>         stopFilterFactory.init(stopInit);
>>
>>         // Word-delimiter filter settings
>>         final HashMap<String, String> wordDelimInit = new HashMap<String, String>();
>>         wordDelimInit.put("generateWordParts", "1");
>>         wordDelimInit.put("generateNumberParts", "1");
>>         wordDelimInit.put("catenateWords", "1");
>>         wordDelimInit.put("catenateNumbers", "1");
>>         wordDelimInit.put("catenateAll", "0");
>>         wordDelimInit.put("splitOnCaseChange", "1");
>>         WordDelimiterFilterFactory wordDelimiterFilterFactory = new WordDelimiterFilterFactory();
>>         wordDelimiterFilterFactory.init(wordDelimInit);
>>
>>         // Stemmer settings
>>         HashMap<String, String> porterInit = new HashMap<String, String>();
>>         porterInit.put("protected", "protwords.txt");
>>         EnglishPorterFilterFactory englishPorterFilterFactory = new EnglishPorterFilterFactory();
>>         englishPorterFilterFactory.init(porterInit);
>>
>>         // Chain, innermost first: whitespace tokenizer, stop words, word
>>         // delimiter, lower case, stemmer, duplicate removal
>>         return new RemoveDuplicatesTokenFilter(
>>             englishPorterFilterFactory.create(
>>                 new LowerCaseFilter(
>>                     wordDelimiterFilterFactory.create(
>>                         stopFilterFactory.create(
>>                             new WhitespaceTokenizer(reader))))));
>>     }
>> }
>>
>> On Tue, Jul 5, 2011 at 1:00 PM, Gabriele Kahlout <
>> gabri...@mysimpatico.com> wrote:
>>
>>> nice...where?
>>>
>>> I'm trying to figure out 2 things:
>>> 1) How to create an analyzer that corresponds to the one in the
>>> schema.xml.
>>>
>>> <analyzer>
>>>   <tokenizer class="solr.StandardTokenizerFactory"/>
>>>   <filter class="solr.LowerCaseFilterFactory"/>
>>>   <filter class="solr.WordDelimiterFilterFactory"
>>>           generateWordParts="1" generateNumberParts="1"/>
>>> </analyzer>
>>>
>>> 2) I'd like to see the code that creates it by reading it from schema.xml.
>>>
>>>
>>> On Tue, Jul 5, 2011 at 12:33 PM, Markus Jelsma <
>>> markus.jel...@openindex.io> wrote:
>>>
>>>> No. SolrJ only builds input docs from NutchDocument objects. Solr will do
>>>> the analysis. The integration is analogous to an XML post of Solr documents.
>>>>
>>>> On Tuesday 05 July 2011 12:28:21 Gabriele Kahlout wrote:
>>>> > Hello,
>>>> >
>>>> > I'm trying to better understand the Nutch and Solr integration. My
>>>> > understanding is that documents are added to the Solr index from
>>>> > SolrWriter's write(NutchDocument doc) method. But does it make any use
>>>> > of the WhitespaceTokenizerFactory?
>>>>
>>>> --
>>>> Markus Jelsma - CTO - Openindex
>>>> http://www.linkedin.com/in/markus17
>>>> 050-8536620 / 06-50258350
>>>>
>>>
>>>
>>>
>>> --
>>> Regards,
>>> K. Gabriele
>>>
>>> --- unchanged since 20/9/10 ---
>>> P.S. If the subject contains "[LON]" or the addressee acknowledges the
>>> receipt within 48 hours then I don't resend the email.
>>> subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧
>>> time(x) < Now + 48h) ⇒ ¬resend(I, this).
>>>
>>> If an email is sent by a sender that is not a trusted contact or the
>>> email does not contain a valid code then the email is not received. A valid
>>> code starts with a hyphen and ends with "X".
>>> ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
>>> L(-[a-z]+[0-9]X)).
>>>
>>>
>>
>>
>> --
>> Regards,
>> K. Gabriele
>>
>>
>
>
> --
> Regards,
> K. Gabriele
>
>


-- 
Regards,
K. Gabriele
