Not yet an answer to (2), but this is where and how Solr initializes the Analyzer defined in schema.xml:
// org.apache.solr.schema.IndexSchema

// Load the Tokenizer
// Although an analyzer only allows a single Tokenizer, we load a list to make sure
// the configuration is ok
// --------------------------------------------------------------------------------
final ArrayList<TokenizerFactory> tokenizers = new ArrayList<TokenizerFactory>(1);
AbstractPluginLoader<TokenizerFactory> tokenizerLoader =
    new AbstractPluginLoader<TokenizerFactory>( "[schema.xml] analyzer/tokenizer", false, false )
{
  @Override
  protected void init(TokenizerFactory plugin, Node node) throws Exception {
    if( !tokenizers.isEmpty() ) {
      throw new SolrException( SolrException.ErrorCode.SERVER_ERROR,
          "The schema defines multiple tokenizers for: "+node );
    }
    final Map<String,String> params = DOMUtil.toMapExcept(node.getAttributes(),"class");

    // copy the luceneMatchVersion from config, if not set
    if (!params.containsKey(LUCENE_MATCH_VERSION_PARAM))
      params.put(LUCENE_MATCH_VERSION_PARAM, solrConfig.luceneMatchVersion.toString());
    plugin.init( params );
    tokenizers.add( plugin );
  }

  @Override
  protected TokenizerFactory register(String name, TokenizerFactory plugin) throws Exception {
    return null; // used for map registration
  }
};
tokenizerLoader.load( loader, (NodeList)xpath.evaluate("./tokenizer", node, XPathConstants.NODESET) );

// Make sure something was loaded
if( tokenizers.isEmpty() ) {
  throw new SolrException(SolrException.ErrorCode.SERVER_ERROR,
      "analyzer without class or tokenizer & filter list");
}

// Load the Filters
// --------------------------------------------------------------------------------
final ArrayList<TokenFilterFactory> filters = new ArrayList<TokenFilterFactory>();
AbstractPluginLoader<TokenFilterFactory> filterLoader =
    new AbstractPluginLoader<TokenFilterFactory>( "[schema.xml] analyzer/filter", false, false )
{
  @Override
  protected void init(TokenFilterFactory plugin, Node node) throws Exception {
    if( plugin != null ) {
      final Map<String,String> params = DOMUtil.toMapExcept(node.getAttributes(),"class");

      // copy the luceneMatchVersion from config, if not set
      if (!params.containsKey(LUCENE_MATCH_VERSION_PARAM))
        params.put(LUCENE_MATCH_VERSION_PARAM, solrConfig.luceneMatchVersion.toString());
      plugin.init( params );
      filters.add( plugin );
    }
  }

  @Override
  protected TokenFilterFactory register(String name, TokenFilterFactory plugin) throws Exception {
    return null; // used for map registration
  }
};
filterLoader.load( loader, (NodeList)xpath.evaluate("./filter", node, XPathConstants.NODESET) );

return new TokenizerChain(charFilters.toArray(new CharFilterFactory[charFilters.size()]),
    tokenizers.get(0), filters.toArray(new TokenFilterFactory[filters.size()]));

On Tue, Jul 5, 2011 at 2:26 PM, Gabriele Kahlout <gabri...@mysimpatico.com> wrote:

> I suspect the following should do (1). I'm just not sure about file
> references, as in stopInit.put("words", "stopwords.txt"). (2) should
> clarify.
>
> 1)
> class SchemaAnalyzer extends Analyzer {
>
>     @Override
>     public TokenStream tokenStream(String fieldName, Reader reader) {
>         HashMap<String, String> stopInit = new HashMap<String, String>();
>         stopInit.put("words", "stopwords.txt");
>         stopInit.put("ignoreCase", Boolean.TRUE.toString());
>         StopFilterFactory stopFilterFactory = new StopFilterFactory();
>         stopFilterFactory.init(stopInit);
>
>         final HashMap<String, String> wordDelimInit = new HashMap<String, String>();
>         wordDelimInit.put("generateWordParts", "1");
>         wordDelimInit.put("generateNumberParts", "1");
>         wordDelimInit.put("catenateWords", "1");
>         wordDelimInit.put("catenateNumbers", "1");
>         wordDelimInit.put("catenateAll", "0");
>         wordDelimInit.put("splitOnCaseChange", "1");
>
>         WordDelimiterFilterFactory wordDelimiterFilterFactory = new WordDelimiterFilterFactory();
>         wordDelimiterFilterFactory.init(wordDelimInit);
>
>         HashMap<String, String> porterInit = new HashMap<String, String>();
>         porterInit.put("protected", "protwords.txt");
>         EnglishPorterFilterFactory englishPorterFilterFactory = new EnglishPorterFilterFactory();
>         englishPorterFilterFactory.init(porterInit);
>
>         return new RemoveDuplicatesTokenFilter(
>             englishPorterFilterFactory.create(
>                 new LowerCaseFilter(
>                     wordDelimiterFilterFactory.create(
>                         stopFilterFactory.create(
>                             new WhitespaceTokenizer(reader))))));
>     }
> }
>
> On Tue, Jul 5, 2011 at 1:00 PM, Gabriele Kahlout <gabri...@mysimpatico.com> wrote:
>
>> nice...where?
>>
>> I'm trying to figure out 2 things:
>> 1) How to create an analyzer that corresponds to the one in the schema.xml:
>>
>> <analyzer>
>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>     <filter class="solr.WordDelimiterFilterFactory"
>>             generateWordParts="1" generateNumberParts="1"/>
>> </analyzer>
>>
>> 2) I'd like to see the code that creates it by reading schema.xml.
>>
>> On Tue, Jul 5, 2011 at 12:33 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
>>
>>> No. SolrJ only builds input docs from NutchDocument objects. Solr will do
>>> analysis. The integration is analogous to XML post of Solr documents.
>>>
>>> On Tuesday 05 July 2011 12:28:21 Gabriele Kahlout wrote:
>>>> Hello,
>>>>
>>>> I'm trying to understand better the Nutch and Solr integration. My
>>>> understanding is that Documents are added to the Solr index from
>>>> SolrWriter's write(NutchDocument doc) method. But does it make any use
>>>> of the WhitespaceTokenizerFactory?
>>>
>>> --
>>> Markus Jelsma - CTO - Openindex
>>> http://www.linkedin.com/in/markus17
>>> 050-8536620 / 06-50258350

--
Regards,
K. Gabriele

--- unchanged since 20/9/10 ---
P.S. If the subject contains "[LON]" or the addressee acknowledges the receipt within 48 hours then I don't resend the email.
subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x) < Now + 48h) ⇒ ¬resend(I, this).

If an email is sent by a sender that is not a trusted contact or the email does not contain a valid code then the email is not received. A valid code starts with a hyphen and ends with "X".
∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈ L(-[a-z]+[0-9]X)).
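P.S. To make the idea behind question (1) concrete without pulling in the Lucene/Solr jars, here is a minimal plain-JDK sketch of a tokenizer-plus-filter chain in the spirit of Solr's TokenizerChain: whitespace tokenization, then a lowercase filter, then a stop-word filter. The class and method names (MiniAnalyzerChain, analyze) and the tiny stop-word set are my own illustrative inventions, not Solr APIs; the real chain, of course, streams tokens through TokenStream instances rather than materializing a list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative stand-in for a Solr-style analyzer chain, using only the JDK:
// whitespace tokenizer -> lowercase filter -> stop filter.
public class MiniAnalyzerChain {
    // Hypothetical stop set; Solr would load this from stopwords.txt.
    private static final Set<String> STOP_WORDS = Set.of("a", "an", "the");

    public static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("\\s+")) {           // whitespace tokenizer
            String t = raw.toLowerCase();                  // lowercase filter
            if (!t.isEmpty() && !STOP_WORDS.contains(t)) { // stop filter
                tokens.add(t);
            }
        }
        return tokens;
    }
}
```

For example, analyze("The Quick BROWN fox") keeps the lowercased non-stop tokens quick, brown, fox, which is the same shape of pipeline the schema.xml <analyzer> element declares, just with the factories hard-wired instead of loaded by AbstractPluginLoader.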