Not yet an answer to (2), but this is where and how Solr initializes the Analyzer defined in schema.xml:
// org.apache.solr.schema.IndexSchema

// Load the Tokenizer
// Although an analyzer only allows a single Tokenizer, we load a list to make sure
// the configuration is ok
// --------------------------------------------------------------------------------
final ArrayList<TokenizerFactory> tokenizers = new ArrayList<TokenizerFactory>(1);
AbstractPluginLoader<TokenizerFactory> tokenizerLoader =
    new AbstractPluginLoader<TokenizerFactory>( "[schema.xml] analyzer/tokenizer", false, false )
{
  @Override
  protected void init(TokenizerFactory plugin, Node node) throws Exception {
    if( !tokenizers.isEmpty() ) {
      throw new SolrException( SolrException.ErrorCode.SERVER_ERROR,
          "The schema defines multiple tokenizers for: "+node );
    }
    final Map<String,String> params = DOMUtil.toMapExcept(node.getAttributes(),"class");

    // copy the luceneMatchVersion from config, if not set
    if (!params.containsKey(LUCENE_MATCH_VERSION_PARAM))
      params.put(LUCENE_MATCH_VERSION_PARAM, solrConfig.luceneMatchVersion.toString());
    plugin.init( params );
    tokenizers.add( plugin );
  }

  @Override
  protected TokenizerFactory register(String name, TokenizerFactory plugin) throws Exception {
    return null; // used for map registration
  }
};
tokenizerLoader.load( loader, (NodeList)xpath.evaluate("./tokenizer", node, XPathConstants.NODESET) );

// Make sure something was loaded
if( tokenizers.isEmpty() ) {
  throw new SolrException(SolrException.ErrorCode.SERVER_ERROR,
      "analyzer without class or tokenizer & filter list");
}

// Load the Filters
// --------------------------------------------------------------------------------
final ArrayList<TokenFilterFactory> filters = new ArrayList<TokenFilterFactory>();
AbstractPluginLoader<TokenFilterFactory> filterLoader =
    new AbstractPluginLoader<TokenFilterFactory>( "[schema.xml] analyzer/filter", false, false )
{
  @Override
  protected void init(TokenFilterFactory plugin, Node node) throws Exception {
    if( plugin != null ) {
      final Map<String,String> params = DOMUtil.toMapExcept(node.getAttributes(),"class");

      // copy the luceneMatchVersion from config, if not set
      if (!params.containsKey(LUCENE_MATCH_VERSION_PARAM))
        params.put(LUCENE_MATCH_VERSION_PARAM, solrConfig.luceneMatchVersion.toString());
      plugin.init( params );
      filters.add( plugin );
    }
  }

  @Override
  protected TokenFilterFactory register(String name, TokenFilterFactory plugin) throws Exception {
    return null; // used for map registration
  }
};
filterLoader.load( loader, (NodeList)xpath.evaluate("./filter", node, XPathConstants.NODESET) );

return new TokenizerChain(charFilters.toArray(new CharFilterFactory[charFilters.size()]),
    tokenizers.get(0), filters.toArray(new TokenFilterFactory[filters.size()]));

On Tue, Jul 5, 2011 at 2:26 PM, Gabriele Kahlout <gabri...@mysimpatico.com> wrote:

> I suspect the following should do (1). I'm just not sure about file
> references, as in stopInit.put("words", "stopwords.txt"). (2) should
> clarify.
>
> 1)
> class SchemaAnalyzer extends Analyzer {
>
>     @Override
>     public TokenStream tokenStream(String fieldName, Reader reader) {
>         HashMap<String, String> stopInit = new HashMap<String, String>();
>         stopInit.put("words", "stopwords.txt");
>         stopInit.put("ignoreCase", Boolean.TRUE.toString());
>         StopFilterFactory stopFilterFactory = new StopFilterFactory();
>         stopFilterFactory.init(stopInit);
>
>         final HashMap<String, String> wordDelimInit = new HashMap<String, String>();
>         wordDelimInit.put("generateWordParts", "1");
>         wordDelimInit.put("generateNumberParts", "1");
>         wordDelimInit.put("catenateWords", "1");
>         wordDelimInit.put("catenateNumbers", "1");
>         wordDelimInit.put("catenateAll", "0");
>         wordDelimInit.put("splitOnCaseChange", "1");
>
>         WordDelimiterFilterFactory wordDelimiterFilterFactory = new WordDelimiterFilterFactory();
>         wordDelimiterFilterFactory.init(wordDelimInit);
>
>         HashMap<String, String> porterInit = new HashMap<String, String>();
>         porterInit.put("protected", "protwords.txt");
>         EnglishPorterFilterFactory englishPorterFilterFactory = new EnglishPorterFilterFactory();
>         englishPorterFilterFactory.init(porterInit);
>
>         return new RemoveDuplicatesTokenFilter(
>             englishPorterFilterFactory.create(
>                 new LowerCaseFilter(
>                     wordDelimiterFilterFactory.create(
>                         stopFilterFactory.create(
>                             new WhitespaceTokenizer(reader))))));
>     }
> }
>
> On Tue, Jul 5, 2011 at 1:00 PM, Gabriele Kahlout <gabri...@mysimpatico.com> wrote:
>
>> nice...where?
>>
>> I'm trying to figure out 2 things:
>> 1) How to create an analyzer that corresponds to the one in the schema.xml:
>>
>> <analyzer>
>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>     <filter class="solr.WordDelimiterFilterFactory"
>>             generateWordParts="1" generateNumberParts="1"/>
>> </analyzer>
>>
>> 2) I'd like to see the code that creates it by reading schema.xml.
>>
>> On Tue, Jul 5, 2011 at 12:33 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
>>
>>> No. SolrJ only builds input docs from NutchDocument objects. Solr will do
>>> analysis. The integration is analogous to XML post of Solr documents.
>>>
>>> On Tuesday 05 July 2011 12:28:21 Gabriele Kahlout wrote:
>>>> Hello,
>>>>
>>>> I'm trying to understand better the Nutch and Solr integration. My
>>>> understanding is that Documents are added to the Solr index from
>>>> SolrWriter's write(NutchDocument doc) method. But does it make any use
>>>> of the WhitespaceTokenizerFactory?
>>>
>>> --
>>> Markus Jelsma - CTO - Openindex
>>> http://www.linkedin.com/in/markus17
>>> 050-8536620 / 06-50258350

--
Regards,
K. Gabriele

--- unchanged since 20/9/10 ---
P.S. If the subject contains "[LON]" or the addressee acknowledges the receipt within 48 hours then I don't resend the email.
subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x) < Now + 48h) ⇒ ¬resend(I, this).

If an email is sent by a sender that is not a trusted contact or the email does not contain a valid code then the email is not received. A valid code starts with a hyphen and ends with "X".
∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈ L(-[a-z]+[0-9]X)).
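P.S. To make the idea behind question (1) concrete without pulling in the Lucene/Solr jars, here is a minimal plain-JDK sketch of a tokenizer-plus-filter chain in the spirit of Solr's TokenizerChain: whitespace tokenization, then a lowercase filter, then a stop-word filter. The class and method names (MiniAnalyzerChain, analyze) and the tiny stop-word set are my own illustrative inventions, not Solr APIs; the real chain, of course, streams tokens through TokenStream instances rather than materializing a list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative stand-in for a Solr-style analyzer chain, using only the JDK:
// whitespace tokenizer -> lowercase filter -> stop filter.
public class MiniAnalyzerChain {
    // Hypothetical stop set; Solr would load this from stopwords.txt.
    private static final Set<String> STOP_WORDS = Set.of("a", "an", "the");

    public static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("\\s+")) {           // whitespace tokenizer
            String t = raw.toLowerCase();                  // lowercase filter
            if (!t.isEmpty() && !STOP_WORDS.contains(t)) { // stop filter
                tokens.add(t);
            }
        }
        return tokens;
    }
}
```

For example, analyze("The Quick BROWN fox") keeps the lowercased non-stop tokens quick, brown, fox, which is the same shape of pipeline the schema.xml <analyzer> element declares, just with the factories hard-wired instead of loaded by AbstractPluginLoader.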