The answer to 2) is: new IndexSchema(solrConf, schema).getAnalyzer();
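To make the loading steps in the IndexSchema code quoted below easier to follow without the Solr source at hand, here is a minimal plain-Java sketch of the same idea: evaluate "./tokenizer" and "./filter" under an <analyzer> node, insist on exactly one tokenizer, and keep the filters in document order. It uses only the JDK's DOM and XPath classes. The names AnalyzerChainSketch, Plugin, Chain, load and describe are made up for illustration and are not Solr API; the real code instantiates factories and returns a TokenizerChain, which this sketch does not attempt.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch, not Solr code: it only records class names and parameters.
class AnalyzerChainSketch {

    /** One <tokenizer> or <filter> element: its class attribute plus all other attributes. */
    static final class Plugin {
        final String className;
        final Map<String, String> params;

        Plugin(String className, Map<String, String> params) {
            this.className = className;
            this.params = params;
        }
    }

    /** Stand-in for the chain the quoted code builds: one tokenizer, then ordered filters. */
    static final class Chain {
        final Plugin tokenizer;
        final List<Plugin> filters;

        Chain(Plugin tokenizer, List<Plugin> filters) {
            this.tokenizer = tokenizer;
            this.filters = filters;
        }

        String describe() {
            StringBuilder sb = new StringBuilder(tokenizer.className);
            for (Plugin f : filters) {
                sb.append(" -> ").append(f.className);
            }
            return sb.toString();
        }
    }

    // Copies every attribute except "class" into the parameter map,
    // which is the role DOMUtil.toMapExcept plays in the quoted code.
    static Plugin toPlugin(Node node) {
        Map<String, String> params = new LinkedHashMap<>();
        String className = null;
        NamedNodeMap attrs = node.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            Node attr = attrs.item(i);
            if ("class".equals(attr.getNodeName())) {
                className = attr.getNodeValue();
            } else {
                params.put(attr.getNodeName(), attr.getNodeValue());
            }
        }
        return new Plugin(className, params);
    }

    static Chain load(String analyzerXml) {
        NodeList tokenizerNodes;
        NodeList filterNodes;
        try {
            Node analyzer = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(analyzerXml)))
                    .getDocumentElement();
            XPath xpath = XPathFactory.newInstance().newXPath();
            tokenizerNodes = (NodeList) xpath.evaluate("./tokenizer", analyzer, XPathConstants.NODESET);
            filterNodes = (NodeList) xpath.evaluate("./filter", analyzer, XPathConstants.NODESET);
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse analyzer XML", e);
        }

        // An analyzer allows exactly one tokenizer; reject anything else.
        if (tokenizerNodes.getLength() == 0) {
            throw new IllegalStateException("analyzer without tokenizer");
        }
        if (tokenizerNodes.getLength() > 1) {
            throw new IllegalStateException("the schema defines multiple tokenizers");
        }

        // Filters are kept in document order, which is the order they are applied in.
        List<Plugin> filters = new ArrayList<>();
        for (int i = 0; i < filterNodes.getLength(); i++) {
            filters.add(toPlugin(filterNodes.item(i)));
        }
        return new Chain(toPlugin(tokenizerNodes.item(0)), filters);
    }

    public static void main(String[] args) {
        String xml = "<analyzer>"
                + "<tokenizer class=\"solr.StandardTokenizerFactory\"/>"
                + "<filter class=\"solr.LowerCaseFilterFactory\"/>"
                + "<filter class=\"solr.WordDelimiterFilterFactory\""
                + " generateWordParts=\"1\" generateNumberParts=\"1\"/>"
                + "</analyzer>";
        System.out.println(load(xml).describe());
    }
}
```

Run against the <analyzer> fragment from the original question, describe() lists the tokenizer followed by the two filters in the order they appear in the XML.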
On Tue, Jul 5, 2011 at 2:48 PM, Gabriele Kahlout <gabri...@mysimpatico.com> wrote:

> Not yet an answer to 2), but this is where and how Solr initializes the
> Analyzer defined in schema.xml:
>
> //org.apache.solr.schema.IndexSchema
>     // Load the Tokenizer
>     // Although an analyzer only allows a single Tokenizer, we load a list to make sure
>     // the configuration is ok
>     // --------------------------------------------------------------------------------
>     final ArrayList<TokenizerFactory> tokenizers = new ArrayList<TokenizerFactory>(1);
>     AbstractPluginLoader<TokenizerFactory> tokenizerLoader =
>       new AbstractPluginLoader<TokenizerFactory>( "[schema.xml] analyzer/tokenizer", false, false )
>     {
>       @Override
>       protected void init(TokenizerFactory plugin, Node node) throws Exception {
>         if( !tokenizers.isEmpty() ) {
>           throw new SolrException( SolrException.ErrorCode.SERVER_ERROR,
>               "The schema defines multiple tokenizers for: "+node );
>         }
>         final Map<String,String> params = DOMUtil.toMapExcept(node.getAttributes(),"class");
>         // copy the luceneMatchVersion from config, if not set
>         if (!params.containsKey(LUCENE_MATCH_VERSION_PARAM))
>           params.put(LUCENE_MATCH_VERSION_PARAM, solrConfig.luceneMatchVersion.toString());
>         plugin.init( params );
>         tokenizers.add( plugin );
>       }
>
>       @Override
>       protected TokenizerFactory register(String name, TokenizerFactory plugin) throws Exception {
>         return null; // used for map registration
>       }
>     };
>     tokenizerLoader.load( loader, (NodeList)xpath.evaluate("./tokenizer", node, XPathConstants.NODESET) );
>
>     // Make sure something was loaded
>     if( tokenizers.isEmpty() ) {
>       throw new SolrException(SolrException.ErrorCode.SERVER_ERROR,
>           "analyzer without class or tokenizer & filter list");
>     }
>
>     // Load the Filters
>     // --------------------------------------------------------------------------------
>     final ArrayList<TokenFilterFactory> filters = new ArrayList<TokenFilterFactory>();
>     AbstractPluginLoader<TokenFilterFactory> filterLoader =
>       new AbstractPluginLoader<TokenFilterFactory>( "[schema.xml] analyzer/filter", false, false )
>     {
>       @Override
>       protected void init(TokenFilterFactory plugin, Node node) throws Exception {
>         if( plugin != null ) {
>           final Map<String,String> params = DOMUtil.toMapExcept(node.getAttributes(),"class");
>           // copy the luceneMatchVersion from config, if not set
>           if (!params.containsKey(LUCENE_MATCH_VERSION_PARAM))
>             params.put(LUCENE_MATCH_VERSION_PARAM, solrConfig.luceneMatchVersion.toString());
>           plugin.init( params );
>           filters.add( plugin );
>         }
>       }
>
>       @Override
>       protected TokenFilterFactory register(String name, TokenFilterFactory plugin) throws Exception {
>         return null; // used for map registration
>       }
>     };
>     filterLoader.load( loader, (NodeList)xpath.evaluate("./filter", node, XPathConstants.NODESET) );
>
>     return new TokenizerChain(charFilters.toArray(new CharFilterFactory[charFilters.size()]),
>         tokenizers.get(0), filters.toArray(new TokenFilterFactory[filters.size()]));
>   };
>
> On Tue, Jul 5, 2011 at 2:26 PM, Gabriele Kahlout <gabri...@mysimpatico.com> wrote:
>
>> I suspect the following should do (1). I'm just not sure about file
>> references as in stopInit.put("words", "stopwords.txt"). (2) should
>> clarify.
>>
>> 1)
>> class SchemaAnalyzer extends Analyzer {
>>
>>     @Override
>>     public TokenStream tokenStream(String fieldName, Reader reader) {
>>         // Stop filter: file reference as in schema.xml
>>         HashMap<String, String> stopInit = new HashMap<String, String>();
>>         stopInit.put("words", "stopwords.txt");
>>         stopInit.put("ignoreCase", Boolean.TRUE.toString());
>>         StopFilterFactory stopFilterFactory = new StopFilterFactory();
>>         stopFilterFactory.init(stopInit);
>>
>>         // Word delimiter filter
>>         final HashMap<String, String> wordDelimInit = new HashMap<String, String>();
>>         wordDelimInit.put("generateWordParts", "1");
>>         wordDelimInit.put("generateNumberParts", "1");
>>         wordDelimInit.put("catenateWords", "1");
>>         wordDelimInit.put("catenateNumbers", "1");
>>         wordDelimInit.put("catenateAll", "0");
>>         wordDelimInit.put("splitOnCaseChange", "1");
>>
>>         WordDelimiterFilterFactory wordDelimiterFilterFactory = new WordDelimiterFilterFactory();
>>         wordDelimiterFilterFactory.init(wordDelimInit);
>>
>>         // Porter stemmer with a protected-words file
>>         HashMap<String, String> porterInit = new HashMap<String, String>();
>>         porterInit.put("protected", "protwords.txt");
>>         EnglishPorterFilterFactory englishPorterFilterFactory = new EnglishPorterFilterFactory();
>>         englishPorterFilterFactory.init(porterInit);
>>
>>         // Innermost call runs first: whitespace tokenizer, stop, word delimiter,
>>         // lower case, stemmer, then duplicate removal.
>>         return new RemoveDuplicatesTokenFilter(
>>             englishPorterFilterFactory.create(
>>                 new LowerCaseFilter(
>>                     wordDelimiterFilterFactory.create(
>>                         stopFilterFactory.create(
>>                             new WhitespaceTokenizer(reader))))));
>>     }
>> }
>>
>> On Tue, Jul 5, 2011 at 1:00 PM, Gabriele Kahlout <gabri...@mysimpatico.com> wrote:
>>
>>> nice...where?
>>>
>>> I'm trying to figure out 2 things:
>>> 1) How to create an analyzer that corresponds to the one in the
>>> schema.xml.
>>>
>>> <analyzer>
>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>     <filter class="solr.WordDelimiterFilterFactory"
>>>             generateWordParts="1" generateNumberParts="1"/>
>>> </analyzer>
>>>
>>> 2) I'd like to see the code that creates it reading it from schema.xml.
>>>
>>> On Tue, Jul 5, 2011 at 12:33 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
>>>
>>>> No. SolrJ only builds input docs from NutchDocument objects. Solr will do
>>>> analysis. The integration is analogous to XML post of Solr documents.
>>>>
>>>> On Tuesday 05 July 2011 12:28:21 Gabriele Kahlout wrote:
>>>> > Hello,
>>>> >
>>>> > I'm trying to understand better Nutch and Solr integration. My
>>>> > understanding is that Documents are added to Solr index from SolrWriter's
>>>> > write(NutchDocument doc) method. But does it make any use of the
>>>> > WhitespaceTokenizerFactory?
>>>>
>>>> --
>>>> Markus Jelsma - CTO - Openindex
>>>> http://www.linkedin.com/in/markus17
>>>> 050-8536620 / 06-50258350
>>>
>>> --
>>> Regards,
>>> K. Gabriele
>>>
>>> --- unchanged since 20/9/10 ---
>>> P.S. If the subject contains "[LON]" or the addressee acknowledges the
>>> receipt within 48 hours then I don't resend the email.
>>> subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧
>>> time(x) < Now + 48h) ⇒ ¬resend(I, this).
>>>
>>> If an email is sent by a sender that is not a trusted contact or the
>>> email does not contain a valid code then the email is not received. A valid
>>> code starts with a hyphen and ends with "X".
>>> ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
>>> L(-[a-z]+[0-9]X)).
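
On Markus's point earlier in the thread that the integration is analogous to an XML post of Solr documents: the client only ships raw field values, and tokenizing and filtering happen on the Solr side according to schema.xml. A minimal sketch of what such an <add> message looks like when built by hand is below. SolrXmlPostSketch, escape and toAddMessage are hypothetical names, and the "id" and "title" fields are assumptions chosen for illustration, not fields taken from Nutch or from any schema in this thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: builds the XML update message by hand, with no Solr or SolrJ classes.
class SolrXmlPostSketch {

    // Escapes the characters that are special in XML text and attribute values.
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Wraps the raw field values in <add><doc>...</doc></add>.
    // No tokenizer or filter is applied here; that is left to the server.
    static String toAddMessage(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("<add><doc>");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            sb.append("<field name=\"").append(escape(e.getKey())).append("\">")
              .append(escape(e.getValue()))
              .append("</field>");
        }
        return sb.append("</doc></add>").toString();
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("id", "http://example.com/");      // assumed field name
        doc.put("title", "Hello Nutch & Solr");    // assumed field name
        System.out.println(toAddMessage(doc));
    }
}
```

The field text goes out unanalyzed (only XML-escaped), which is why nothing like WhitespaceTokenizerFactory is involved on the Nutch side.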