UIMA: I just found this issue: https://issues.apache.org/jira/browse/SOLR-3013. Now I am able to use this analyzer for English texts and filter out (un)wanted token types :-)
<fieldType name="uima_nouns_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.UIMATypeAwareAnnotationsTokenizerFactory"
               descriptorPath="/uima/AggregateSentenceAE.xml"
               tokenType="org.apache.uima.TokenAnnotation"
               featurePath="posTag"/>
    <filter class="solr.TypeTokenFilterFactory" types="/uima/stoptypes.txt"/>
  </analyzer>
</fieldType>

Open issue: how do I set the ModelFile for the Tagger to "german/TuebaModel.dat"? (An untested idea is sketched at the very bottom of this mail.)

OpenNLP: A modified patch for https://issues.apache.org/jira/browse/LUCENE-2899 is now working with Solr 4.1 :-)

<fieldType name="nlp_nouns_de" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.OpenNLPTokenizerFactory" tokenizerModel="opennlp/de-token.bin"/>
    <filter class="solr.OpenNLPFilterFactory" posTaggerModel="opennlp/de-pos-maxent.bin"/>
    <filter class="solr.FilterPayloadsFilterFactory" payloadList="NN,NNS,NNP,NNPS,FM" keepPayloads="true"/>
    <filter class="solr.StripPayloadsFilterFactory"/>
  </analyzer>
</fieldType>

Any hints on which library is more accurate at noun tagging? Any performance or memory issues? (I got some OOMs here while testing with 1 GB via the Analysis admin GUI.)

Regards,
Kai Gülzau


-----Original Message-----
From: Kai Gülzau [mailto:kguel...@novomind.com]
Sent: Thursday, January 31, 2013 2:19 PM
To: solr-user@lucene.apache.org
Subject: Indexing nouns only - UIMA vs. OpenNLP

Hi,

I am stuck trying to index only the nouns of German and English texts
(very similar to http://wiki.apache.org/solr/OpenNLP#Full_Example).

First try was to use UIMA with the HMMTagger:

<processor class="org.apache.solr.uima.processor.UIMAUpdateRequestProcessorFactory">
  <lst name="uimaConfig">
    <lst name="runtimeParameters"></lst>
    <str name="analysisEngine">/org/apache/uima/desc/AggregateSentenceAE.xml</str>
    <bool name="ignoreErrors">false</bool>
    <lst name="analyzeFields">
      <bool name="merge">false</bool>
      <arr name="fields"><str>albody</str></arr>
    </lst>
    <lst name="fieldMappings">
      <lst name="type">
        <str name="name">org.apache.uima.SentenceAnnotation</str>
        <lst name="mapping">
          <str name="feature">coveredText</str>
          <str name="field">albody2</str>
        </lst>
      </lst>
    </lst>
  </lst>
</processor>

- But how do I set the ModelFile to use the German corpus?
- What about language identification?
  -- How do I use the right corpus/tagger based on the language?
  -- Should this be done in UIMA (how?) or via the Solr contrib/langid field mapping?
- How do I remove non-nouns from the annotated field?

Second try is to use OpenNLP and to apply the patch
https://issues.apache.org/jira/browse/LUCENE-2899, but the patch seems
to be a bit out of date. Currently I am trying to get it to work with
Solr 4.1.

Any pointers appreciated :-)

Regards,
Kai Gülzau
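
Re the open ModelFile question above: for the UIMAUpdateRequestProcessorFactory setup in the quoted mail, one untested idea is to pass the parameter through the runtimeParameters block, which (as far as I understand) is applied as configuration parameter overrides on the analysis engine. This can only work if AggregateSentenceAE.xml actually declares an override named "ModelFile" for the HMMTagger delegate, which I have not verified, and I don't know of an equivalent hook for the UIMATypeAwareAnnotationsTokenizerFactory:

<lst name="uimaConfig">
  <lst name="runtimeParameters">
    <!-- assumption: the aggregate exposes a parameter override named "ModelFile"
         for the HMMTagger delegate; if it does not, this setting has no effect -->
    <str name="ModelFile">german/TuebaModel.dat</str>
  </lst>
  <str name="analysisEngine">/org/apache/uima/desc/AggregateSentenceAE.xml</str>
  <!-- rest of the uimaConfig unchanged from the quoted mail -->
</lst>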
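
Re the language identification question from the quoted mail: a minimal, untested sketch of the contrib/langid route, assuming the langid contrib jars (plus the langdetect library) are on the classpath and that per-language fields like albody_de / albody_en (backed by fieldTypes such as nlp_nouns_de / uima_nouns_en) exist in the schema. The chain name, the "language" field and the "en" fallback are just placeholders:

<updateRequestProcessorChain name="langid-nouns">
  <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <!-- detect the language of albody and store it in the "language" field -->
    <str name="langid.fl">albody</str>
    <str name="langid.langField">language</str>
    <str name="langid.whitelist">de,en</str>
    <str name="langid.fallback">en</str>
    <!-- rename albody to albody_de / albody_en so each language hits its own analyzer -->
    <bool name="langid.map">true</bool>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

The chain would then have to be referenced from the /update handler (e.g. via update.chain), and the noun filtering stays entirely on the analysis side of the per-language fields, so no UIMA update processor is needed for that part.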