UIMA:

I just found this issue: https://issues.apache.org/jira/browse/SOLR-3013
Now I am able to use this analyzer for English texts and filter out unwanted
token types :-)

<fieldType name="uima_nouns_en" class="solr.TextField" 
positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.UIMATypeAwareAnnotationsTokenizerFactory"
      descriptorPath="/uima/AggregateSentenceAE.xml" 
tokenType="org.apache.uima.TokenAnnotation"
      featurePath="posTag"/>
    <filter class="solr.TypeTokenFilterFactory" types="/uima/stoptypes.txt" />
  </analyzer>
</fieldType>
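
Since the tokenizer copies the posTag feature into each token's type attribute,
the file referenced by "types" is just a plain list of the tags to drop, one per
line. A hypothetical excerpt of /uima/stoptypes.txt, assuming the English tagger
emits Penn Treebank tags and everything except the noun tags (NN, NNS, NNP, NNPS)
should be removed:

CC
DT
IN
JJ
PRP
RB
VB
VBD
VBG
VBN
VBP
VBZ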

Open issue -> how do I set the ModelFile for the Tagger to
"german/TuebaModel.dat"?



OpenNLP:

And a modified patch for https://issues.apache.org/jira/browse/LUCENE-2899 is
now working with Solr 4.1 :-)

<fieldType name="nlp_nouns_de" class="solr.TextField" 
positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.OpenNLPTokenizerFactory" 
tokenizerModel="opennlp/de-token.bin" />
      <filter class="solr.OpenNLPFilterFactory" 
posTaggerModel="opennlp/de-pos-maxent.bin" />
      <filter class="solr.FilterPayloadsFilterFactory" 
payloadList="NN,NNS,NNP,NNPS,FM" keepPayloads="true"/>
      <filter class="solr.StripPayloadsFilterFactory"/>
  </analyzer>
</fieldType>
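
For completeness, the two field types would then be wired into schema.xml with
something like this (the field names are only examples):

<field name="albody_en" type="uima_nouns_en" indexed="true" stored="false"/>
<field name="albody_de" type="nlp_nouns_de" indexed="true" stored="false"/>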



Any hints on which library is more accurate at noun tagging?
Any known performance or memory issues? (I hit some OOMs here while testing with
1GB via the Analysis screen of the admin GUI.)


Regards,

Kai Gülzau




-----Original Message-----
From: Kai Gülzau [mailto:kguel...@novomind.com] 
Sent: Thursday, January 31, 2013 2:19 PM
To: solr-user@lucene.apache.org
Subject: Indexing nouns only - UIMA vs. OpenNLP

Hi,

I am stuck trying to index only the nouns of German and English texts
(very similar to http://wiki.apache.org/solr/OpenNLP#Full_Example).


My first try was to use UIMA with the HMMTagger:

<processor class="org.apache.solr.uima.processor.UIMAUpdateRequestProcessorFactory">
  <lst name="uimaConfig">
    <lst name="runtimeParameters"></lst>
    <str name="analysisEngine">/org/apache/uima/desc/AggregateSentenceAE.xml</str>
    <bool name="ignoreErrors">false</bool>
    <lst name="analyzeFields">
      <bool name="merge">false</bool>
      <arr name="fields"><str>albody</str></arr>
    </lst>
    <lst name="fieldMappings">
      <lst name="type">
        <str name="name">org.apache.uima.SentenceAnnotation</str>
        <lst name="mapping">
          <str name="feature">coveredText</str>
          <str name="field">albody2</str>
        </lst>
      </lst>
    </lst>
  </lst>
</processor>

- But how do I set the ModelFile to use the German corpus?
- What about language identification?
-- How do I use the right corpus/tagger based on the language?
-- Should this be done in UIMA (how?) or via the solr contrib/langid field mapping (see the sketch after this list)?
- How do I remove non-nouns from the annotated field?
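
For the langid route, a rough sketch of what I have in mind (untested; field
names and parameter values are only examples): the language detector looks at
albody, stores the detected language in language_s, and with langid.map="true"
maps the content to albody_en / albody_de, which could then use
language-specific field types:

<processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
  <lst name="defaults">
    <str name="langid.fl">albody</str>
    <str name="langid.langField">language_s</str>
    <str name="langid.whitelist">de,en</str>
    <bool name="langid.map">true</bool>
    <str name="langid.fallback">en</str>
  </lst>
</processor>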


My second try is to use OpenNLP and to apply the patch from
https://issues.apache.org/jira/browse/LUCENE-2899,
but the patch seems to be a bit out of date.
Currently I am trying to get it to work with Solr 4.1.


Any pointers appreciated :-)

Regards,

Kai Gülzau
