> Thank you.
>
> While interesting, what I'm really after is a programmatic way to get at
> multi-word terms and their frequencies from a given document.
>
> Is this possible?
>
What do you mean by a programmatic way? Do you mean without indexing? And by
multi-word terms you mean phrases, right? Like "tap water"?
If so, you can use this field type to index your documents:
<fieldType name="shingle_text" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.LowerCaseTokenizerFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="2"
            outputUnigrams="false"/>
  </analyzer>
</fieldType>
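In case it helps to see what that analyzer chain emits, here is a rough
Python sketch that mimics it outside of Solr. It is only an approximation
under my own assumptions: the regex stands in for LowerCaseTokenizer (split
on non-letters, lowercase), and bigram_counts is a hypothetical helper name,
not anything Solr provides.

```python
import re
from collections import Counter

def bigram_counts(text):
    """Count two-word shingles, roughly as the shingle_text analyzer emits them."""
    # Approximate LowerCaseTokenizer: lowercase, then keep runs of letters only.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    # maxShingleSize=2 with outputUnigrams=false keeps only adjacent word pairs.
    shingles = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return Counter(shingles)

counts = bigram_counts("Tap water is safe. Tap water is cheap.")
for term, freq in counts.most_common(3):
    print(term, freq)
```

Note that the shingles run across sentence boundaries ("safe tap" shows up
once above), because the tokenizer knows nothing about sentences.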
If you then register TermsComponent in solrconfig.xml like this:
<searchComponent name="termsComponent"
                 class="org.apache.solr.handler.component.TermsComponent"/>
<requestHandler name="/terms"
                class="org.apache.solr.handler.component.SearchHandler">
  <lst name="defaults">
    <bool name="terms">true</bool>
    <str name="terms.fl">shingle_text_field</str>
  </lst>
  <arr name="components">
    <str>termsComponent</str>
  </arr>
</requestHandler>
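then http://localhost:8983/solr/terms will give you the multi-word terms
sorted by frequency (shingle_text_field stands for whatever field you declare
with the shingle_text type). Note that TermsComponent counts the number of
documents containing each term across the whole index, not the occurrences
within a single document. To get the term frequencies of multi-word terms for
a particular document, you can use TermVectorComponent; the field needs
termVectors="true" for that.
Additionally, admin/schema.jsp shows the top n terms of a field if you want.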
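
If you want to read the /terms output from code, something like the following
might be a starting point. It is a sketch, not a tested client: it assumes
the handler is at the URL above, that you request wt=json, and that the
response has the layout the Solr reference guide shows for recent releases
(a "terms" object mapping the field name to a flat [term, count, term,
count, ...] list). Older releases may nest this differently, so check your
actual output. terms_url, parse_terms and fetch_terms are hypothetical
helper names.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

FIELD = "shingle_text_field"  # assumed field name, taken from the config above

def terms_url(base="http://localhost:8983/solr/terms", field=FIELD, limit=10):
    """Build a TermsComponent request URL that asks for JSON output."""
    params = {"terms.fl": field, "terms.limit": limit, "wt": "json"}
    return base + "?" + urlencode(params)

def parse_terms(response, field=FIELD):
    """Turn the flat [term, count, term, count, ...] list into (term, count) pairs."""
    flat = response["terms"][field]
    return list(zip(flat[0::2], flat[1::2]))

def fetch_terms(field=FIELD, limit=10):
    """Query a running Solr instance and return (term, count) pairs."""
    with urlopen(terms_url(field=field, limit=limit)) as resp:
        return parse_terms(json.load(resp), field)
```

With Solr running, fetch_terms(limit=20) would return the top 20 shingles
with their document counts.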