Thank you Ahmet, this is exactly what I was looking for. Looks like the shingle filter can produce 3+-gram terms as well, that's great. I'm going to try this with both western and CJK language tokenizers and see how it turns out.
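
For reference, here is a rough Python sketch of what I understand word shingling to produce, so I have something to compare the Solr output against. It is not Solr's implementation: the function name `shingles` and its parameters are my own and only loosely mirror the maxShingleSize and outputUnigrams attributes, and I'm assuming the input is already tokenized (which is the part that will differ between the western and CJK tokenizers).

```python
def shingles(tokens, max_size=2, min_size=2, output_unigrams=True, sep=" "):
    """Return word n-grams (shingles) built from a list of tokens.

    Hypothetical helper for illustration only; parameter names are
    assumptions loosely modeled on the ShingleFilterFactory attributes.
    """
    out = []
    for i in range(len(tokens)):
        # Optionally keep the original single token (the 1-gram).
        if output_unigrams:
            out.append(tokens[i])
        # Emit every shingle of min_size..max_size words starting at position i.
        for n in range(min_size, max_size + 1):
            if i + n <= len(tokens):
                out.append(sep.join(tokens[i:i + n]))
    return out


if __name__ == "__main__":
    words = ["please", "divide", "this", "sentence"]
    # 2-grams and 3-grams only, no unigrams.
    for term in shingles(words, max_size=3, output_unigrams=False):
        print(term)
```

With max_size=3 and output_unigrams=False, the four-word example yields three 2-grams and two 3-grams, which is the kind of term list I'd expect to see tf and df values for.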

On Tue, Feb 9, 2010 at 5:07 PM, Ahmet Arslan <iori...@yahoo.com> wrote:
>> I've been looking at the Solr TermVectorComponent
>> (http://wiki.apache.org/solr/TermVectorComponent) and it seems to have
>> something similar to this, but it looks to me like this is a component
>> that is processed at query time (?) and is limited to 1-gram terms.
>
> If you use <filter class="solr.ShingleFilterFactory" maxShingleSize="2"
> outputUnigrams="false"/> it can give you info about 2-gram terms.
>
>> Also, the tf/idf scores are a little different as they come back in
>> integer values as separate components.
>
> In wiki, example output only tf and df values - which are integer - are
> displayed. You can calculate tf*idf (double) with these parameters:
>
> &qt=tvrh&tv=true&fl=yourFieldName&tv.tf=true&tv.df=true&tv.tf_idf=true