Hello,

Earlier, I was trying to retrieve the total token count per index:
http://lucene.472066.n3.nabble.com/how-to-retrieve-total-token-count-per-collection-index-td4000161.html

Now I would like to get a token (word) count within a document set (the result of a query),
both for the matching word and as the sum of all tokens in the matching documents.

The ultimate goal is to compute relative frequencies of terms on a per-token basis instead of a per-article basis.

So if I search for the word "Haus" within a subcollection (defined by a separate query) and the word appears 2 times in matching doc A and 5 times in matching doc B, I need a hit count of 7, not 2.

In addition, if the subcollection contains the documents
A with 300 tokens (i.e. running words, not distinct terms)
B with 100 tokens
C with 50 tokens

I also need this second sum, i.e. 450.
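
To make the arithmetic concrete, here is a minimal Python sketch of the two sums and the resulting relative frequency for the example above. The hit counts for A and B and the token counts are the ones given; that C contains no occurrence of "Haus" is my assumption, and the variable names are purely illustrative.

```python
# Per-document data from the example; C's hit count of 0 is an assumption.
docs = {
    "A": {"hits": 2, "tokens": 300},
    "B": {"hits": 5, "tokens": 100},
    "C": {"hits": 0, "tokens": 50},
}

hit_count = sum(d["hits"] for d in docs.values())       # occurrences of "Haus"
total_tokens = sum(d["tokens"] for d in docs.values())  # all running words
relative_frequency = hit_count / total_tokens           # token-based, not per article

print(hit_count, total_tokens, round(relative_frequency, 4))  # 7 450 0.0156
```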

I plan to get the second number by
preprocessing each document to count its tokens,
storing that number in a separate field,
and then applying the StatsComponent,
which will give me the sum for a given query/subcollection.
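
The preprocessing step could look roughly like the following Python sketch. The field name token_count and the regex tokenizer are assumptions for illustration only; for the numbers to be meaningful, the count would have to follow the same tokenization as the analyzer of the text field (text_de here), which a simple regex does not guarantee.

```python
import re

# Assumption: a token is a maximal run of word characters. The real
# text_de analyzer may split differently (e.g. on hyphens or compounds).
TOKEN_RE = re.compile(r"\w+", re.UNICODE)

def count_tokens(text):
    """Return the number of running words (tokens, not distinct terms)."""
    return len(TOKEN_RE.findall(text))

def add_token_count(doc, text_field="inhalt", count_field="token_count"):
    """Return a copy of doc with the token count stored in a separate
    numeric field (count_field is a hypothetical field name)."""
    enriched = dict(doc)
    enriched[count_field] = count_tokens(doc.get(text_field, ""))
    return enriched

doc = add_token_count({"id": "A", "inhalt": "Das Haus steht neben dem Haus."})
print(doc["token_count"])  # 6
```

With such a field indexed as a numeric type, a request with stats=true&stats.field=token_count and the subcollection as fq should then report its sum.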

For the first number, I could use the termfreq() function,
but that gives me only the term frequency per document.
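
Summing these per-document values on the client would look roughly like the sketch below. It assumes the response has been parsed from wt=json into a dict, that each document carries its frequency under a key (here score, as with a {!func} query and fl=score), and that all matching documents have been fetched, page by page if necessary. The helper name is hypothetical. As far as I know, tf() returns the Similarity's tf factor rather than the raw count, so termfreq() would be the function whose values should be summed.

```python
def sum_term_frequency(pages, key="score"):
    """Sum a per-document value over all pages of a parsed Solr response.

    pages: iterable of dicts shaped like the wt=json output, i.e.
           {"response": {"docs": [{"score": 2.0}, ...]}}
    """
    total = 0.0
    for page in pages:
        for doc in page["response"]["docs"]:
            total += float(doc.get(key, 0.0))
    return total

# Hand-written sample data mirroring the example: doc A has 2 hits, doc B 5.
pages = [
    {"response": {"docs": [{"score": 2.0}]}},
    {"response": {"docs": [{"score": 5.0}]}},
]
print(sum_term_frequency(pages))  # 7.0
```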

So, before I iterate over the whole result to sum it up,
I wonder whether the StatsComponent could also compute the sum over a dynamically computed value (the result of the function) instead of an indexed field.
I tried this:
/solr/select/?fq=docsrc:falter&q={!func}tf(inhalt,'haus')&stats=true&stats.field=score&rows=10&indent=true&fl=score&debugQuery=true

but got the error:
<str name="msg">Field type text_de{class=org.apache.solr.schema.TextField,analyzer=org.apache.solr.analysis.TokenizerChain,args={positionIncrementGap=100}} is not currently supported</str>

Or is there any other way?

If I understand it correctly, none of tf(), idf(), or sttf() would be of any help here either.

Thanks in advance

best,
matej

