For a rough estimate, square the number of unique terms to get the total number 
of tokens. Vocabulary size usually grows as roughly the square root of the 
corpus size in words.
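
As a minimal sketch of that rule of thumb, assuming vocabulary size V relates 
to corpus size N as V = k * N^beta: the function name and the defaults k = 1 
and beta = 0.5 are my own illustrative assumptions, not anything Solr reports, 
and with those defaults the formula reduces to plain squaring.

```python
def estimate_total_tokens(distinct_terms: int, k: float = 1.0, beta: float = 0.5) -> int:
    """Rough token-count estimate from the number of distinct terms.

    Inverts the assumed relation V = k * N**beta to get N = (V / k)**(1 / beta).
    The defaults (k=1.0, beta=0.5) are assumptions that reproduce the
    "square the number of unique terms" rule of thumb; real corpora vary.
    """
    if distinct_terms < 0:
        raise ValueError("distinct_terms must be non-negative")
    if k <= 0 or beta <= 0:
        raise ValueError("k and beta must be positive")
    return round((distinct_terms / k) ** (1.0 / beta))


if __name__ == "__main__":
    # Take the distinct-term count shown in the /admin statistics and plug it in.
    print(estimate_total_tokens(250_000))
```

With the defaults, 250,000 distinct terms comes out at about 62.5 billion 
tokens. Treat this as an order-of-magnitude figure only; it is not a count.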

wunder

On Aug 9, 2012, at 7:20 AM, tech.vronk wrote:

> Hello,
> 
> I wonder how to figure out the total token count in a collection (per index), 
> i.e. the size of a corpus/collection measured in tokens.
> 
> The statistics in /admin give the number of distinct terms,
> and the frequency list per index shows the number of documents containing a 
> given term. So even if I summed all the frequencies, I wouldn't get the 
> result I need.
> 
> Thank you.
> 
> best,
> Matej




