In trunk (flex) you can ask each segment for its unique term count.

But computing the unique term count across all segments is
necessarily costly (it requires merging them to de-dup), as Hoss
described.
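
To illustrate why the per-segment counts cannot simply be added up, here is a minimal sketch in plain Python. It does not use any Lucene API: each segment is modeled as a sorted list of its unique terms (an assumption made for illustration), and the function name count_unique_terms is hypothetical. The merge walks all segments in sorted order and counts a term only the first time it appears.

```python
import heapq

def count_unique_terms(segments):
    """Count distinct terms across segments.

    Each segment is assumed to be a sorted list of that segment's
    unique terms (a stand-in for a per-segment term dictionary).
    """
    count = 0
    previous = None
    # heapq.merge yields the terms of all segments in one sorted stream,
    # so duplicates from different segments arrive next to each other.
    for term in heapq.merge(*segments):
        if term != previous:
            count += 1
            previous = term
    return count

# Toy data: three segments that share some terms.
segments = [
    ["apple", "banana", "cherry"],
    ["banana", "cherry", "date"],
    ["apple", "elderberry"],
]

# Summing per-segment counts overcounts terms present in several segments.
per_segment_total = sum(len(s) for s in segments)
unique_total = count_unique_terms(segments)
print(per_segment_total, unique_total)
```

The sum of the per-segment counts is only an upper bound. The exact figure needs a pass over every term in every segment, which is where the cost comes from.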

Mike

On Tue, Jul 27, 2010 at 12:27 PM, Burton-West, Tom <tburt...@umich.edu> wrote:
> Hi Jason,
>
> Are you looking for the total number of unique terms or total number of term 
> occurrences?
>
> CheckIndex reports both, but does a bunch of other work so is probably not
> the fastest.
>
> If you are looking for total number of term occurrences, you might look at 
> contrib/org/apache/lucene/misc/HighFreqTerms.java.
>
> If you are just looking for the total number of unique terms, I wonder if 
> there is some low level API that would allow you to just access the in-memory 
> representation of the tii file and then multiply the number of terms in it by 
> your indexDivisor (default 128). I haven't dug in to the code so I don't 
> actually know how the tii file gets loaded into a data structure in memory.  
> If there is api access, it seems like this might be the quickest way to get 
> the number of unique terms.  (Of course you would have to do this for each 
> segment).
>
> Tom
> -----Original Message-----
> From: Chris Hostetter [mailto:hossman_luc...@fucit.org]
> Sent: Monday, July 26, 2010 8:39 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Total number of terms in an index?
>
>
> : Sorry, like the subject, I mean the total number of terms.
>
> it's not stored anywhere, so the only way to fetch it is to actually
> iterate all of the terms and count them (that's why LukeRequestHandler is
> so slow to compute this particular value)
>
> If i remember right, someone mentioned at one point that flex would let
> you store data about stuff like this in your index as part of the segment
> writing, but frankly i'm still not sure how that will help -- because
> unless your index is fully optimized, you still have to iterate the terms
> in each segment to 'de-dup' them.
>
>
> -Hoss
>
>
