Re: Total number of terms in an index?

Michael McCandless Tue, 27 Jul 2010 11:16:04 -0700

In trunk (flex) you can ask each segment for its unique term count.

But to compute the unique term count across all segments is
necessarily costly (requires merging them, to de-dup), as Hoss
described.


Mike

On Tue, Jul 27, 2010 at 12:27 PM, Burton-West, Tom <tburt...@umich.edu> wrote:
> Hi Jason,
>
> Are you looking for the total number of unique terms or total number of term 
> occurrences?
>
> Checkindex reports both, but does a bunch of other work so is probably not 
> the fastest.
>
> If you are looking for total number of term occurrences, you might look at 
> contrib/org/apache/lucene/misc/HighFreqTerms.java.
>
> If you are just looking for the total number of unique terms, I wonder if 
> there is some low level API that would allow you to just access the in-memory 
> representation of the tii file and then multiply the number of terms in it by 
> your indexDivisor (default 128). I haven't dug in to the code so I don't 
> actually know how the tii file gets loaded into a data structure in memory.  
> If there is api access, it seems like this might be the quickest way to get 
> the number of unique terms.  (Of course you would have to do this for each 
> segment).
>
> Tom
> -----Original Message-----
> From: Chris Hostetter [mailto:hossman_luc...@fucit.org]
> Sent: Monday, July 26, 2010 8:39 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Total number of terms in an index?
>
>
> : Sorry, like the subject, I mean the total number of terms.
>
> it's not stored anywhere, so the only way to fetch it is to actually
> iteate all of the terms and count them (that's why LukeRequestHandler is
> slow slow to compute this particular value)
>
> If i remember right, someone mentioned at one point that flex would let
> you store data about stuff like this in your index as part of the segment
> writing, but frankly i'm still not sure how that iwll help -- because you
> unless your index is fully optimized, you still have to iterate the terms
> in each segment to 'de-dup' them.
>
>
> -Hoss
>
>

Re: Total number of terms in an index?

Reply via email to