RE: How to read values of a field efficiently

Martin Grotzke Mon, 20 Aug 2007 02:52:03 -0700

On Sun, 2007-08-19 at 21:39 +0200, Ard Schrijvers wrote:
> > On Mon, 2007-07-30 at 00:30 -0700, Chris Hostetter wrote:
> > > : Is it possible to get the values from the ValueSource (or from
> > > : getFieldCacheCounts) sorted by its natural order (from lowest to
> > > : highest values)?
> > > 
> > > well, an inverted term index is already a data structure listing terms
> > > from lowest to highest and the associated documents -- so if you want to
> > > iterate from low to high between a range and find matching docs you should
> > > just use the TermEnum
> > Ok. Unfortunately I don't see how I can get a TermEnum for a specific
> > field (e.g. "price")... I tried
> > 
> > TermEnum te = searcher.getReader().terms(new Term(field, ""));
> > 
> > but this also returns terms for several other fields.
> 
> correct, see 
> http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/index/IndexReader.html#terms()
> 
> > Is it possible at all to get a TermEnum for a specific field?
> 
> AFAIK not directly. Normally, I use something like:
> 
> TermEnum terms = searcher.getReader().terms(new Term(field, ""));
>       while (terms.term() != null && terms.term().field() == field){
>               //do things                               
>               terms.next();
>       }
Now I implemented it like this:


    String startTerm = prefix==null ? "" : ft.toInternal(prefix);
    TermEnum te = searcher.getReader().terms(new Term(field, startTerm));
    final int[] prices = new int[docs.size()];
    int i = 0;
    int skipped = 0;
    while( te.next() ) {
        final Term term = te.term();
        if ( term == null || !term.field().equals( field ) ) {
            skipped++;
            continue;
        }
        final String termText = term.text();
        int count = searcher.numDocs(new TermQuery(term), docs);
        int value = (int) NumberUtils.SortableStr2float(termText);
        for( int j = 0; j < count; j++ ) {
            prices[i++] = value;
        }
    }

Unfortunately this takes ~1.5 sec in my case (~2M docs, result/docset contains
1900280 results, 1026683 terms skipped because they didn't match the field).

Is there anything which could be optimized / which is wrong?
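
Two things in the loop above look worth checking. Terms are enumerated in sorted order (field first, then text), so once a term from another field shows up, no later term can belong to the requested field, and the loop could stop instead of skipping. Also, as far as I understand the API, the enum returned by terms(Term) is already positioned on the first matching term, so calling next() before reading term() would miss that term. The sketch below illustrates both points with plain Java collections; the Term class and the TreeSet are hypothetical stand-ins for the Lucene term dictionary, not the real API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Simplified model of a term dictionary: terms sorted by field, then by text. */
class FieldTermScan {

    /** Hypothetical minimal term type, ordered by field first and text second. */
    static final class Term implements Comparable<Term> {
        final String field;
        final String text;

        Term(String field, String text) {
            this.field = field;
            this.text = text;
        }

        @Override
        public int compareTo(Term other) {
            int c = field.compareTo(other.field);
            return c != 0 ? c : text.compareTo(other.text);
        }
    }

    /** Number of terms examined by the last call (kept only for illustration). */
    static int visited;

    /** Collects the term texts of one field, starting at startText. */
    static List<String> textsForField(TreeSet<Term> dictionary, String field, String startText) {
        List<String> texts = new ArrayList<>();
        visited = 0;
        // tailSet(..., true) begins AT the first term >= (field, startText),
        // so the first matching term is processed rather than skipped.
        for (Term term : dictionary.tailSet(new Term(field, startText), true)) {
            visited++;
            if (!term.field.equals(field)) {
                break; // sorted order: every remaining term belongs to a later field
            }
            texts.add(term.text);
        }
        return texts;
    }
}
```

With a dictionary holding terms for "cat", "price" and "title", a scan for "price" touches the price terms plus a single term of the following field, however many other terms exist.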

If not, I would have to check which other way I could go (based on ValueSource
or something else)...
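
For comparison, the docId-based direction (the "reverse inverted index" idea from the quoted FieldCache discussion) could be sketched roughly as follows: look up each document's value in an array indexed by docId, then sort the collected values. This is only a sketch under assumptions: valueByDoc is a hypothetical array standing in for whatever a FieldCache or ValueSource would provide, and how it gets populated is left out.

```java
import java.util.Arrays;

/** Sketch: per-document value lookup followed by a sort, using plain arrays. */
class DocValueSort {

    /**
     * @param valueByDoc hypothetical array indexed by docId (the role a field cache plays)
     * @param docIds     ids of the documents in the result set
     * @return the values of those documents, lowest to highest
     */
    static int[] sortedValues(int[] valueByDoc, int[] docIds) {
        int[] values = new int[docIds.length];
        for (int i = 0; i < docIds.length; i++) {
            values[i] = valueByDoc[docIds[i]]; // constant-time lookup per document
        }
        Arrays.sort(values); // natural order
        return values;
    }
}
```

The cost here depends on the size of the result set (one lookup per document plus the sort) rather than on the number of terms in the index, so which approach is faster would have to be measured.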

Thanx && cheers,
Martin


> 
> > 
> > Then if I had this TermEnum, how can I check if a Term is in my
> > DocSet? In other words, I would like to read Terms for a specific
> > field from my DocSet - so that I could determine all price terms
> > for my DocSet.
> 
> Is your DocSet some sort of filter? if so, in your while loop you can fill a 
> new Filter, like
> 
> BitSet docFilter = new BitSet(reader.maxDoc());
> 
> and in the while loop:
> 
>       docs.seek(terms);
>       while (docs.next()) {
>          docFilter.set(docs.doc());
>       }
> 
> If your DocSet is not a BitSet you might be able to construct one for it.
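
The BitSet idea above can be sketched in isolation like this. It is a minimal illustration, assuming both the per-term documents and the DocSet can be expressed as java.util.BitSet; the int[] postings arrays are hypothetical stand-ins for what TermDocs would return.

```java
import java.util.BitSet;

/** Sketch: collect the documents of some terms into a BitSet and intersect it with a doc set. */
class TermDocFilter {

    /** Sets one bit per document id in the given postings list. */
    static void collect(BitSet filter, int[] postings) {
        for (int doc : postings) {
            filter.set(doc);
        }
    }

    /** Counts the documents present in both bit sets without modifying either. */
    static int countInDocSet(BitSet termDocs, BitSet docSet) {
        BitSet both = (BitSet) termDocs.clone();
        both.and(docSet); // keep only documents that are also in the doc set
        return both.cardinality();
    }
}
```

Cloning before and() keeps the collected filter reusable for further terms.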
> 
> Regards Ard
> 
> > 
> > Is there a way to achieve this?
> > 
> > Thanx in advance,
> > cheers,
> > Martin
> > 
> > 
> > > >  -- the whole point of the FieldCache (and
> > > > FieldCacheSource) is to have a "reverse inverted index" so you can quickly
> > > > fetch the indexed value if you know the docId.
> > > > 
> > > > perhaps you should elaborate a little more on what it is you are trying to
> > > > do so we can help you figure out how to do it more efficiently ... i know
> > > > you mentioned computing price ranges in your first message ... but you
> > > > also didn't post any clear code about that part of your problem, just that
> > > > the *other* part of your code that iterated over every doc was too slow
> > > > ... perhaps you shouldn't be iterating over every doc to figure out your
> > > > ranges .. perhaps you can iterate over the terms themselves?
> > > 
> > > 
> > > hang on ... rereading your first message i just noticed something i
> > > definitely didn't spot before...
> > > 
> > > >> Fairly long: getFieldCacheCounts for the cat field takes ~70 ms
> > > >> for the second request, while reading prices takes ~600 ms.
> > > 
> > > > ...i clearly missed this, and fixated on your assertion that your reading
> > > > of field values took longer than the stock methods -- but you're not just
> > > > comparing the time needed by different methods, you're also timing
> > > > different fields.
> > > > 
> > > > this actually makes a lot of sense since there are probably a lot fewer
> > > > unique values for the cat field, so there are a lot fewer discrete values
> > > > to deal with when computing counts.
> > > 
> > > 
> > > 
> > > 
> > > -Hoss
> > > 
> > -- 
> > Martin Grotzke
> > http://www.javakaffee.de/blog/
> > 
> 
-- 
Martin Grotzke
http://www.javakaffee.de/blog/
