On Sun, 2007-08-19 at 21:39 +0200, Ard Schrijvers wrote:
> > On Mon, 2007-07-30 at 00:30 -0700, Chris Hostetter wrote:
> > > : Is it possible to get the values from the ValueSource (or from
> > > : getFieldCacheCounts) sorted by its natural order (from lowest to
> > > : highest values)?
> > >
> > > well, an inverted term index is already a data structure listing terms
> > > from lowest to highest and the associated documents -- so if you want to
> > > iterate from low to high between a range and find matching docs you should
> > > just use the TermEnum
> > Ok. Unfortunately I don't see how I can get a TermEnum for a specific
> > field (e.g. "price")... I tried
> >
> >   TermEnum te = searcher.getReader().terms(new Term(field, ""));
> >
> > but this also returns terms for several other fields.
>
> correct, see
> http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/index/IndexReader.html#terms()
>
> > Is it possible at all to get a TermEnum for a specific field?
>
> AFAIK not directly. Normally, I use something like:
>
> TermEnum terms = searcher.getReader().terms(new Term(field, ""));
> while (terms.term() != null && terms.term().field() == field){
>     //do things
>     terms.next();
> }

Now I implemented it like this:

    String startTerm = prefix==null ? "" : ft.toInternal(prefix);
    TermEnum te = searcher.getReader().terms(new Term(field, startTerm));
    final int[] prices = new int[docs.size()];
    int i = 0;
    int skipped = 0;
    while( te.next() ) {
        final Term term = te.term();
        if ( term == null || !term.field().equals( field ) ) {
            skipped++;
            continue;
        }
        final String termText = term.text();
        int count = searcher.numDocs(new TermQuery(term), docs);
        int value = (int) NumberUtils.SortableStr2float(termText);
        for( int j = 0; j < count; j++ ) {
            prices[i++] = value;
        }
    }

Unfortunately this takes ~1.5 sec in my case (~2M docs, result/docset
contains 1900280 results, 1026683 terms skipped because they didn't match
the field).

Is there anything which could be optimized / which is wrong?
If not I would have to check which other way I could go (based on
ValueSource or something else)...

Thanx && cheers,
Martin

> >
> > Then if I had this TermEnum, how can I check if a Term is in my
> > DocSet? In other words, I would like to read Terms for a specific
> > field from my DocSet - so that I could determine all price terms
> > for my DocSet.
>
> Is your DocSet some sort of filter? if so, in your while loop you can fill a
> new Filter, like
>
> BitSet docFilter = new BitSet(reader.maxDoc());
>
> and in the while loop:
>
> docs.seek(terms);
> while (docs.next()) {
>     docFilter.set(docs.doc());
> }
>
> If your DocSet is not a BitSet you might be able to construct one for it,
>
> Regards Ard
>
> >
> > Is there a way to achieve this?
> >
> > Thanx in advance,
> > cheers,
> > Martin
> >
> >
> > > -- the whole point of the FieldCache (and FieldCacheSource) is to
> > > have a "reverse inverted index" so you can quickly
> > > fetch the indexed value if you know the docId.
> > >
> > > perhaps you should elaborate a little more on what it is you are trying to
> > > do so we can help you figure out how to do it more efficiently ... i know
> > > you mentioned computing price ranges in your first message ...
> > > but you also didn't post any clear code about that part of your
> > > problem, just that the *other* part of your code that iterated over
> > > every doc was too slow
> > > ... perhaps you shouldn't be iterating over every doc to figure out your
> > > ranges .. perhaps you can iterate over the terms themselves?
> > >
> > >
> > > hang on ... rereading your first message i just noticed something i
> > > definitely didn't spot before...
> > >
> > > >> Fairly long: getFieldCacheCounts for the cat field takes ~70 ms
> > > >> for the second request, while reading prices takes ~600 ms.
> > >
> > > ...i clearly missed this, and fixated on your assertion that your reading
> > > of field values took longer than the stock methods -- but you're not just
> > > comparing the time needed by different methods, you're also timing
> > > different fields.
> > >
> > > this actually makes a lot of sense since there are probably a lot fewer
> > > unique values for the cat field, so there are a lot fewer discrete values
> > > to deal with when computing counts.
> > >
> > >
> > > -Hoss
> > >
> > --
> > Martin Grotzke
> > http://www.javakaffee.de/blog/
> >
--
Martin Grotzke
http://www.javakaffee.de/blog/
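
The term scan discussed in this thread relies on the term dictionary being
ordered by field first and term text second, so all terms of one field sit
next to each other. Below is a minimal sketch of that idea in plain Java. It
does not use Lucene: a TreeMap stands in for the term dictionary, and
FieldTermScan, TermKey and expandField are hypothetical names made up for
illustration. Term texts are zero-padded decimal strings here, an assumption
standing in for Solr's sortable number encoding. The scan starts at the first
key >= (field, startText), processes that key too, and stops at the first key
of another field instead of skipping over it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model only: a TreeMap stands in for a sorted term dictionary. */
final class FieldTermScan {

    /** Hypothetical stand-in for a (field, text) term; sorts by field, then text. */
    static final class TermKey implements Comparable<TermKey> {
        final String field;
        final String text;

        TermKey(String field, String text) {
            this.field = field;
            this.text = text;
        }

        @Override
        public int compareTo(TermKey other) {
            int byField = field.compareTo(other.field);
            return byField != 0 ? byField : text.compareTo(other.text);
        }
    }

    /**
     * Walks the terms of one field in ascending order and repeats each
     * parsed value once per matching document (the map value is the count).
     */
    static List<Integer> expandField(TreeMap<TermKey, Integer> dict,
                                     String field, String startText) {
        List<Integer> out = new ArrayList<>();
        TermKey start = new TermKey(field, startText);
        // tailMap(start, true) begins at the first key >= start, inclusive.
        for (Map.Entry<TermKey, Integer> e : dict.tailMap(start, true).entrySet()) {
            TermKey key = e.getKey();
            if (!key.field.equals(field)) {
                break; // keys are sorted by field, so nothing further can match
            }
            int value = Integer.parseInt(key.text); // assumes zero-padded decimal text
            for (int j = 0; j < e.getValue(); j++) {
                out.add(value);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        TreeMap<TermKey, Integer> dict = new TreeMap<>();
        dict.put(new TermKey("cat", "books"), 2);
        dict.put(new TermKey("price", "0010"), 1);
        dict.put(new TermKey("price", "0020"), 3);
        dict.put(new TermKey("title", "lucene"), 5);
        System.out.println(expandField(dict, "price", ""));
    }
}
```

In this toy model the "title" entry ends the loop on its first visit, so the
number of keys touched depends on the size of the requested field and not on
the size of the whole dictionary.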