Spooky that you see incorrect results!  The code looks correct.  What
are the specifics on when it produces an invalid result?

Also spooky that you see it running slower -- how much slower?  Did
you rebuild the index in 4.x (if not, you are using the preflex
codec)?  And is the index otherwise identical?

You could improve perf by not using SolrIndexSearcher.numDocs.  I.e.,
you don't need the count; you just need to know if it's > 0.  So you
could make your own loop that breaks out on the first docID in common.
You could also stick w/ BytesRef the whole time (only call
.utf8ToString() at the end, on the first/last), though that's
presumably a net-net tiny cost.
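Roughly, the break-out loop looks like this (a minimal sketch in plain Java, not the real Lucene/Solr API: a hypothetical `postings` array stands in for a term's docIDs and a BitSet stands in for the filter DocSet):

```java
import java.util.BitSet;

public class FirstMatch {
    // Returns true as soon as any docID in the term's postings is also
    // in the filter set -- we never count the full intersection.
    static boolean anyMatch(int[] postings, BitSet docs) {
        for (int docID : postings) {
            if (docs.get(docID)) {
                return true; // bail out on the first docID in common
            }
        }
        return false;
    }

    public static void main(String[] args) {
        BitSet docs = new BitSet();
        docs.set(5);
        docs.set(42);
        System.out.println(anyMatch(new int[]{1, 3, 42}, docs)); // true
        System.out.println(anyMatch(new int[]{2, 4}, docs));     // false
    }
}
```

In the real thing you'd pull the postings per term from the TermsEnum and test membership against your DocSet, but the point is the early `return true` -- numDocs has to visit every matching doc just to produce a count you then throw away.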

But we should still dig into why numDocs is slower in 4.x; that's
unexpected.  Yonik, any ideas?  I'm not familiar with this part of
Solr...

Mike

On Mon, Aug 23, 2010 at 2:38 AM, Ryan McKinley <ryan...@gmail.com> wrote:
> I have a function that works well in 3.x, but when I tried to
> re-implement it in 4.x it runs very very slow (~20ms vs ~45s on an
> index w/ ~100K items).
>
> Big picture, I am trying to calculate a bounding box for items that
> match the query.  To calculate this, I have two fields bboxNS, and
> bboxEW that get filled with the min and max values for that doc.  To
> get the bounding box, I just need the first matching term in the index
> and the last matching term.
>
> In 3.x the code looked like this:
>
> public class FirstLastMatchingTerm
> {
>  String first = null;
>  String last = null;
>
>  public static FirstLastMatchingTerm read(SolrIndexSearcher searcher,
> String field, DocSet docs) throws IOException
>  {
>    FirstLastMatchingTerm firstLast = new FirstLastMatchingTerm();
>    if( docs.size() > 0 ) {
>      IndexReader reader = searcher.getReader();
>      TermEnum te = reader.terms(new Term(field,""));
>      do {
>        Term t = te.term();
>        if( null == t || !t.field().equals(field) ) {
>          break;
>        }
>
>        if( searcher.numDocs(new TermQuery(t), docs) > 0 ) {
>          firstLast.last = t.text();
>          if( firstLast.first == null ) {
>            firstLast.first = firstLast.last;
>          }
>        }
>      }
>      while( te.next() );
>    }
>    return firstLast;
>  }
> }
>
>
> In 4.x, I tried:
>
> public class FirstLastMatchingTerm
> {
>  String first = null;
>  String last = null;
>
>  public static FirstLastMatchingTerm read(SolrIndexSearcher searcher,
> String field, DocSet docs) throws IOException
>  {
>    FirstLastMatchingTerm firstLast = new FirstLastMatchingTerm();
>    if( docs.size() > 0 ) {
>      IndexReader reader = searcher.getReader();
>
>      Terms terms = MultiFields.getTerms(reader, field);
>      if( terms == null ) {  // field may have no indexed terms
>        return firstLast;
>      }
>      TermsEnum te = terms.iterator();
>      BytesRef term = te.next();
>      while( term != null ) {
>        if( searcher.numDocs(new TermQuery(new Term(field,term)), docs) > 0 ) {
>          firstLast.last = term.utf8ToString();
>          if( firstLast.first == null ) {
>            firstLast.first = firstLast.last;
>          }
>        }
>        term = te.next();
>      }
>    }
>    return firstLast;
>  }
> }
>
> but the results are slow (and incorrect).  I tried some variations
> using ReaderUtil.Gather(), but the real hit seems to come from:
>  if( searcher.numDocs(new TermQuery(new Term(field,term)), docs) > 0 )
>
> Any ideas?  I'm not tied to the approach or indexing strategy, so if
> anyone has other suggestions that would be great.  Looking at it
> again, it seems crazy that you have to run a query for each term, but
> in 3.x it was fast enough.
>
> thanks
> ryan
>
