My tests show that about 1/4 of the documents are relevant for scoring per
query. So for my example with 100000 stacktraces in the index, I need to
score 25000 documents. I have a native implementation of the scoring
algorithm which scores all 100000; that takes about 20ms. The Lucene
implementation needs >100ms for the same query, which really sucks. Without
retrieving fields it needs about 6ms, and that is what my target should be.
I tried without LAZY_LOAD, but there is no real difference. How can I sort
by docIds first?
FieldCache.DEFAULT.getStrings is not an option because of the memory
problem.
This is how I store frames:
for (StacktraceFrame frame : stacktrace.getFrames()) {
    doc.add(new Field(FIELD_FRAMES,
        frame.getClassName() + "." + frame.getMethod(), Store.YES, Index.NOT_ANALYZED));
}
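
To make the docId-sorting question concrete, here is a rough sketch of how I understand the suggestion: collect the matching docIds first, sort them ascending, and only then load the stored frames in that order. The class SortedDocLoad and the FrameLoader interface are hypothetical names made up for this mail; FrameLoader is only a stand-in for whatever wraps reader.document(docId, selector). I have not measured whether this helps.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

final class SortedDocLoad {

    /**
     * Hypothetical stand-in for the stored-field lookup. In the real code
     * this would presumably wrap reader.document(docId, selector).
     */
    interface FrameLoader {
        String[] loadFrames(int docId);
    }

    /**
     * Loads the frames of all matching documents in ascending docId order,
     * so the stored-field files are read front to back instead of randomly.
     * The returned map iterates in that same ascending order.
     */
    static Map<Integer, String[]> loadInDocIdOrder(int[] matchingDocIds, FrameLoader loader) {
        int[] sorted = matchingDocIds.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);

        Map<Integer, String[]> framesByDocId = new LinkedHashMap<Integer, String[]>();
        for (int docId : sorted) {
            framesByDocId.put(docId, loader.loadFrames(docId));
        }
        return framesByDocId;
    }
}
```

The scoring would then run over the returned map after all frames are loaded, instead of loading each document at the moment the Scorer reaches it.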
2010/9/9 Michael McCandless <[email protected]>
> What a neat search engine! (Searching stack traces).
>
> Unfortunately, loading stored fields is slowish -- it entails 2 disk
> seeks under the hood. Really you should retrieve at most a page worth
> of docs, in the serial path of a query. How many are you retrieving
> per query?
>
> That said, you shouldn't use LAZY_LOAD if you know you will need the
> value. Also, it's possible that sorting the docIDs (ascending) first
> may get you better performance since your load is then a single scan
> of the 2 files in the index.
>
> You may want to use FieldCache.DEFAULT.getStrings instead -- this
> gives you a very fast String[], but, may suck up tons of memory
> depending on how many unique frames there are (how do you index each
> frame?).
>
> Mike
>
> On Thu, Sep 9, 2010 at 4:01 AM, Johannes Lerch
> <[email protected]> wrote:
> > Hi,
> >
> > I am working on a search for stacktraces. To do this I implemented my
> > own Query, Weight and Scorer. I save exception, method and the frames
> > as fields in the index and am able to pick relevant documents by
> > matching those fields with my query stacktrace (using
> > IndexReader.termDocs()). I implemented my own scoring, which is
> > calculated pairwise for stacktraces (the one of the query and each of
> > the relevant documents). For this scoring I calculate a similarity
> > between both traces by comparing the frames if they exist in both, and
> > also check for ordering. This works similarly to diff on text/source
> > code.
> > My problem is that I need all frames contained in both stacktraces, so
> > I have to retrieve all frame fields of the stored stacktraces. For now
> > I do this with:
> > Document document = reader.document(doc, new FieldSelector() {
> >     @Override
> >     public FieldSelectorResult accept(String fieldName) {
> >         if (Indexer.FIELD_FRAMES.equals(fieldName))
> >             return FieldSelectorResult.LAZY_LOAD;
> >         else
> >             return FieldSelectorResult.NO_LOAD;
> >     }
> > });
> > Fieldable[] fieldables = document.getFieldables(Indexer.FIELD_FRAMES);
> >
> > But this call really decreases performance to something which is not
> > acceptable for me (>10 times slower on 100000 stacktraces in index). So
> > my question is: are there other ways to get stored fields, or do you
> > have ideas for workarounds? Would it be better to store all stacktraces
> > in a database and retrieve them from there? If so, how do I get the
> > docId of stacktraces I wrote to the index?
> >
> > Regards,
> > Johannes
> >
>