Re: Highlighting Performance On Large Documents

Serdar Sahin Sat, 08 May 2010 07:23:38 -0700

Hi,

Thanks a lot for the replies. I had a chance to test them today.


First of all, termVectors/termPositions/termOffsets did not help; they had
very little effect. I also tried a workaround, but it is not as efficient as
I thought.

From these fields:

        <field name="title" type="text" indexed="true" stored="true"
               required="true" omitNorms="true"/>
        <field name="description" type="text" indexed="true" stored="true"/>
        <field name="tags" type="text" indexed="true" stored="true"
               omitNorms="true"/>
        <field name="plainText" type="text" indexed="true" stored="false"/>

I tried to create a copyField destination:
        <field name="all_text" type="text" indexed="true" stored="true"
               multiValued="true" termVectors="true" termPositions="true"
               termOffsets="true" />

        <copyField source="title" dest="all_text" />
        <copyField source="tags" dest="all_text" />
        <copyField source="plainText" dest="all_text"  maxChars="20000"/>
        <copyField source="description" dest="all_text" />

I then indexed 1000 documents that each have more than 200 pages.

However, the maxChars attribute also limited the number of characters that
were indexed. The query "Institute of Information Systems" returned only 12
results. I also took unique words from the end of the text files and searched
for them, and they returned no results. So I only wanted to limit the size of
the stored field, but that did not work. Then I tried to create two copyField
destinations:

        <field name="all_text" type="text" indexed="true" stored="false"
               multiValued="true" />
        <field name="short_text" type="text" indexed="true" stored="true"
               multiValued="true" termVectors="true" termPositions="true"
               termOffsets="true" />

        <copyField source="title" dest="all_text" />
        <copyField source="tags" dest="all_text" />
        <copyField source="plainText" dest="all_text" />
        <copyField source="description" dest="all_text" />

        <copyField source="title" dest="short_text" />
        <copyField source="tags" dest="short_text" />
        <copyField source="plainText" dest="short_text" maxChars="20000"/>
        <copyField source="description" dest="short_text" />

It gave 168 results, as I expected, and highlighting also worked reasonably
fast. However, I don't know the Solr/Lucene internals, but I guess I am
indexing the same content twice, and that should affect performance and
storage. I tried setting indexed to false, but then it gave this error:

Caused by: java.lang.RuntimeException: SchemaField: all_text conflicting
indexed field options:

So, it was not possible.
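
One plausible reading of this error, stated as an assumption rather than a
confirmed diagnosis: the term vector options describe indexed data, so a
field that sets indexed="false" while keeping termVectors, termPositions, or
termOffsets set to "true" is rejected when the schema loads. A minimal
sketch, reusing the short_text field from above:

```xml
<!-- Assumed to be rejected: term vector options on a non-indexed field -->
<field name="short_text" type="text" indexed="false" stored="true"
       multiValued="true" termVectors="true" termPositions="true"
       termOffsets="true" />

<!-- Assumed to load: stored-only field with the term vector options dropped -->
<field name="short_text" type="text" indexed="false" stored="true"
       multiValued="true" />
```

The second variant gives up the term vector data, so whether it still suits
highlighting would need to be tested.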

Then I tried hl.useFastVectorHighlighter with the latest version (and yes I
have turned on termVectors, termPositions, termOffsets) but the result was
2-2.5x slower, which was very strange. Do you have any guess why this might
have happened?
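
For reference, the two-field setup described above could be expressed as
request handler defaults in solrconfig.xml. This is only a sketch: the
handler name and the choice of all_text as the default search field are
assumptions, while the hl.* names are standard Solr highlighting parameters.

```xml
<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <!-- Search the full, non-stored copy (assumed default field) -->
    <str name="df">all_text</str>
    <!-- Highlight only the truncated, stored copy -->
    <str name="hl">true</str>
    <str name="hl.fl">short_text</str>
    <str name="hl.snippets">2</str>
    <!-- Needs termVectors, termPositions and termOffsets on short_text -->
    <str name="hl.useFastVectorHighlighter">true</str>
  </lst>
</requestHandler>
```

Setting hl.useFastVectorHighlighter back to false here would make it easy to
compare the two highlighters on the same queries.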

> 1. Provide another field for highlighting and use copyField
> to copy plainText to the highlighting field. When using copyField,
> specify maxChars attribute to limit the length of the copy of plainText.
> This should work on Solr 1.4.


Do you mean I need to duplicate these four fields and use one copy for
storing and the other for indexing? Then I guess I can do:

       <field name="all_text" type="text" indexed="true" stored="false"
              multiValued="true" />
       <field name="short_text" type="text" indexed="false" stored="true"
              multiValued="true" termVectors="true" termPositions="true"
              termOffsets="true" />  (indexing false)

Is this the only solution? What are the effects on index size and
performance? Could you give me some advice?

Thanks,

Serdar



On Sat, May 8, 2010 at 1:00 PM, Lance Norskog <[email protected]> wrote:

> Do you have these options turned on when you index the text field:
> termVectors/termPositions/termOffsets ?
>
> Highlighting needs the information created by these analysis options.
> If they are not turned on, Solr has to load the document text and run
> the analyzer again with these options on, use that data to create the
> highlighting, then throw away the reanalyzed data. Without these
> options, you are basically re-indexing the document when you highlight
> it.
>
>
> http://www.lucidimagination.com/search/out?u=http%3A%2F%2Fwiki.apache.org%2Fsolr%2FFieldOptionsByUseCase
>
> On Wed, May 5, 2010 at 5:01 PM, Koji Sekiguchi <[email protected]> wrote:
> > (10/05/05 22:08), Serdar Sahin wrote:
> >>
> >> Hi,
> >>
> >> Currently, there are similar topics active in the mailing list, but I
> >> did not want to steal the topic.
> >>
> >> I have currently indexed 100,000 documents. They are Microsoft Office,
> >> PDF, etc. documents that I convert to TXT files before indexing. Files
> >> are between 1 and 500 pages. When I search and filter to retrieve
> >> documents that have more than 100 pages, with highlighting active, it
> >> takes 0.8-3 seconds depending on the query (10 results per page). If I
> >> retrieve documents that have 1-5 pages, it drops to 0.1 seconds.
> >>
> >> If I disable highlighting, it drops to 0.1-0.2 seconds, even on the
> >> large documents, which is more than enough. This problem mostly happens
> >> when there are no caches, on the first query. I use this configuration
> >> for highlighting:
> >>
> >>     $query->addHighlightField('description')->addHighlightField('plainText');
> >>     $query->setHighlightSimplePre('<strong>');
> >>     $query->setHighlightSimplePost('</strong>');
> >>     $query->setHighlightHighlightMultiTerm(TRUE);
> >>     $query->setHighlightMaxAnalyzedChars(10000);
> >>     $query->setHighlightSnippets(2);
> >>
> >> Do you have any suggestions to improve response time while highlighting
> >> is active? I have read a couple of the articles you previously provided,
> >> but they did not help.
> >>
> >> And for the second question, I retrieve these fields:
> >>
> >>     $query->addField('title')->addField('cat')->addField('thumbs_up')->
> >>             addField('thumbs_down')->addField('lang')->addField('id')->
> >>             addField('username')->addField('view_count')->addField('pages')->
> >>             addField('no_img')->addField('date');
> >>
> >> If I can't solve the highlighting problem on large documents, I can
> >> simply disable it and retrieve the first x characters from the plainText
> >> (full text) field, but is it possible to retrieve the first x characters
> >> without using the highlighting feature? When I use this:
> >>     $query->setHighlight(TRUE);
> >>     $query->setHighlightAlternateField('plainText');
> >>     $query->setHighlightMaxAnalyzedChars(0);
> >>     $query->setHighlightMaxAlternateFieldLength(256);
> >>
> >> It still takes 2 seconds if I retrieve 10 rows that have 200-300 pages.
> >> The highlighting still works, so it might be the source of the problem.
> >> I want to completely disable it and retrieve only the first 256
> >> characters of the plainText field. Is it possible? It may remove some
> >> overhead and give better performance.
> >>
> >> I personally prefer the highlighting solution, but I would also like to
> >> hear the solution for this problem. For the same query, if I disable
> >> highlighting and do not retrieve (but still search) the plainText field,
> >> it drops to 0.0094 seconds. So I think if I can get the first 256
> >> characters without using the highlighting, I will get better performance.
> >>
> >> Any suggestions regarding these two problems will be highly appreciated.
> >>
> >> Thanks,
> >>
> >> Serdar Sahin
> >>
> >>
> >
> > Hi Serdar,
> >
> > There are a few things I can think of that you can try.
> >
> > 1. Provide another field for highlighting and use copyField
> > to copy plainText to the highlighting field. When using copyField,
> > specify maxChars attribute to limit the length of the copy of plainText.
> > This should work on Solr 1.4.
> >
> > 2. If you can use branch_3x version of Solr, try FastVectorHighlighter.
> >
> > Koji
> >
> > --
> > http://www.rondhuit.com/en/
> >
> >
>
>
>
> --
> Lance Norskog
> [email protected]
>
