Re: Highlighting Performance On Large Documents

Lance Norskog Sat, 08 May 2010 16:25:09 -0700

If you want to highlight field X, enabling
termVectors/termPositions/termOffsets on it will make highlighting that
field faster. You should make a separate field and apply these options
to that field.


Now: a copyField adds a "value" to a multiValued field. For a
text field, you get a multi-valued text field. You should copy only
one value to the highlighted field, so just copyField the document to
your special field. To enforce this, I would add multiValued="false"
to that field, just to avoid mistakes.

So, all_text should be indexed without the term* attributes, and
should not be stored. Then your document is stored in a separate field
that you use for highlighting and that has the term* attributes.
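
A minimal sketch of that layout in schema.xml, under some assumptions:
the field name hl_text is hypothetical, the source fields and the
maxChars value are taken from the schema quoted below, and the
highlighting field keeps indexed="true" because turning indexing off
while the term* attributes are on appears to be what triggers the
"conflicting indexed field options" error quoted below.

```xml
<!-- Search field: indexed only, not stored, no term* attributes. -->
<field name="all_text" type="text" indexed="true" stored="false"
       multiValued="true" />

<!-- Highlighting field (hypothetical name): stored, single-valued,
     with the term* attributes turned on. -->
<field name="hl_text" type="text" indexed="true" stored="true"
       multiValued="false" termVectors="true" termPositions="true"
       termOffsets="true" />

<!-- Everything searchable goes into all_text. -->
<copyField source="title" dest="all_text" />
<copyField source="tags" dest="all_text" />
<copyField source="plainText" dest="all_text" />
<copyField source="description" dest="all_text" />

<!-- Only one value is copied into the highlighting field;
     maxChars caps the size of the copy. -->
<copyField source="plainText" dest="hl_text" maxChars="20000" />
```

Highlighting would then be requested on hl_text (for example with
hl.fl=hl_text) while the query itself searches all_text. Because the
copy is truncated by maxChars, matches beyond that point in a long
document cannot produce a snippet.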

In general, highlighting has been a problem area all along and there
are little edge cases that I don't know how to solve.

On Sat, May 8, 2010 at 7:23 AM, Serdar Sahin <anlamar...@gmail.com> wrote:
> Hi,
>
> Thanks a lot for the replies. I had a chance to test them today.
>
> First of all, termVectors/termPositions/termOffsets did not help; they had
> very little effect. I tried a workaround, but it is not as efficient as I
> thought.
>
> From these fields;
>
>        <field name="title" type="text" indexed="true" stored="true"
> required="true" omitNorms="true"/>
>        <field name="description" type="text" indexed="true" stored="true"
> />
>        <field name="tags" type="text" indexed="true" stored="true"
> omitNorms="true" />
>        <field name="plainText" type="text" indexed="true" stored="false"/>
>
> I tried to create a copyField:
>        <field name="all_text" type="text" indexed="true" stored="true"
> multiValued="true" termVectors="true" termPositions="true"
> termOffsets="true" />
>
>        <copyField source="title" dest="all_text" />
>        <copyField source="tags" dest="all_text" />
>        <copyField source="plainText" dest="all_text"  maxChars="20000"/>
>        <copyField source="description" dest="all_text" />
>
> And I have indexed 1000 documents that have more than 200 pages.
>
> However, the maxChars directive also limited the number of characters in the
> indexed field. For the query "Institute of Information Systems", it gave 12
> results. I also tried taking unique words from the bottom of the text files
> and searching for them; they did not give any results. So, I just wanted to
> limit the size of the stored field, but it did not work. Then I tried to
> create two copyFields:
>
>       <field name="all_text" type="text" indexed="true" stored="false"
> multiValued="true" />
>       <field name="short_text" type="text" indexed="true" stored="true"
> multiValued="true" termVectors="true" termPositions="true"
> termOffsets="true" />
>
>        <copyField source="title" dest="all_text" />
>        <copyField source="tags" dest="all_text" />
>        <copyField source="plainText" dest="all_text" />
>        <copyField source="description" dest="all_text" />
>
>        <copyField source="title" dest="short_text" />
>        <copyField source="tags" dest="short_text" />
>        <copyField source="plainText" dest="short_text" maxChars="20000"/>
>        <copyField source="description" dest="short_text" />
>
> It gave 168 results, as I expected, and highlighting also worked reasonably
> fast. However, I don't know Solr/Lucene internals, but I guess I store the
> same indexed field twice, and that should have an effect on performance and
> storage. I tried to set indexed to false, but then it gave this error:
>
> Caused by: java.lang.RuntimeException: SchemaField: all_text conflicting
> indexed field options:
>
> So, it was not possible.
>
> Then I tried hl.useFastVectorHighlighter with the latest version (and yes I
> have turned on termVectors, termPositions, termOffsets) but the result was
> 2-2.5x slower, which was very strange. Do you have any guess why this might
> have happened?
>
> 1. Provide another field for highlighting and use copyField
> to copy plainText to the highlighting field. When using copyField,
> specify maxChars attribute to limit the length of the copy of plainText.
> This should work on Solr 1.4.
>
>
> Do you mean I need to duplicate these four fields and use one of them for
> storing and the other one for indexing? Then I guess I can do:
>
>      <field name="all_text" type="text" indexed="true" stored="false"
> multiValued="true" />
>       <field name="short_text" type="text" indexed="false" stored="true"
> multiValued="true" termVectors="true" termPositions="true"
> termOffsets="true" />  (indexing false)
>
> Is this the only solution? What are the effects on index size and
> performance? Could you give me some advice?
>
> Thanks,
>
> Serdar
>
>
>
> On Sat, May 8, 2010 at 1:00 PM, Lance Norskog <goks...@gmail.com> wrote:
>
>> Do you have these options turned on when you index the text field:
>> termVectors/termPositions/termOffsets ?
>>
>> Highlighting needs the information created by these analysis options.
>> If they are not turned on, Solr has to load the document text and run the
>> analyzer again with these options on, use that data to create the
>> highlighting, then throw away the reanalyzed data. Without these
>> options, you are basically re-indexing the document when you highlight
>> it.
>>
>>
>> http://www.lucidimagination.com/search/out?u=http%3A%2F%2Fwiki.apache.org%2Fsolr%2FFieldOptionsByUseCase
>>
>> On Wed, May 5, 2010 at 5:01 PM, Koji Sekiguchi <k...@r.email.ne.jp> wrote:
>> > (10/05/05 22:08), Serdar Sahin wrote:
>> >>
>> >> Hi,
>> >>
>> >> Currently, there are similar topics active in the mailing list, but I
>> >> did not want to steal the topic.
>> >>
>> >> I have currently indexed 100,000 documents. They are Microsoft Office/PDF
>> >> etc. documents that I convert to TXT files before indexing. Files are
>> >> between 1 and 500 pages. When I search for something, filter the results
>> >> to documents that have more than 100 pages, and activate highlighting, it
>> >> takes 0.8-3 seconds, depending on the query (10 results per page). If I
>> >> retrieve documents that have 1-5 pages, it drops to 0.1 seconds.
>> >>
>> >> If I disable highlighting, it drops to 0.1-0.2 seconds, even on the large
>> >> documents, which is more than enough. This problem mostly happens when
>> >> there are no caches, on the first query. I use this configuration for
>> >> highlighting:
>> >>
>> >>
>> >>
>> >>     $query->addHighlightField('description')->addHighlightField('plainText');
>> >>     $query->setHighlightSimplePre('<strong>');
>> >>     $query->setHighlightSimplePost('</strong>');
>> >>     $query->setHighlightHighlightMultiTerm(TRUE);
>> >>     $query->setHighlightMaxAnalyzedChars(10000);
>> >>     $query->setHighlightSnippets(2);
>> >>
>> >> Do you have any suggestions to improve response time while highlighting
>> >> is active? I have read a couple of the articles you previously provided,
>> >> but they did not help.
>> >>
>> >> And for the second question, I retrieve these fields:
>> >>
>> >>     $query->addField('title')->addField('cat')->addField('thumbs_up')->
>> >>             addField('thumbs_down')->addField('lang')->addField('id')->
>> >>             addField('username')->addField('view_count')->addField('pages')->
>> >>             addField('no_img')->addField('date');
>> >>
>> >> If I can't solve the highlighting problem on large documents, I can simply
>> >> disable it and retrieve the first x characters from the plainText (full
>> >> text) field, but is it possible to retrieve the first x characters without
>> >> using the highlighting feature? When I use this:
>> >>     $query->setHighlight(TRUE);
>> >>     $query->setHighlightAlternateField('plainText');
>> >>     $query->setHighlightMaxAnalyzedChars(0);
>> >>     $query->setHighlightMaxAlternateFieldLength(256);
>> >>
>> >> It still takes 2 seconds if I retrieve 10 rows that have 200-300 pages.
>> >> The highlighting still works, so it might be the source of the problem. I
>> >> want to completely disable it and retrieve only the first 256 characters
>> >> of the plainText field. Is it possible? It may remove some overhead and
>> >> give better performance.
>> >>
>> >> I personally prefer the highlighting solution, but I would also like to
>> >> hear the solution for this problem. For the same query, if I disable
>> >> highlighting and do not retrieve (but still search) the plainText field,
>> >> it drops to 0.0094 seconds. So I think if I can get the first 256
>> >> characters without using the highlighting, I will get better performance.
>> >>
>> >> Any suggestions regarding these two problems will be highly appreciated.
>> >>
>> >> Thanks,
>> >>
>> >> Serdar Sahin
>> >>
>> >>
>> >
>> > Hi Serdar,
>> >
>> > There are a few things I can think of that you can try.
>> >
>> > 1. Provide another field for highlighting and use copyField
>> > to copy plainText to the highlighting field. When using copyField,
>> > specify maxChars attribute to limit the length of the copy of plainText.
>> > This should work on Solr 1.4.
>> >
>> > 2. If you can use branch_3x version of Solr, try FastVectorHighlighter.
>> >
>> > Koji
>> >
>> > --
>> > http://www.rondhuit.com/en/
>> >
>> >
>>
>>
>>
>> --
>> Lance Norskog
>> goks...@gmail.com
>>
>



-- 
Lance Norskog
goks...@gmail.com
