Re: Highlighting Performance On Large Documents

Lance Norskog Mon, 10 May 2010 13:32:27 -0700

To search in a field, it has to be indexed. To highlight a field, it
only has to be stored; you can store it without indexing it. If you
also index it with the term* options, highlighting should be faster.
Since these options did not speed up highlighting for you, your
analysis stack is probably very simple, so re-analyzing the stored
text at highlight time probably costs little. The term* options are
variations on indexing: if the field is not indexed they mean nothing,
and Solr gives you the "conflicting indexed field options" error.
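
For illustration, here is a minimal schema.xml sketch of the two
combinations discussed in this thread. The field names all_text and
short_text are the ones quoted below; short_text_stored_only is a
hypothetical name used only to show the store-only variant. Treat this
as an untested example, not a recommended configuration.

```xml
<!-- Searched, never displayed: indexed only, no term* options. -->
<field name="all_text" type="text" indexed="true" stored="false"
       multiValued="true" />

<!-- Highlight source with term* options: these are index-time
     options, so the field must have indexed="true". -->
<field name="short_text" type="text" indexed="true" stored="true"
       multiValued="true" termVectors="true" termPositions="true"
       termOffsets="true" />

<!-- Store-only variant (hypothetical field name): with
     indexed="false" the term* attributes must be left out, or
     Solr reports "conflicting indexed field options". -->
<field name="short_text_stored_only" type="text" indexed="false"
       stored="true" multiValued="true" />
```

The store-only variant is the one that worked without the error in the
message quoted below.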


"highlighting 10 documents that have 200-400 A4 pages still takes
around 2 seconds,"

I have never seen terms/second or docs/second benchmarks for
highlighting. This performance is probably what I would expect.

On Sat, May 8, 2010 at 5:34 PM, Serdar Sahin <anlamar...@gmail.com> wrote:
> Hi,
>
> Sorry for the second e-mail, but for the duplication problem, I have done
> something wrong, ok now it works, and the query time reduced to 0.1 seconds
> which is perfect. However, still if I use
>
>       <field name="short_text" type="text" indexed="false" stored="true"
> multiValued="true" termVectors="true" termPositions="true"
> termOffsets="true" />
>
> term* directives, it gives the same error, so either I will index short_text
> field as well or not use the term* directives. It still gives perfect result
> so I am not using them.
>
> Thanks everyone, I hope they will be useful for others as well.
>
> Serdar
>
> On Sun, May 9, 2010 at 10:05 AM, Serdar Sahin <anlamar...@gmail.com> wrote:
>
>> Hi,
>>
>> Thanks. However as I said before, termOffsets/termPositions/termVectors
>> had very little effect on the performance and I don't know why. I have done
>> exactly what you are saying but highlighting 10 documents that have 200-400
>> A4 pages still takes around 2 seconds, depending on the query. I will play
>> with it more.
>>
>> I actually want to highlight (and search) 4 fields not one field if
>> possible. So that's why I have added those four fields to the highlighting.
>> However, what I want is to store only limited character from the plainText
>> field, and I'll use that copyfield as an alternate text field as well. So if
>> it cannot find any matches for highlighting due to limited character size
>> from the plainText field, I can bring description, and if it is not
>> available, then I can bring plainText, and also if it is not available (for
>> example for scanned documents), then I can bring tags or title as a
>> snippet/description and insert it into the web page. So, that's why I need
>> multivalued text field for both sides, for indexing and storing. Just
>> storing will be a little different.
>>
>> I have also tried to duplicate these four fields and use each one of them
>> separately for indexing and storing to avoid indexing both copyfields (my
>> previous email) . The problem was for both copyfields;
>>
>>        <field name="all_text" type="text" indexed="true" stored="false"
>> multiValued="true" />
>>        <field name="short_text" type="text" indexed="true" stored="true"
>> multiValued="true" termVectors="true" termPositions="true"
>> termOffsets="true" />
>>
>> I had to index them even if I use it just for storing. It was giving an
>> error. So I have duplicated these four fields;
>>
>> mydata.xml
>>                                 <field column="plainText"
>> name="plain_text"/>
>>                                 <field column="plainText"
>> name="plain_text_ind"/>
>>
>> schema.xml
>>         <field name="plain_text_ind" type="text" indexed="true"
>> stored="false"/>
>>         <field name="plain_text" type="text" indexed="false"
>> stored="true"/>
>>
>> I have done it for these four fields and created copyfields for them.
>>
>>        <field name="all_text" type="text" indexed="true" stored="false"
>> multiValued="true" />
>>        <field name="short_text" type="text" indexed="false" stored="true"
>> multiValued="true" termVectors="true" termPositions="true"
>> termOffsets="true" />
>>
>> but still it did not work and gave the same error;
>>
>> Caused by: java.lang.RuntimeException: SchemaField: short_text conflicting
>> indexed field options:
>>
>> So, I have disabled term* directives just for testing
>> and successfully indexed the data, but there was no short_text column in the
>> solr index. Maybe duplicating does not work or I have done something wrong.
>>
>> So I guess my only way is to index short_text field as well without
>> duplicating anything.
>>
>>      <field name="all_text" type="text" indexed="true" stored="false"
>> multiValued="true" />
>>        <field name="short_text" type="text" indexed="true" stored="true"
>> multiValued="true" termVectors="true" termPositions="true"
>> termOffsets="true" />
>>
>> Thanks again,
>>
>> Serdar
>>
>> On Sun, May 9, 2010 at 9:24 AM, Lance Norskog <goks...@gmail.com> wrote:
>>
>>> If you want to highlight field X, doing the
>>> termOffsets/termPositions/termVectors will make highlighting that
>>> field faster. You should make a separate field and apply these options
>>> to that field.
>>>
>>> Now: doing a copyfield adds a "value" to a multiValued field. For a
>>> text field, you get a multi-valued text field. You should only copy
>>> one value to the highlighted field, so just copyField the document to
>>> your special field. To enforce this, I would add multiValued="false"
>>> to that field, just to avoid mistakes.
>>>
>>> So, all_text should be indexed without the term* attributes, and
>>> should not be stored. Then your document stored in a separate field
>>> that you use for highlighting and has the term* attributes.
>>>
>>> In general, highlighting has been a problem area all along and there
>>> are little edge cases that I don't know how to solve.
>>>
>>> On Sat, May 8, 2010 at 7:23 AM, Serdar Sahin <anlamar...@gmail.com> wrote:
>>> > Hi,
>>> >
>>> > Thanks a lot for the replies, I could have chance today to test them.
>>> >
>>> > First of all, termVectors/termPositions/termOffsets did not help; they
>>> > had very little effect. I tried a workaround, but it is not as
>>> > efficient as I thought.
>>> >
>>> > From these fields;
>>> >
>>> >        <field name="title" type="text" indexed="true" stored="true"
>>> >               required="true" omitNorms="true"/>
>>> >        <field name="description" type="text" indexed="true"
>>> >               stored="true" />
>>> >        <field name="tags" type="text" indexed="true" stored="true"
>>> >               omitNorms="true" />
>>> >        <field name="plainText" type="text" indexed="true"
>>> >               stored="false"/>
>>> >
>>> > I tried to create copyfield
>>> >        <field name="all_text" type="text" indexed="true" stored="true"
>>> > multiValued="true" termVectors="true" termPositions="true"
>>> > termOffsets="true" />
>>> >
>>> >        <copyField source="title" dest="all_text" />
>>> >        <copyField source="tags" dest="all_text" />
>>> >        <copyField source="plainText" dest="all_text"  maxChars="20000"/>
>>> >        <copyField source="description" dest="all_text" />
>>> >
>>> > And I have indexed 1000 documents that have more than 200 pages.
>>> >
>>> > However, the maxChars directive also limited what was indexed, not
>>> > just what was stored. For the query "Institute of Information
>>> > Systems", it gave 12 results. I also took unique words from the bottom
>>> > of the text files and searched for them; they did not give any result.
>>> > So, I just wanted to limit the size of the stored field, but it did not
>>> > work. Then I tried to create two copyfields
>>> >
>>> >       <field name="all_text" type="text" indexed="true" stored="false"
>>> > multiValued="true" />
>>> >       <field name="short_text" type="text" indexed="true" stored="true"
>>> > multiValued="true" termVectors="true" termPositions="true"
>>> > termOffsets="true" />
>>> >
>>> >        <copyField source="title" dest="all_text" />
>>> >        <copyField source="tags" dest="all_text" />
>>> >        <copyField source="plainText" dest="all_text" />
>>> >        <copyField source="description" dest="all_text" />
>>> >
>>> >        <copyField source="title" dest="short_text" />
>>> >        <copyField source="tags" dest="short_text" />
>>> >        <copyField source="plainText" dest="short_text"
>>> >                   maxChars="20000"/>
>>> >        <copyField source="description" dest="short_text" />
>>> >
>>> > It gave 168 results, as I expected, and highlighting also worked
>>> > reasonably fast. However, I don't know Solr/Lucene internals, but I
>>> > guess I store the same indexed field twice, and that should have an
>>> > effect on performance and storage. I tried to set indexed="false",
>>> > but then it gave this error:
>>> >
>>> > Caused by: java.lang.RuntimeException: SchemaField: all_text conflicting
>>> > indexed field options:
>>> >
>>> > So, it was not possible.
>>> >
>>> > Then I tried hl.useFastVectorHighlighter with the latest version (and
>>> > yes, I have turned on termVectors, termPositions, termOffsets) but the
>>> > result was 2-2.5x slower, which was very strange. Do you have any
>>> > guess why this might have happened?
>>> >
>>> > 1. Provide another field for highlighting and use copyField
>>> >
>>> > to copy plainText to the highlighting field. When using copyField,
>>> >
>>> > specify maxChars attribute to limit the length of the copy of plainText.
>>> >
>>> > This should work on Solr 1.4.
>>> >
>>> >
>>> > Do you mean I need to duplicate these four fields and use one of them
>>> > for storing and the other one for indexing? Then I guess I can do:
>>> >
>>> >      <field name="all_text" type="text" indexed="true" stored="false"
>>> > multiValued="true" />
>>> >       <field name="short_text" type="text" indexed="false" stored="true"
>>> > multiValued="true" termVectors="true" termPositions="true"
>>> > termOffsets="true" />  (indexing false)
>>> >
>>> > Is this the only solution? What are the effects on index size and
>>> > performance? Could you give me an advice?
>>> >
>>> > Thanks,
>>> >
>>> > Serdar
>>> >
>>> >
>>> >
>>> > On Sat, May 8, 2010 at 1:00 PM, Lance Norskog <goks...@gmail.com> wrote:
>>> >
>>> >> Do you have these options turned on when you index the text field:
>>> >> termVectors/termPositions/termOffsets ?
>>> >>
>>> >> Highlighting needs the information created by these analysis options.
>>> >> If they are not turned on, Solr has to load the document text and run
>>> >> the analyzer again with these options on, use that data to create the
>>> >> highlighting, then throw away the reanalyzed data. Without these
>>> >> options, you are basically re-indexing the document when you highlight
>>> >> it.
>>> >>
>>> >>
>>> >>
>>> http://www.lucidimagination.com/search/out?u=http%3A%2F%2Fwiki.apache.org%2Fsolr%2FFieldOptionsByUseCase
>>> >>
>>> >> On Wed, May 5, 2010 at 5:01 PM, Koji Sekiguchi <k...@r.email.ne.jp> wrote:
>>> >> > (10/05/05 22:08), Serdar Sahin wrote:
>>> >> >>
>>> >> >> Hi,
>>> >> >>
>>> >> >> Currently, there are similar topics active on the mailing list, but
>>> >> >> I did not want to steal the topic.
>>> >> >>
>>> >> >> I have currently indexed 100,000 documents. They are Microsoft
>>> >> >> Office/PDF etc. documents that I convert to TXT files before
>>> >> >> indexing. Files are between 1 and 500 pages. When I search for
>>> >> >> something, filter to retrieve documents that have more than 100
>>> >> >> pages, and activate highlighting, it takes 0.8-3 seconds, depending
>>> >> >> on the query (10 results per page). If I retrieve documents that
>>> >> >> have 1-5 pages, it drops to 0.1 seconds.
>>> >> >>
>>> >> >> If I disable highlighting, it drops to 0.1-0.2 seconds, even on the
>>> >> >> large documents, which is more than enough. This problem mostly
>>> >> >> happens when there are no caches, on the first query. I use this
>>> >> >> configuration for highlighting:
>>> >> >>
>>> >> >>     $query->addHighlightField('description')
>>> >> >>           ->addHighlightField('plainText');
>>> >> >>     $query->setHighlightSimplePre('<strong>');
>>> >> >>     $query->setHighlightSimplePost('</strong>');
>>> >> >>     $query->setHighlightHighlightMultiTerm(TRUE);
>>> >> >>     $query->setHighlightMaxAnalyzedChars(10000);
>>> >> >>     $query->setHighlightSnippets(2);
>>> >> >>
>>> >> >> Do you have any suggestions to improve response time while
>>> >> >> highlighting is active? I have read a couple of articles you have
>>> >> >> previously provided, but they did not help.
>>> >> >>
>>> >> >> And for the second question, I retrieve these fields:
>>> >> >>
>>> >> >>     $query->addField('title')->addField('cat')->addField('thumbs_up')
>>> >> >>           ->addField('thumbs_down')->addField('lang')->addField('id')
>>> >> >>           ->addField('username')->addField('view_count')
>>> >> >>           ->addField('pages')->addField('no_img')->addField('date');
>>> >> >>
>>> >> >> If I can't solve the highlighting problem on large documents, I can
>>> >> >> simply disable it and retrieve the first x characters from the
>>> >> >> plainText (full text) field, but is it possible to retrieve the
>>> >> >> first x characters without using the highlighting feature? When I
>>> >> >> use this:
>>> >> >>     $query->setHighlight(TRUE);
>>> >> >>     $query->setHighlightAlternateField('plainText');
>>> >> >>     $query->setHighlightMaxAnalyzedChars(0);
>>> >> >>     $query->setHighlightMaxAlternateFieldLength(256);
>>> >> >>
>>> >> >> It still takes 2 seconds if I retrieve 10 rows that have 200-300
>>> >> >> pages. The highlighting is still active, so it might be the source
>>> >> >> of the problem. I want to completely disable it and retrieve only
>>> >> >> the first 256 characters of the plainText field. Is it possible? It
>>> >> >> may remove some overhead and give better performance.
>>> >> >>
>>> >> >> I personally prefer the highlighting solution, but I would also like
>>> >> >> to hear the solution for this problem. For the same query, if I
>>> >> >> disable highlighting and do not retrieve (but still search) the
>>> >> >> plainText field, it drops to 0.0094 seconds. So I think if I can get
>>> >> >> the first 256 characters without using the highlighting, I will get
>>> >> >> better performance.
>>> >> >>
>>> >> >> Any suggestions regarding these two problems will be highly
>>> >> >> appreciated.
>>> >> >>
>>> >> >> Thanks,
>>> >> >>
>>> >> >> Serdar Sahin
>>> >> >>
>>> >> >>
>>> >> >
>>> >> > Hi Serdar,
>>> >> >
>>> >> > There are a few things I can think of that you can try.
>>> >> >
>>> >> > 1. Provide another field for highlighting and use copyField
>>> >> > to copy plainText to the highlighting field. When using copyField,
>>> >> > specify the maxChars attribute to limit the length of the copy of
>>> >> > plainText. This should work on Solr 1.4.
>>> >> >
>>> >> > 2. If you can use the branch_3x version of Solr, try
>>> >> > FastVectorHighlighter.
>>> >> >
>>> >> > Koji
>>> >> >
>>> >> > --
>>> >> > http://www.rondhuit.com/en/
>>> >> >
>>> >> >
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Lance Norskog
>>> >> goks...@gmail.com
>>> >>
>>> >
>>>
>>>
>>>
>>> --
>>> Lance Norskog
>>> goks...@gmail.com
>>>
>>
>>
>



-- 
Lance Norskog
goks...@gmail.com
