If you want to highlight field X, enabling termVectors, termPositions, and termOffsets on that field will make highlighting it faster. You should make a separate field for highlighting and apply these options to that field.
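As a rough sketch only, such a field might look like the fragment below. The name "highlight_text" is made up for illustration; plainText and maxChars="20000" are borrowed from the schema quoted further down, and the values would need adjusting to your setup.

```xml
<!-- Hypothetical dedicated highlighting field; "highlight_text" is an illustrative name. -->
<!-- It stays indexed because the term* attributes are index-time options. -->
<field name="highlight_text" type="text" indexed="true" stored="true"
       multiValued="false"
       termVectors="true" termPositions="true" termOffsets="true" />

<!-- Only one source is copied in, so the field holds a single value. -->
<!-- maxChars caps how much of plainText is copied. -->
<copyField source="plainText" dest="highlight_text" maxChars="20000" />
```

Highlighting requests would then name this field instead of plainText.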
Now: a copyField adds a "value" to a multiValued field, so for a text field you get a multi-valued text field. You should copy only one value to the highlighted field, so just copyField the document text to your special field. To enforce this, I would add multiValued="false" to that field, just to avoid mistakes.

So, all_text should be indexed without the term* attributes, and should not be stored. The document is then stored in a separate field that you use for highlighting and that has the term* attributes.

In general, highlighting has been a problem area all along, and there are small edge cases that I don't know how to solve.

On Sat, May 8, 2010 at 7:23 AM, Serdar Sahin <anlamar...@gmail.com> wrote:
> Hi,
>
> Thanks a lot for the replies, I had a chance today to test them.
>
> First of all termVectors/termPositions/termOffsets did not help, it has very
> little effect, but I tried a workaround, however it is not as efficient as I
> thought.
>
> From these fields;
>
> <field name="title" type="text" indexed="true" stored="true" required="true" omitNorms="true"/>
> <field name="description" type="text" indexed="true" stored="true" />
> <field name="tags" type="text" indexed="true" stored="true" omitNorms="true" />
> <field name="plainText" type="text" indexed="true" stored="false"/>
>
> I tried to create copyfield
> <field name="all_text" type="text" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true" />
>
> <copyField source="title" dest="all_text" />
> <copyField source="tags" dest="all_text" />
> <copyField source="plainText" dest="all_text" maxChars="20000"/>
> <copyField source="description" dest="all_text" />
>
> And I have indexed 1000 documents that have more than 200 pages.
>
> However, maxChars directive also limited the character limit for indexed
> field. For the query of "Institute of Information Systems", it gave 12
> results.
> I also tried to get unique words from the bottom of the text files and
> search for them; they did not give any result. So, I just wanted to limit the
> size of the stored field, but it did not work. Then I tried to create two
> copyfields
>
> <field name="all_text" type="text" indexed="true" stored="false" multiValued="true" />
> <field name="short_text" type="text" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true" />
>
> <copyField source="title" dest="all_text" />
> <copyField source="tags" dest="all_text" />
> <copyField source="plainText" dest="all_text" />
> <copyField source="description" dest="all_text" />
>
> <copyField source="title" dest="short_text" />
> <copyField source="tags" dest="short_text" />
> <copyField source="plainText" dest="short_text" maxChars="20000"/>
> <copyField source="description" dest="short_text" />
>
> It gave 168 results, as I expected, and highlighting also worked reasonably
> fast. However, I don't know Solr/Lucene internals but I guess I store the
> same indexed field twice, and it should have an effect on the performance and
> storage. I tried to make it false but then it gave this error:
>
> Caused by: java.lang.RuntimeException: SchemaField: all_text conflicting
> indexed field options:
>
> So, it was not possible.
>
> Then I tried hl.useFastVectorHighlighter with the latest version (and yes I
> have turned on termVectors, termPositions, termOffsets) but the result was
> 2-2.5x slower, which was very strange. Do you have any guess why this might
> have happened?
>
> > 1. Provide another field for highlighting and use copyField
> > to copy plainText to the highlighting field. When using copyField,
> > specify maxChars attribute to limit the length of the copy of plainText.
> > This should work on Solr 1.4.
> >
> Do you mean I need to duplicate these four fields and use one of them for
> storing and the other one for indexing?
> Then I guess, I can do;
>
> <field name="all_text" type="text" indexed="true" stored="false" multiValued="true" />
> <field name="short_text" type="text" indexed="false" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true" /> (indexing false)
>
> Is this the only solution? What are the effects on index size and
> performance? Could you give me some advice?
>
> Thanks,
>
> Serdar
>
>
>
> On Sat, May 8, 2010 at 1:00 PM, Lance Norskog <goks...@gmail.com> wrote:
>
>> Do you have these options turned on when you index the text field:
>> termVectors/termPositions/termOffsets ?
>>
>> Highlighting needs the information created by these analysis options.
>> If they are not turned on, Solr has to load the document text and run the
>> analyzer again with these options on, use that data to create the
>> highlighting, then throw away the reanalyzed data. Without these
>> options, you are basically re-indexing the document when you highlight
>> it.
>>
>>
>> http://www.lucidimagination.com/search/out?u=http%3A%2F%2Fwiki.apache.org%2Fsolr%2FFieldOptionsByUseCase
>>
>> On Wed, May 5, 2010 at 5:01 PM, Koji Sekiguchi <k...@r.email.ne.jp> wrote:
>> > (10/05/05 22:08), Serdar Sahin wrote:
>> >>
>> >> Hi,
>> >>
>> >> Currently, there are similar topics active in the mailing list, but I
>> >> did not want to steal the topic.
>> >>
>> >> I have currently indexed 100.000 documents, they are microsoft
>> >> office/pdf etc documents. I convert them to TXT files before indexing.
>> >> Files are between 1-500 pages. When I search something and filter it to
>> >> retrieve documents that have more than 100 pages, and activate
>> >> highlighting, it takes 0.8-3 seconds, depending on the query. (10
>> >> results per page) If I retrieve documents that have 1-5 pages, it drops
>> >> to 0.1 seconds.
>> >>
>> >> If I disable highlighting, it drops to 0.1-0.2 seconds, even on the large
>> >> documents, which is more than enough.
>> >> This problem mostly happens where there are no caches, on the first
>> >> query. I use this configuration for highlighting:
>> >>
>> >> $query->addHighlightField('description')->addHighlightField('plainText');
>> >> $query->setHighlightSimplePre('<strong>');
>> >> $query->setHighlightSimplePost('</strong>');
>> >> $query->setHighlightHighlightMultiTerm(TRUE);
>> >> $query->setHighlightMaxAnalyzedChars(10000);
>> >> $query->setHighlightSnippets(2);
>> >>
>> >> Do you have any suggestions to improve response time while highlighting
>> >> is active? I have read a couple of articles you have previously provided
>> >> but they did not help.
>> >>
>> >> And for the second question, I retrieve these fields:
>> >>
>> >> $query->addField('title')->addField('cat')->addField('thumbs_up')->
>> >> addField('thumbs_down')->addField('lang')->addField('id')->
>> >> addField('username')->addField('view_count')->addField('pages')->
>> >> addField('no_img')->addField('date');
>> >>
>> >> If I can't solve the highlighting problem on large documents, I can
>> >> simply disable it and retrieve the first x characters from the plainText
>> >> (full text) field, but is it possible to retrieve the first x characters
>> >> without using the highlighting feature? When I use this;
>> >> $query->setHighlight(TRUE);
>> >> $query->setHighlightAlternateField('plainText');
>> >> $query->setHighlightMaxAnalyzedChars(0);
>> >> $query->setHighlightMaxAlternateFieldLength(256);
>> >>
>> >> It still takes 2 seconds if I retrieve 10 rows that have 200-300 pages.
>> >> The highlighting still works so it might be the source of the problem, I
>> >> want to completely disable it and retrieve only the first 256 characters
>> >> of the plainText field. Is it possible? It may remove some overhead and
>> >> give better performance.
>> >>
>> >> I personally prefer the highlighting solution but I also would like to
>> >> hear the solution for this problem.
>> >> For the same query, if I disable highlighting and do not retrieve (but
>> >> still search) the plainText field, it drops to 0.0094 seconds. So I
>> >> think if I can get the first 256 characters without using the
>> >> highlighting, I will get better performance.
>> >>
>> >> Any suggestions regarding these two problems will be highly appreciated.
>> >>
>> >> Thanks,
>> >>
>> >> Serdar Sahin
>> >>
>> >>
>> >
>> > Hi Serdar,
>> >
>> > There are a few things I can think of that you can try.
>> >
>> > 1. Provide another field for highlighting and use copyField
>> > to copy plainText to the highlighting field. When using copyField,
>> > specify maxChars attribute to limit the length of the copy of plainText.
>> > This should work on Solr 1.4.
>> >
>> > 2. If you can use branch_3x version of Solr, try FastVectorHighlighter.
>> >
>> > Koji
>> >
>> > --
>> > http://www.rondhuit.com/en/
>> >
>> >
>>
>>
>>
>> --
>> Lance Norskog
>> goks...@gmail.com
>>
>

--
Lance Norskog
goks...@gmail.com