To search a field, it has to be indexed. If you only want to highlight a field, you can store it without indexing it. If you index it with the term* options (termVectors, termPositions, termOffsets), highlighting should be faster. Since these options did not speed up highlighting for you, your analysis stack is probably very simple, so re-analyzing the stored text at highlight time is cheap. The term* options are variations on indexing: if the field is not indexed, they mean nothing, and Solr gives you the "conflicting indexed field options" error.
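
As a rough sketch of what that means for the schema in this thread (the field names are the ones from your mail, and I have not run these exact definitions), either of the short_text variants below should load. It is indexed="false" combined with any term* option that triggers the error:

```xml
<!-- Searched but never returned: indexed, not stored, no term* options. -->
<field name="all_text" type="text" indexed="true" stored="false" multiValued="true" />

<!-- Option A: stored for highlighting, and indexed so the term* options are legal. -->
<field name="short_text" type="text" indexed="true" stored="true" multiValued="true"
       termVectors="true" termPositions="true" termOffsets="true" />

<!-- Option B: stored only. The term* options are dropped because the field is not indexed. -->
<!--
<field name="short_text" type="text" indexed="false" stored="true" multiValued="true" />
-->
```

Option A costs extra index space for the term vectors. Option B keeps the index smaller but leaves the highlighter to re-analyze the stored text on every request.
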
"highlighting 10 documents that have 200-400 A4 pages still takes around 2 seconds," I have never seen terms/second or docs/second benchmarks for highlighting. This performance is probably what I would expect. On Sat, May 8, 2010 at 5:34 PM, Serdar Sahin <anlamar...@gmail.com> wrote: > Hi, > > Sorry for the second e-mail, but for the duplication problem, I have done > something wrong, ok now it works, and the query time reduced to 0.1 seconds > which is perfect. However, still if I use > > <field name="short_text" type="text" indexed="false" stored="true" > multiValued="true" termVectors="true" termPositions="true" > termOffsets="true" /> > > term* directives, it gives the same error, so either I will index short_text > field as well or not use the term* directives. It still gives perfect result > so I am not using them. > > Thanks everyone, I hope they will be useful for others as well. > > Serdar > > On Sun, May 9, 2010 at 10:05 AM, Serdar Sahin <anlamar...@gmail.com> wrote: > >> Hi, >> >> Thanks. However as I said before, termOffsets/termPositions/termVectors >> had very little effect on the performance and I don't know why. I have done >> exactly what you are saying but highlighting 10 documents that have 200-400 >> A4 pages still takes around 2 seconds, depending on the query. I will play >> with it more. >> >> I actually want to highlight (and search) 4 fields not one field if >> possible. So that's why I have added those four fields to the highlighting. >> However, what I want is to store only limited character from the plainText >> field, and I'll use that copyfield as an alternate text field as well. So if >> it cannot find any matches for highlighting due to limited character size >> from the plainText field, I can bring description, and if it is not >> available, then I can bring plainText, and also if it is not available (for >> example for scanned documents), then I can bring tags or title as a >> snippet/description and insert it into the web page. 
>> So, that's why I need multivalued text field for both sides, for indexing
>> and storing. Just storing will be a little different.
>>
>> I have also tried to duplicate these four fields and use each one of them
>> separately for indexing and storing to avoid indexing both copyfields (my
>> previous email). The problem was for both copyfields;
>>
>> <field name="all_text" type="text" indexed="true" stored="false" multiValued="true" />
>> <field name="short_text" type="text" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true" />
>>
>> I had to index them even if I use it just for storing. It was giving an
>> error. So I have duplicated these four fields;
>>
>> mydata.xml
>> <field column="plainText" name="plain_text"/>
>> <field column="plainText" name="plain_text_ind"/>
>>
>> schema.xml
>> <field name="plain_text_ind" type="text" indexed="true" stored="false"/>
>> <field name="plain_text" type="text" indexed="false" stored="true"/>
>>
>> I have done it for these four fields and created copyfields for them.
>>
>> <field name="all_text" type="text" indexed="true" stored="false" multiValued="true" />
>> <field name="short_text" type="text" indexed="false" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true" />
>>
>> but still it did not work and gave the same error;
>>
>> Caused by: java.lang.RuntimeException: SchemaField: short_text conflicting
>> indexed field options:
>>
>> So, I have disabled term* directives just for testing
>> and successfully indexed the data, but there was no short_text column in the
>> solr index. Maybe duplicating does not work or I have done something wrong.
>>
>> So I guess my only way is to index short_text field as well without
>> duplicating anything.
>>
>> <field name="all_text" type="text" indexed="true" stored="false" multiValued="true" />
>> <field name="short_text" type="text" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true" />
>>
>> Thanks again,
>>
>> Serdar
>>
>> On Sun, May 9, 2010 at 9:24 AM, Lance Norskog <goks...@gmail.com> wrote:
>>
>>> If you want to highlight field X, doing the
>>> termOffsets/termPositions/termVectors will make highlighting that
>>> field faster. You should make a separate field and apply these options
>>> to that field.
>>>
>>> Now: doing a copyfield adds a "value" to a multiValued field. For a
>>> text field, you get a multi-valued text field. You should only copy
>>> one value to the highlighted field, so just copyField the document to
>>> your special field. To enforce this, I would add multiValued="false"
>>> to that field, just to avoid mistakes.
>>>
>>> So, all_text should be indexed without the term* attributes, and
>>> should not be stored. Then your document stored in a separate field
>>> that you use for highlighting and has the term* attributes.
>>>
>>> In general, highlighting has been a problem area all along and there
>>> are little edge cases that I don't know how to solve.
>>>
>>> On Sat, May 8, 2010 at 7:23 AM, Serdar Sahin <anlamar...@gmail.com> wrote:
>>> > Hi,
>>> >
>>> > Thanks a lot for the replies, I could have chance today to test them.
>>> >
>>> > First of all termVectors/termPositions/termOffsets did not help, it has very
>>> > little effect, but I tried a workaroud, however it is not as efficient as I
>>> > thought.
>>> >
>>> > From these fields;
>>> >
>>> > <field name="title" type="text" indexed="true" stored="true" required="true" omitNorms="true"/>
>>> > <field name="description" type="text" indexed="true" stored="true" />
>>> > <field name="tags" type="text" indexed="true" stored="true" omitNorms="true" />
>>> > <field name="plainText" type="text" indexed="true" stored="false"/>
>>> >
>>> > I tried to create copyfield
>>> > <field name="all_text" type="text" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true" />
>>> >
>>> > <copyField source="title" dest="all_text" />
>>> > <copyField source="tags" dest="all_text" />
>>> > <copyField source="plainText" dest="all_text" maxChars="20000"/>
>>> > <copyField source="description" dest="all_text" />
>>> >
>>> > And I have indexed 1000 documents that have more than 200 pages.
>>> >
>>> > However, maxChars directive also limited the character limit for indexed
>>> > field. For the query of "Institute of Information Systems", it gave 12
>>> > results. I also tried to get unique words from bottom of the text files and
>>> > search them, they did not give any result. So, I just wanted to limit size
>>> > of the stored field, but did not work.
>>> > Then I tried to create two copyfields
>>> >
>>> > <field name="all_text" type="text" indexed="true" stored="false" multiValued="true" />
>>> > <field name="short_text" type="text" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true" />
>>> >
>>> > <copyField source="title" dest="all_text" />
>>> > <copyField source="tags" dest="all_text" />
>>> > <copyField source="plainText" dest="all_text" />
>>> > <copyField source="description" dest="all_text" />
>>> >
>>> > <copyField source="title" dest="short_text" />
>>> > <copyField source="tags" dest="short_text" />
>>> > <copyField source="plainText" dest="short_text" maxChars="20000"/>
>>> > <copyField source="description" dest="short_text" />
>>> >
>>> > It gave 168 results, as I expected, and highlighting also worked reasonably
>>> > fast. However, I don't know Solr/Lucene internals but I guess I store the
>>> > same indexed field twice, and it should have effect on the performance and
>>> > storage. I tried to make it false but then it gave this error:
>>> >
>>> > Caused by: java.lang.RuntimeException: SchemaField: all_text conflicting
>>> > indexed field options:
>>> >
>>> > So, it was not possible.
>>> >
>>> > Then I tried hl.useFastVectorHighlighter with the latest version (and yes I
>>> > have turned on termVectors, termPositions, termOffsets) but the result was
>>> > 2-2.5x slower, which was very strange. Do you have any guess why this might
>>> > have happened?
>>> >
>>> > 1. Provide another field for highlighting and use copyField
>>> > to copy plainText to the highlighting field. When using copyField,
>>> > specify maxChars attribute to limit the length of the copy of plainText.
>>> > This should work on Solr 1.4.
>>> >
>>> > Do you mean I need to duplicate these four fields and use one of them for
>>> > storing and other one for indexing?
>>> > Than I guess, I can do;
>>> >
>>> > <field name="all_text" type="text" indexed="true" stored="false" multiValued="true" />
>>> > <field name="short_text" type="text" indexed="false" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true" /> (indexing false)
>>> >
>>> > Is this the only solution? What are the effects on index size and
>>> > performance? Could you give me an advice?
>>> >
>>> > Thanks,
>>> >
>>> > Serdar
>>> >
>>> > On Sat, May 8, 2010 at 1:00 PM, Lance Norskog <goks...@gmail.com> wrote:
>>> >
>>> >> Do you have these options turned on when you index the text field:
>>> >> termVectors/termPositions/termOffsets ?
>>> >>
>>> >> Highlighting needs the information created by these anlysis options.
>>> >> If they are not turned on, Solr has load the document text and run the
>>> >> analyzer again with these options on, uses that data to create the
>>> >> highlighting, then throws away the reanalyzed data. Without these
>>> >> options, you are basically re-indexing the document when you highlight
>>> >> it.
>>> >>
>>> >> http://www.lucidimagination.com/search/out?u=http%3A%2F%2Fwiki.apache.org%2Fsolr%2FFieldOptionsByUseCase
>>> >>
>>> >> On Wed, May 5, 2010 at 5:01 PM, Koji Sekiguchi <k...@r.email.ne.jp> wrote:
>>> >> > (10/05/05 22:08), Serdar Sahin wrote:
>>> >> >>
>>> >> >> Hi,
>>> >> >>
>>> >> >> Currently, there are similar topics active in the mailing list, but it I
>>> >> >> did not want to steal the topic.
>>> >> >>
>>> >> >> I have currently indexed 100.000 documents, they are microsoft office/pdf
>>> >> >> etc documents I convert them to TXT files before indexing. Files are
>>> >> >> between 1-500 pages. When I search something and filter it to retrieve
>>> >> >> documents that has more than 100 pages, and activate highlighting, it
>>> >> >> takes 0.8-3 seconds, depending on the query.
>>> >> >> (10 result per page) If I retrieve
>>> >> >> documents that has 1-5 pages, it drops to 0.1 seconds.
>>> >> >>
>>> >> >> If I disable highlighting, it drops to 0.1-0.2 seconds, even on the large
>>> >> >> documents, which is more than enough. This problem mostly happens where
>>> >> >> there are no caches, on the first query. I use this configuration for
>>> >> >> highlighting:
>>> >> >>
>>> >> >> $query->addHighlightField('description')->addHighlightField('plainText');
>>> >> >> $query->setHighlightSimplePre('<strong>');
>>> >> >> $query->setHighlightSimplePost('</strong>');
>>> >> >> $query->setHighlightHighlightMultiTerm(TRUE);
>>> >> >> $query->setHighlightMaxAnalyzedChars(10000);
>>> >> >> $query->setHighlightSnippets(2);
>>> >> >>
>>> >> >> Do you have any suggestions to improve response time while highlighting is
>>> >> >> active? I have read couple of articles you have previously provided but
>>> >> >> they did not help.
>>> >> >>
>>> >> >> And for the second question, I retrieve these fields:
>>> >> >>
>>> >> >> $query->addField('title')->addField('cat')->addField('thumbs_up')->
>>> >> >> addField('thumbs_down')->addField('lang')->addField('id')->
>>> >> >> addField('username')->addField('view_count')->addField('pages')->
>>> >> >> addField('no_img')->addField('date');
>>> >> >>
>>> >> >> If I can't solve the highlighting problem on large documents, I can simply
>>> >> >> disable it and retrieve first x characters from the plainText (full text)
>>> >> >> field, but is it possible to retrieve first x characters without using the
>>> >> >> highlighting feature?
>>> >> >> When I use this;
>>> >> >> $query->setHighlight(TRUE);
>>> >> >> $query->setHighlightAlternateField('plainText');
>>> >> >> $query->setHighlightMaxAnalyzedChars(0);
>>> >> >> $query->setHighlightMaxAlternateFieldLength(256);
>>> >> >>
>>> >> >> It still takes 2 seconds if I retrieve 10 rows that has 200-300 pages. The
>>> >> >> highlighting still works so it might be the source of the problem, I want
>>> >> >> to completely disable it and retrieve only the first 256 characters of the
>>> >> >> plainText field. Is it possible? It may remove some overhead give better
>>> >> >> performance.
>>> >> >>
>>> >> >> I personally prefer the highlighting solution but I also would like to
>>> >> >> hear the solution for this problem. For the same query, if I disable
>>> >> >> highlighting and without retrieving (but still searching) the plainText
>>> >> >> field, it drops to 0.0094 seconds. So I think if I can get the first 256
>>> >> >> characters without using the highlighting, I will get better performance.
>>> >> >>
>>> >> >> Any suggestions regarding with these two problems will highly appreciated.
>>> >> >>
>>> >> >> Thanks,
>>> >> >>
>>> >> >> Serdar Sahin
>>> >> >>
>>> >> >
>>> >> > Hi Serdar,
>>> >> >
>>> >> > There are a few things I think of you can try.
>>> >> >
>>> >> > 1. Provide another field for highlighting and use copyField
>>> >> > to copy plainText to the highlighting field. When using copyField,
>>> >> > specify maxChars attribute to limit the length of the copy of plainText.
>>> >> > This should work on Solr 1.4.
>>> >> >
>>> >> > 2. If you can use branch_3x version of Solr, try FastVectorHighlighter.
>>> >> >
>>> >> > Koji
>>> >> >
>>> >> > --
>>> >> > http://www.rondhuit.com/en/
>>> >> >
>>> >>
>>> >> --
>>> >> Lance Norskog
>>> >> goks...@gmail.com
>>> >>
>>> >
>>>
>>> --
>>> Lance Norskog
>>> goks...@gmail.com
>>

--
Lance Norskog
goks...@gmail.com