Optimizing a schema
Hello,

I have tried indexing a vBulletin message board containing roughly 7 million posts. My schema is as follows:

postid

I am trying to figure out if there is anything I can do to lower the disk usage and/or increase sorting speed before we go live with the search. So a few questions came to mind:

1) Sorting: I was planning to do it on the date field (i.e. add "; date desc"). But I was wondering if it would be more efficient to sort on postid instead (since a higher postid in vBulletin means a newer post). I already have indexed=true for postid since it's our unique field, but then I could set indexed=false for date, and perhaps save some storage space?

2) If we sort on postid instead, would we need to use the integer or the sint type? I assume sint would be faster(?) but perhaps use more storage?

3) About omitNorms=true, I must admit I don't exactly understand what it does :) But I read that it would save 1 byte per document. Are there any other fields I need to add it to in my schema? As far as I understand, omitNorms=true only makes a difference for indexed=true fields, and doesn't do anything for int fields?

Thanks in advance for any suggestions :)

/Bo

--
View this message in context: http://www.nabble.com/Optimizing-a-schema-tf2071403.html#a5702635
Sent from the Solr - User forum at Nabble.com.
Incremental updates/Sorting problem
Hello,

As mentioned in another post, I am trying to index a vBulletin database containing roughly 7 million posts.

The very first query where I apply sorting after a full indexing takes roughly 264998 ms. Subsequent searches are fast.

I figure the reason is, as Chris explained here (http://www.mail-archive.com/solr-user@lucene.apache.org/msg00457.html), that "Sorting on a field requires building a FieldCache for every document -- regardless of how many documents match your query. This cache is reused for all searches that sort on that field."

However, my problem is that I would like to be able to incrementally add new postings to the index as they occur. And it appears that if I add just 1 post and do a commit, Solr/Lucene rebuilds the FieldCaches for the entire index, not just for the newly added posts, thus rendering my index unsearchable for roughly the next 264 seconds (at least for sorting queries).

Is there any solution to this problem? I would like to be able to sort, but we can't live with 264-second downtime after every commit.

/Bo

--
View this message in context: http://www.nabble.com/Incremental-updates-Sorting-problem-tf2071518.html#a5702953
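A minimal sketch of the behaviour described in this post, written in Python rather than Lucene's Java. `ToyIndex`, `search_sorted` and `reopen_with` are hypothetical names made up for illustration, not Solr or Lucene APIs. It only assumes what the quoted explanation says: the sort cache covers every document, is built on the first sorted query, and belongs to one opened view of the index.

```python
from typing import Dict, List


class ToyIndex:
    """Hypothetical stand-in for one opened view of an index."""

    def __init__(self, dates: List[int]) -> None:
        self.dates = dates        # list position plays the role of the document number
        self.cache_builds = 0     # counts how often the expensive full pass runs
        self._sort_cache: Dict[str, List[int]] = {}

    def sort_values(self, field: str) -> List[int]:
        # Built lazily on the first sorted query, then reused by later ones.
        if field not in self._sort_cache:
            self.cache_builds += 1
            # The pass touches every document, not only the ones that match.
            self._sort_cache[field] = list(self.dates)
        return self._sort_cache[field]

    def search_sorted(self, matching_docs: List[int], field: str = "date") -> List[int]:
        values = self.sort_values(field)
        return sorted(matching_docs, key=lambda doc: values[doc], reverse=True)


def reopen_with(index: ToyIndex, new_date: int) -> ToyIndex:
    # Assumption of this sketch: a commit yields a fresh view whose cache starts empty,
    # so one added post triggers a full rebuild on the next sorted query.
    return ToyIndex(index.dates + [new_date])
```

In this toy model the cost of the first sorted query depends on the total number of documents, while later sorted queries on the same view only pay for the documents that match.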
Re: Incremental updates/Sorting problem
On Tue, 2006-08-08 at 02:14 -0700, bo_b wrote:
> Is there any solution to this problem? I would like to be able to sort, but
> we can't live with 264-second downtime after every commit.

There have been many long threads on the Lucene-users forum on this subject. Try searching for "sorting" in the subject.

I personally suggest a List where the index is the document number and the value is the global order, set by iterating TermEnum and TermDocs at index time. But many people think this is a bad solution, so read the threads to catch up on the alternatives.
Re: Incremental updates/Sorting problem
On 8/8/06, bo_b <[EMAIL PROTECTED]> wrote:
> As mentioned in another post, I am trying to index a vBulletin database containing roughly 7 million posts.
>
> The very first query where I apply sorting after a full indexing takes roughly 264998 ms. Subsequent searches are fast.
>
> I figure the reason is, as Chris explained here (http://www.mail-archive.com/solr-user@lucene.apache.org/msg00457.html), that "Sorting on a field requires building a FieldCache for every document -- regardless of how many documents match your query. This cache is reused for all searches that sort on that field."
>
> However, my problem is that I would like to be able to incrementally add new postings to the index as they occur. And it appears that if I add just 1 post and do a commit, Solr/Lucene rebuilds the FieldCaches for the entire index, not just for the newly added posts, thus rendering my index unsearchable for roughly the next 264 seconds (at least for sorting queries).

Warming (either normal or auto-warming) will solve the problem of the long first search. Warming is done in the background, so no "real" live queries will see that long delay.

That said, 264 seconds is a *long* time to build a FieldCache entry, even for 6M documents. Make sure that you have enough heap and that running out of memory isn't causing the GC to hog the CPU.

That doesn't solve the after-every-commit problem though. That's not the type of thing that Lucene (and Solr) are optimized for. Most search collections can tolerate a few minutes of lag until new documents become searchable.

-Yonik
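The warming idea in this reply can be sketched roughly as follows. This is a toy model in Python, not Solr's implementation: `ToySearcher` and `SearcherHolder` are hypothetical names, and the warm-up runs sequentially here, whereas the reply says Solr does it in the background.

```python
from typing import List, Optional


class ToySearcher:
    """Hypothetical searcher over a fixed snapshot of per-document dates."""

    def __init__(self, dates: List[int]) -> None:
        self.dates = dates
        self.sort_cache: Optional[List[int]] = None
        self.cold_sorts = 0  # sorted queries that had to build the cache themselves

    def warm(self) -> None:
        # The expensive full pass over every document, done ahead of any live query.
        self.sort_cache = list(self.dates)

    def search_sorted(self, matching_docs: List[int]) -> List[int]:
        if self.sort_cache is None:
            self.cold_sorts += 1
            self.warm()
        cache = self.sort_cache
        return sorted(matching_docs, key=lambda doc: cache[doc], reverse=True)


class SearcherHolder:
    """Keeps serving from the old searcher until the new one has been warmed."""

    def __init__(self, dates: List[int]) -> None:
        self.live = ToySearcher(dates)
        self.live.warm()

    def commit(self, dates: List[int]) -> None:
        candidate = ToySearcher(dates)
        candidate.warm()       # self.live would still answer queries during this step
        self.live = candidate  # swap only once the sort cache exists
```

The trade-off matches the reply: no live query pays for the cache build, but a newly added post only becomes searchable once the warmed searcher has been swapped in.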
Re: Optimizing a schema
On 8/8/06, bo_b <[EMAIL PROTECTED]> wrote:
> I have tried indexing a vBulletin message board containing roughly 7 million posts. My schema is as follows: postid
>
> I am trying to figure out if there is anything I can do to lower the disk usage and/or increase sorting speed before we go live with the search. So a few questions came to mind:
>
> 1) Sorting: I was planning to do it on the date field (i.e. add "; date desc"). But I was wondering if it would be more efficient to sort on postid instead (since a higher postid in vBulletin means a newer post).

No, they will be roughly the same speed.
What you *could* try to do is always *index* documents in postid/date order... then sorting would not require any FieldCache entry. It would require a minor change to Solr (allow sorting on the Lucene internal docid, which matches the order in which documents are added to an index).

> 2) If we sort on postid instead, would we need to use the integer or the sint type? I assume sint would be faster(?) but perhaps use more storage?

If you need range queries, SortableIntField values are ordered correctly for them to work.
For sorting, both int and sint fields work... the difference is in how the FieldCache entry is built.
For IntField, an Integer.parseInt(str) needs to be done for each distinct str.
SortableIntField is sorted like strings... the ordinal (order in the index) is recorded for each distinct value.

So sint will build the FieldCache faster, but the string values will cause the entry to be larger. After the FieldCache entry is built, both int and sint should be comparable in speed.

> 3) About omitNorms=true, I must admit I don't exactly understand what it does :) But I read that it would save 1 byte per document.

One byte per document for that indexed field, regardless of whether the field exists for all documents or not. You lose length normalization (an increase in score for matches on shorter fields... not needed if it's not a full-text field anyway), and you lose index-time boosts (which it doesn't look like you are using).

Since "blob" looks like the body of the post, I think you probably *want* norms to get the length normalization factors. Probably all other indexed fields can have omitNorms="true" (including postid).

> Are there any other fields I need to add it to in my schema? As far as I understand, omitNorms=true only makes a difference for indexed=true fields, and doesn't do anything for int fields?

omitNorms=true will omit norms for *any* indexed field, including int fields. Deep inside Lucene, all indexed fields are string fields.

-Yonik
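To make the int versus sint distinction above concrete, here is a rough Python sketch. The encoding in `to_sortable` is a hypothetical one chosen for readability (shift into the non-negative range, then zero-pad); it is not Solr's actual SortableIntField format. The two `build_*` functions are likewise illustrative, not Lucene's FieldCache code.

```python
from typing import Dict, List

OFFSET = 2 ** 31  # shifts the signed 32-bit range so every value is non-negative


def to_sortable(value: int) -> str:
    """Hypothetical encoding whose string order matches numeric order."""
    return f"{value + OFFSET:010d}"


def build_int_cache(doc_terms: List[str]) -> List[int]:
    # int-style field: every distinct indexed string has to be parsed once.
    parsed: Dict[str, int] = {term: int(term) for term in set(doc_terms)}
    return [parsed[term] for term in doc_terms]


def build_ordinal_cache(doc_terms: List[str]) -> List[int]:
    # sint-style field: terms already sort correctly as strings, so it is enough
    # to record each distinct term's position (ordinal) in sorted term order.
    ordinals = {term: i for i, term in enumerate(sorted(set(doc_terms)))}
    return [ordinals[term] for term in doc_terms]
```

Plain decimal strings sort "10" before "9", which is why they cannot be compared as strings, while the encoded form can. The sketch keeps only the ordinals; a cache that also retains the distinct strings would be larger, which is consistent with the size remark in the reply.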
Re: Optimizing a schema
Yonik Seeley wrote:
>
> No, they will be roughly the same speed.
> What you *could* try to do is always *index* documents in postid/date
> order... then sorting would not require any FieldCache entry. It
> would require a minor change to Solr (allow sorting on lucene internal
> docid, which matches the order that documents are added to an index).
>

OK, I will look into that; it would be nice to avoid the delay when building FieldCaches after a commit.

Yonik Seeley wrote:
>
> If you need range queries, SortableIntField values are ordered
> correctly for them to work.
> For sorting, both int and sint fields work... the difference is in how
> the FieldCache entry is built.
> For IntField, an Integer.parseInt(str) needs to be done for each
> distinct str.
> SortableIntField is sorted like strings... the ordinal (order in the
> index) is recorded for each distinct value.
>
> So sint will build the FieldCache faster, but the string values will
> cause the entry to be larger. After the FieldCache entry is built,
> both int and sint should be comparable in speed.
>

I will test it and see what works best; I think we would prefer being able to build the FieldCaches faster.

Thanks for all the helpful explanations/hints :)

--
View this message in context: http://www.nabble.com/Optimizing-a-schema-tf2071403.html#a5708887
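The index-order suggestion quoted above can be illustrated with a small Python sketch. `newest_first` is a hypothetical helper, and the sketch assumes posts are always added in ascending postid order, so that a larger internal document number implies a newer post.

```python
from typing import List


def newest_first(matching_doc_ids: List[int], limit: int) -> List[int]:
    """Order hits by internal document number, highest (most recently added) first.

    No per-field cache is consulted: the document number itself is the sort key.
    Only valid under the assumption that documents were added in postid order.
    """
    return sorted(matching_doc_ids, reverse=True)[:limit]
```

One caveat to keep in mind when trying this: if an edited post is deleted and re-added, it would receive a new, higher document number and would then sort as if it were the newest post.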