Optimizing a schema

2006-08-08 Thread bo_b

Hello, 

I have tried indexing a vbulletin message board, containing roughly 7
million posts.

My schema is as follows:

   
   
   
   
   
   
   

 postid

   
   

I am trying to figure out whether there is anything I can do to lower the disk
usage and/or increase sorting speed before we go live with the search. A
few questions came to mind:

1) I was planning to sort on the date field (i.e. add "; date desc").
But I was wondering whether it would be more efficient to sort on postid
instead (since a higher postid in vbulletin means a newer post). I already have
indexed=true for postid since it's our unique field, so I could then set
indexed=false for date and perhaps save some storage space?

2) If we sort on postid instead, would we need to use integer, or the sint
type? I assume sint would be faster(?) but perhaps use more storage?

3) About omitNorms=true: I must admit I don't exactly understand what it does
:) But I read that it would save 1 byte per document. Are there any other
fields I need to add it to in my schema? As far as I understand,
omitNorms=true only makes a difference for indexed=true fields, and doesn't
do anything for int fields?

Thanks in advance for any suggestions :)

/Bo
-- 
View this message in context: 
http://www.nabble.com/Optimizing-a-schema-tf2071403.html#a5702635
Sent from the Solr - User forum at Nabble.com.



Incremental updates/Sorting problem

2006-08-08 Thread bo_b

Hello,

As mentioned in another post, I am trying to index a vbulletin database
containing roughly 7 million posts. The very first query where I apply
sorting after a full indexing seems to take roughly 264998
ms. Subsequent searches are fast.

I figure the reason is, as Chris explained here
(http://www.mail-archive.com/solr-user@lucene.apache.org/msg00457.html),
that

"Sorting on a field requires building a FieldCache for every document --
regardless of how many documents match your query.  This cache is reused
for all searches that sort on that field."

However, my problem is that I would like to be able to incrementally add new
postings to the index as they occur. And it appears that if I add just 1
post and do a commit, solr/lucene rebuilds FieldCaches for the entire
index, not just the newly added posts, thus rendering my index unsearchable
for roughly the next 264 seconds (at least for sorting queries).

Is there any solution to this problem? I would like to be able to sort, but
we can't live with 264 seconds of downtime after every commit.
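
To make the cost in the quote above concrete, here is a toy model of a sort cache. It is only a sketch and not Lucene's FieldCache code: the function names and the sample documents are made up for illustration. The point it shows is that the cache has one slot per document in the whole index, while a sorted search only touches the slots of the matching documents.

```python
# Toy model of a per-field sort cache (illustration only, not Lucene code).

def build_sort_cache(index_docs, field):
    """Return a list where position = document number and value = sort key.

    The loop visits every document in the index, which is why the build
    cost depends on index size rather than on how many documents match.
    """
    cache = [None] * len(index_docs)
    for doc_num, doc in enumerate(index_docs):
        cache[doc_num] = doc[field]
    return cache


def search_sorted(matching_doc_nums, cache, descending=True):
    """Sort only the matching documents, using the prebuilt cache."""
    return sorted(matching_doc_nums, key=lambda d: cache[d], reverse=descending)


# Hypothetical three-document index.
index_docs = [
    {"postid": 10, "date": 20060801},
    {"postid": 11, "date": 20060803},
    {"postid": 12, "date": 20060802},
]
date_cache = build_sort_cache(index_docs, "date")   # expensive, done once
newest_first = search_sorted([0, 2], date_cache)    # cheap, done per query
```

In this model, anything that invalidates the cache forces the full loop over all documents again, which matches the behaviour described above after a commit.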

/Bo
-- 
View this message in context: 
http://www.nabble.com/Incremental-updates-Sorting-problem-tf2071518.html#a5702953



Re: Incremental updates/Sorting problem

2006-08-08 Thread karl wettin
On Tue, 2006-08-08 at 02:14 -0700, bo_b wrote:
> Is there any solution to this problem? I would like to be able to sort, but
> we can't live with 264 seconds of downtime after every commit.

There have been many long threads in the Lucene-users forum on this
subject. Try searching for "sorting" in the subject. I personally suggest a
List where the index is the document number and the value is the global
order, set by iterating TermEnum and TermDocs at index time.
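
A rough sketch of that idea, under stated assumptions: the dictionary below stands in for walking the terms of a field in sorted order and, for each term, the documents that contain it. The function name and the tie handling (documents sharing a term get the same rank) are my own choices for illustration, not something specified in this thread.

```python
# Sketch of a "document number -> global order" list (illustration only).

def build_global_order(term_to_docs, num_docs):
    """term_to_docs maps each term of the sort field to its document numbers."""
    order = [-1] * num_docs                  # -1 marks a document without a value
    for rank, term in enumerate(sorted(term_to_docs)):  # terms in sorted order
        for doc_num in term_to_docs[term]:              # documents for that term
            order[doc_num] = rank
    return order


# Hypothetical date terms for a four-document index.
term_to_docs = {"20060803": [1], "20060801": [0, 3], "20060802": [2]}
global_order = build_global_order(term_to_docs, 4)
```

Sorting a result set then means comparing small integers from `global_order` instead of the original field values.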

But many people think this is a bad solution. So read threads to catch
up on the alternatives.



Re: Incremental updates/Sorting problem

2006-08-08 Thread Yonik Seeley

On 8/8/06, bo_b <[EMAIL PROTECTED]> wrote:

> As mentioned in another post, I am trying to index a vbulletin database
> containing roughly 7 million posts. The very first query where I apply
> sorting after a full indexing seems to take roughly 264998
> ms. Subsequent searches are fast.
>
> I figure the reason is, as Chris explained here
> (http://www.mail-archive.com/solr-user@lucene.apache.org/msg00457.html),
> that
>
> "Sorting on a field requires building a FieldCache for every document --
> regardless of how many documents match your query.  This cache is reused
> for all searches that sort on that field."
>
> However, my problem is that I would like to be able to incrementally add new
> postings to the index as they occur. And it appears that if I add just 1
> post and do a commit, solr/lucene rebuilds FieldCaches for the entire
> index, not just the newly added posts, thus rendering my index unsearchable
> for roughly the next 264 seconds (at least for sorting queries).


Warming (either normal or auto-warming) will solve the problem of the
long first search.
Warming is done in the background, so no "real" live queries will see
that long delay.
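
The warming idea can be sketched roughly as follows. This is a simplified model with made-up class names, not Solr's implementation: here the warming query runs inline inside `commit`, whereas in Solr it runs in the background while the old searcher keeps answering live queries. What the sketch does show is the ordering: the new searcher pays the cache-building cost before it is swapped in.

```python
# Simplified model of searcher warming (illustration only, not Solr code).

class Searcher:
    def __init__(self, docs):
        self.docs = docs
        self.sort_cache = None          # built on the first sorted search

    def sorted_search(self, field):
        if self.sort_cache is None:     # the expensive first-use step
            self.sort_cache = [doc[field] for doc in self.docs]
        doc_nums = range(len(self.docs))
        return sorted(doc_nums, key=lambda d: self.sort_cache[d], reverse=True)


class SearcherHolder:
    def __init__(self, docs):
        self.live = Searcher(docs)

    def commit(self, docs, warm_field):
        new_searcher = Searcher(docs)
        new_searcher.sorted_search(warm_field)  # warming query fills the cache
        self.live = new_searcher                # swap only after warming is done


holder = SearcherHolder([{"date": 1}, {"date": 3}])
holder.commit([{"date": 1}, {"date": 3}, {"date": 2}], "date")
first_live_result = holder.live.sorted_search("date")   # cache already built
```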

That said, 264 seconds is a *long* time to build a FieldCache entry,
even for 6M documents.  Make sure that you have enough heap size and
that running out of memory isn't causing the GC to hog the CPU.

That doesn't solve the commit-after-every-add problem though.
That's not the type of thing that Lucene (and Solr) are optimized for.
Most search collections can tolerate a few minutes of lag until new
documents become searchable.

-Yonik


Re: Optimizing a schema

2006-08-08 Thread Yonik Seeley

On 8/8/06, bo_b <[EMAIL PROTECTED]> wrote:

> I have tried indexing a vbulletin message board, containing roughly 7
> million posts.
>
> My schema is as follows:
>
>  postid
>
> I am trying to figure out whether there is anything I can do to lower the disk
> usage and/or increase sorting speed before we go live with the search. A
> few questions came to mind:
>
> 1) I was planning to sort on the date field (i.e. add "; date desc").
> But I was wondering whether it would be more efficient to sort on postid
> instead (since a higher postid in vbulletin means a newer post).


No, they will be roughly the same speed.
What you *could* try to do is always *index* documents in postid/date
order... then sorting would not require any FieldCache entry.  It
would require a minor change to Solr (allow sorting on lucene internal
docid, which matches the order that documents are added to an index).
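
A small sketch of why that works, assuming documents are only ever appended in postid order. The sample posts and match list are made up. If the internal document number is simply the position at which a document was added, then ordering matches by document number already orders them by postid, and no per-field cache is needed.

```python
# Illustration only: internal doc number = position in insertion order.

# Hypothetical posts, added to the index in ascending postid order.
posts = [{"postid": 101}, {"postid": 102}, {"postid": 105}, {"postid": 107}]

# Document numbers returned by some query, in arbitrary order.
matches = [3, 0, 2]

# Newest first: sort on the document number itself, no field values consulted.
newest_first_docs = sorted(matches, reverse=True)
newest_first_postids = [posts[d]["postid"] for d in newest_first_docs]
```

One caveat under the same assumption: a post that is edited and re-added would get a new, higher document number, so it would sort as if it were new.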


> 2) If we sort on postid instead, would we need to use integer, or the sint
> type? I assume sint would be faster(?) but perhaps use more storage?


If you need range queries, SortableIntField  values are ordered
correctly for them to work.
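
To illustrate why ordering matters for range queries: plain decimal strings do not sort in numeric order, so a range over the raw terms would be wrong. The encoding below is a made-up one for illustration (shift into the non-negative range, then zero-pad to a fixed width); it is not the encoding SortableIntField actually uses, only an example of an order-preserving one.

```python
# Illustration of an order-preserving string encoding for 32-bit ints.
# NOT Solr's real sint encoding; the scheme here is an assumption.

def to_sortable_str(value):
    """Map a signed 32-bit int to a fixed-width string whose string order
    matches numeric order."""
    if not -2**31 <= value < 2**31:
        raise ValueError("expected a 32-bit signed integer")
    return format(value + 2**31, "010d")    # 10 digits covers 0 .. 2**32 - 1


values = [9, 10, -5, 100]
plain_sorted = sorted(values, key=str)                 # string order of "9", "10", ...
encoded_sorted = sorted(values, key=to_sortable_str)   # matches numeric order
```

With the plain strings, "10" sorts before "9"; with the encoded strings, the term order and the numeric order agree, which is what a term-based range query relies on.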
For sorting, both int and sint fields work... the difference is in how
the FieldCache entry is built.
 For IntField, an Integer.parseInt(str) needs to be done for each distinct str.
 SortableIntField is sorted like strings... the ordinal (order in the
index) is recorded for each distinct value.

 So sint will build the FieldCache faster, but the string values will
cause the entry to be larger.  After the FieldCache entry is built,
both int and sint should be comparable in speed.
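
The two build strategies described above can be sketched like this. It is a simplified model with hypothetical function names, and the fixed-width sint terms are a made-up encoding: the int-style cache parses each distinct term once and stores a number per document, while the sint-style cache stores each document's ordinal plus a lookup table of the distinct strings, which is where the extra memory goes.

```python
# Simplified model of the two cache-building strategies (not Lucene code).

def build_int_cache(term_to_docs, num_docs):
    """int-style: parse every distinct term, store the parsed value per doc."""
    cache = [0] * num_docs
    for term, doc_nums in term_to_docs.items():
        value = int(term)                 # one parse per distinct term
        for doc_num in doc_nums:
            cache[doc_num] = value
    return cache


def build_ordinal_cache(term_to_docs, num_docs):
    """sint-style: no parsing, store the term's ordinal per doc and keep
    the distinct strings in a lookup table."""
    lookup = sorted(term_to_docs)         # terms in index (string) order
    order = [0] * num_docs
    for ordinal, term in enumerate(lookup):
        for doc_num in term_to_docs[term]:
            order[doc_num] = ordinal
    return order, lookup


# Hypothetical three-document index, once with plain and once with
# fixed-width (order-preserving) terms.
int_cache = build_int_cache({"42": [0, 2], "7": [1]}, 3)
order, lookup = build_ordinal_cache(
    {"0000000042": [0, 2], "0000000007": [1]}, 3)
```

Once either structure exists, comparing two documents is an integer comparison in both cases, which fits the remark that the two are comparable in speed after the build.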


> 3) About omitNorms=true: I must admit I don't exactly understand what it does
> :) But I read that it would save 1 byte per document.


One byte per document for that indexed field, regardless of whether the
field exists for all documents or not.  You lose length normalization
(an increase in score for matches on shorter fields... not needed if
it's not a full-text field anyway), and you lose index-time boosts
(which it doesn't look like you are using).
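
As a rough illustration of length normalization: to my knowledge Lucene's default similarity uses 1/sqrt(number of terms in the field) as the length factor, and the sketch below assumes that formula. The field lengths are made up, and the real value is also squeezed into a single byte, so it is less precise than what is computed here.

```python
import math

# Assumed default length normalization: 1 / sqrt(number of terms in the field).

def length_norm(num_terms):
    return 1.0 / math.sqrt(num_terms)


short_post_factor = length_norm(4)     # a 4-term field
long_post_factor = length_norm(100)    # a 100-term field

# With norms omitted, every document effectively gets the same factor,
# so a match in a short field no longer scores higher than one in a long field.
```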

Since "blob" looks like the body of the post, I think you probably
*want* norms to get the length normalization factors.  Probably all
other indexed fields can have omitNorms="true" (including postid)


> Are there any other
> fields I need to add it to in my schema? As far as I understand,
> omitNorms=true only makes a difference for indexed=true fields, and doesn't
> do anything for int fields?


omitNorms=true will omit norms for *any* indexed field, including int
fields.  Deep inside Lucene, all indexed fields are string fields.
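
A back-of-the-envelope estimate of what that means for this index. The 7 million documents come from the thread; the number of indexed fields is a hypothetical figure, since the schema is not shown here. At one byte per document per indexed field with norms, omitting norms on all but one field saves:

```python
# Rough norms-size estimate: 1 byte per document per indexed field with norms.

NUM_DOCS = 7_000_000          # "roughly 7 million posts" from the thread


def norms_bytes(num_docs, fields_with_norms):
    return num_docs * fields_with_norms


# Hypothetical schema: 6 indexed fields, norms kept only on the post body.
before = norms_bytes(NUM_DOCS, 6)
after = norms_bytes(NUM_DOCS, 1)
saved_bytes = before - after
saved_mib = saved_bytes / (1024 * 1024)   # roughly 33 MiB in this scenario
```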

-Yonik


Re: Optimizing a schema

2006-08-08 Thread bo_b


Yonik Seeley wrote:
> 
> No, they will be roughly the same speed.
> What you *could* try to do is always *index* documents in postid/date
> order... then sorting would not require any FieldCache entry.  It
> would require a minor change to Solr (allow sorting on lucene internal
> docid, which matches the order that documents are added to an index).
> 
OK, I will look into that; it would be nice to avoid the delay when building
fieldcaches after a commit.


Yonik Seeley wrote:
> 
> If you need range queries, SortableIntField  values are ordered
> correctly for them to work.
> For sorting, both int and sint fields work... the difference is in how
> the FieldCache entry is built.
>   For IntField, an Integer.parseInt(str) needs to be done for each
> distinct str.
>   SortableIntField is sorted like strings... the ordinal (order in the
> index) is recorded for each distinct value.
> 
>   So sint will build the FieldCache faster, but the string values will
> cause the entry to be larger.  After the FieldCache entry is built,
> both int and sint should be comparable in speed.
> 

I will test it and see what works best. I think we would prefer being
able to build the fieldcaches faster.

Thanks for all the helpful explanations/hints :)
-- 
View this message in context: 
http://www.nabble.com/Optimizing-a-schema-tf2071403.html#a5708887