Re: Configuration and specs to index a 1 terabyte (TB) repository

2013-10-30 Thread Walter Underwood
A flat distribution of queries is a poor test. Real queries have a Zipf distribution. The flat distribution will get almost no benefit from caching, so it will give too low a number and stress disk I/O too much. The 99th percentile is probably the same for both distributions, because that is domi
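
To illustrate the caching point above, here is a minimal sketch that replays a flat and a Zipf-like query stream against a small LRU cache and compares hit rates. Everything in it is an assumption made for illustration: the function names, the cache size, the number of distinct queries, and the Zipf exponent of 1.0 are hypothetical, and the LRU cache is a stand-in, not a model of Solr's actual caches.

```python
import random
from collections import OrderedDict
from itertools import accumulate


def make_queries(n_distinct, n_queries, zipf_exponent, seed=0):
    """Draw query ids with weight 1/rank**zipf_exponent (exponent 0 gives a flat distribution)."""
    rng = random.Random(seed)
    weights = [1.0 / (rank ** zipf_exponent) for rank in range(1, n_distinct + 1)]
    return rng.choices(range(n_distinct), cum_weights=list(accumulate(weights)), k=n_queries)


def hit_rate(queries, cache_size):
    """Replay the queries against an LRU cache and return the fraction served from cache."""
    cache = OrderedDict()
    hits = 0
    for q in queries:
        if q in cache:
            hits += 1
            cache.move_to_end(q)           # mark as most recently used
        else:
            cache[q] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict the least recently used entry
    return hits / len(queries)


# Hypothetical workload: 10,000 distinct queries, 50,000 requests, cache holds 500 entries.
flat_rate = hit_rate(make_queries(10_000, 50_000, zipf_exponent=0.0), cache_size=500)
zipf_rate = hit_rate(make_queries(10_000, 50_000, zipf_exponent=1.0), cache_size=500)
print(f"flat hit rate: {flat_rate:.2f}, zipf hit rate: {zipf_rate:.2f}")
```

With these assumed parameters the flat stream rarely hits the cache while the skewed stream hits it far more often, which is why a flat benchmark understates throughput and overstates disk load.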

Re: Configuration and specs to index a 1 terabyte (TB) repository

2013-10-30 Thread Toke Eskildsen
On Wed, 2013-10-30 at 14:24 +0100, Shawn Heisey wrote:
> On 10/30/2013 4:00 AM, Toke Eskildsen wrote:
> > Why would TRIM have any influence on whether or not a driver failure
> > also means server failure?
>
> I left out a step in my description.
>
> Lack of TRIM support in RAID means that I woul

Re: Configuration and specs to index a 1 terabyte (TB) repository

2013-10-30 Thread Shawn Heisey
On 10/30/2013 4:00 AM, Toke Eskildsen wrote:
> On Tue, 2013-10-29 at 16:41 +0100, Shawn Heisey wrote:
>> If you put the index on SSD, you could get by with less RAM, but a RAID
>> solution that works properly with SSD (TRIM support) is hard to find, so
>> SSD failure in most situations effectively

Re: Configuration and specs to index a 1 terabyte (TB) repository

2013-10-30 Thread eShard
Wow again! Thank you all very much for your insights. We will certainly take all of this under consideration. Erick: I want to upgrade but unfortunately, it's not up to me. You're right, we definitely need to do it. And SolrJ sounds interesting, thanks for the suggestions. By the way, is ther

Re: Configuration and specs to index a 1 terabyte (TB) repository

2013-10-30 Thread Toke Eskildsen
On Tue, 2013-10-29 at 16:41 +0100, Shawn Heisey wrote:
> If you put the index on SSD, you could get by with less RAM, but a RAID
> solution that works properly with SSD (TRIM support) is hard to find, so
> SSD failure in most situations effectively means a server failure. Solr
> and Lucene have a

Re: Configuration and specs to index a 1 terabyte (TB) repository

2013-10-30 Thread Toke Eskildsen
On Tue, 2013-10-29 at 14:24 +0100, eShard wrote:
> I have a 1 TB repository with approximately 500,000 documents (that will
> probably grow from there) that needs to be indexed.

As Shawn points out, that isn't telling us much. If you describe the documents, how and how often you index and how you

Re: Configuration and specs to index a 1 terabyte (TB) repository

2013-10-29 Thread Erick Erickson
In addition to Shawn's comments...

bq: we're close to beta release, so I can't upgrade right now

WHOA! You say you're close to release but you haven't successfully crawled the data even once? Upgrading to 4.5.1 is a trivial risk compared to that statement! This is setting itself up for a real

Re: Configuration and specs to index a 1 terabyte (TB) repository

2013-10-29 Thread Shawn Heisey
On 10/29/2013 10:44 AM, eShard wrote:
> Offhand, how do I control how much of the index is held in RAM? Can you point me in the right direction?

This is automatically handled by the operating system. For quite some time, Solr (Lucene) has by default used the MMap functionality provided by all
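
As a rough illustration of what memory mapping means here, the sketch below maps a file read-only with Python's standard `mmap` module and reads a slice from it. This is not Lucene or Solr code; the helper name `read_slice_via_mmap` and the temporary file standing in for an index segment are hypothetical. The point it tries to show is that the program only asks for bytes at an offset, while the operating system decides which pages of the file stay in RAM.

```python
import mmap
import os
import tempfile


def read_slice_via_mmap(path, offset, length):
    """Map the whole file read-only and return `length` bytes starting at `offset`.

    The process never loads the file explicitly; the OS pages data in on access
    and keeps (or evicts) those pages in its own cache as memory allows.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return mapped[offset:offset + length]


# Hypothetical stand-in for an index file: padding, a marker, more padding.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 4096 + b"index-data" + b"y" * 4096)
    path = tmp.name

chunk = read_slice_via_mmap(path, offset=4096, length=10)
os.remove(path)
print(chunk)
```

Nothing in the code sets how much of the file is held in memory, which mirrors the answer above: with a memory-mapped index, the amount cached depends on how much RAM the OS has free for its disk cache.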

Re: Configuration and specs to index a 1 terabyte (TB) repository

2013-10-29 Thread eShard
P.S. Offhand, how do I control how much of the index is held in RAM? Can you point me in the right direction? Thanks, -- View this message in context: http://lucene.472066.n3.nabble.com/Configuration-and-specs-to-index-a-1-terabyte-TB-repository-tp4098227p4098260.html Sent from the Solr - User

Re: Configuration and specs to index a 1 terabyte (TB) repository

2013-10-29 Thread eShard
Wow, thanks for your response. You raise a lot of great questions; I wish I had the answers! We're still trying to get enough resources to finish crawling the repository, so I don't even know what the final size of the index will be. I've thought about excluding the videos and other large files and

Re: Configuration and specs to index a 1 terabyte (TB) repository

2013-10-29 Thread Shawn Heisey
On 10/29/2013 7:24 AM, eShard wrote:
> Good morning,
> I have a 1 TB repository with approximately 500,000 documents (that will
> probably grow from there) that needs to be indexed.
> I'm limited to Solr 4.0 final (we're close to beta release, so I can't
> upgrade right now) and I can't use SolrC