Re: Solr feasibility with terabyte-scale data

2008-05-11 Thread Marcus Herou
…(must not be SPOF!) for shard management. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message - From: marcusherou <[EMAIL PROTECTED]> To…

Re: Solr feasibility with terabyte-scale data

2008-05-10 Thread Marcus Herou
…Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message - From: marcusherou <[EMAIL PROTECTED]> To: solr-user@lucene.apache.org Sent: Friday, May 9, 2008 2:37:19 AM Subject: Re: Solr feasibility with…

Re: Solr feasibility with terabyte-scale data

2008-05-10 Thread Marcus Herou
Thanks Ken. I will take a look, be sure of that :) Kindly //Marcus On Fri, May 9, 2008 at 10:26 PM, Ken Krugler <[EMAIL PROTECTED]> wrote: Hi Marcus, It seems a lot of what you're describing is really similar to MapReduce, so I think Otis' suggestion to look at Hadoop is a good one: i…

Re: Solr feasibility with terabyte-scale data

2008-05-09 Thread Otis Gospodnetic
…let's see what happens there! :) Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message - From: Ken Krugler <[EMAIL PROTECTED]> To: solr-user@lucene.apache.org Sent: Friday, May 9, 2008 5:37:19 PM Subject: Re: Solr feasibili…

Re: Solr feasibility with terabyte-scale data

2008-05-09 Thread Ken Krugler
…replicating core Solr (or Nutch) functionality, then it sucks. Not sure what the outcome will be. -- Ken - Original Message - From: Ken Krugler <[EMAIL PROTECTED]> To: solr-user@lucene.apache.org Sent: Friday, May 9, 2008 4:26:19 PM Subject: Re: Solr feasibility with terabyte-scale…

Re: Solr feasibility with terabyte-scale data

2008-05-09 Thread Otis Gospodnetic
…Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message - From: Ken Krugler <[EMAIL PROTECTED]> To: solr-user@lucene.apache.org Sent: Friday, May 9, 2008 4:26:19 PM Subject: Re: Solr feasibility with terabyte-scale data Hi Marcus, I…

RE: Solr feasibility with terabyte-scale data

2008-05-09 Thread Lance Norskog
A useful schema trick: MD5 or SHA-1 ids. We generate our unique ID with the MD5 cryptographic checksumming algorithm. This takes X bytes of data and creates a 128-bit-long "random" number, or 128 "random" bits. At this point there are no reports of two different datasets that give the same checksum…
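The hash-as-ID trick above can be sketched in a few lines of Python. This is an illustrative sketch only; the `doc_id` function name and the choice of hex encoding are mine, not from the thread:

```python
import hashlib

def doc_id(content: bytes) -> str:
    """Derive a stable 128-bit document ID from the raw content.

    Identical bytes always map to the same ID, so resubmitting a
    document overwrites the old copy instead of duplicating it.
    """
    # MD5 yields 128 bits; hexdigest encodes them as 32 hex characters.
    return hashlib.md5(content).hexdigest()

# The same bytes always yield the same ID; different bytes (almost
# certainly) yield different IDs.
a = doc_id(b"the quick brown fox")
b = doc_id(b"the quick brown fox")
assert a == b and len(a) == 32
```

A side benefit of content-derived keys is that re-indexing becomes idempotent: feeding the same bytes twice updates one document rather than creating two.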

Re: Solr feasibility with terabyte-scale data

2008-05-09 Thread Ken Krugler
Hi Marcus, It seems a lot of what you're describing is really similar to MapReduce, so I think Otis' suggestion to look at Hadoop is a good one: it might prevent a lot of headaches, and they've already solved a lot of the tricky problems. There are a number of ridiculously sized projects using it…

Re: Solr feasibility with terabyte-scale data

2008-05-09 Thread James Brady
…to keep their caches warm instead? Should we expect Solr indexing time to slow significantly as we scale up? What kind of query performance could we expect? Is it totally naive even to consider Solr at this kind of scale? Given these parameters is it realistic to think that Solr could handle…

Re: Solr feasibility with terabyte-scale data

2008-05-09 Thread Otis Gospodnetic
…Lucene - Solr - Nutch - Original Message - From: marcusherou <[EMAIL PROTECTED]> To: solr-user@lucene.apache.org Sent: Friday, May 9, 2008 2:37:19 AM Subject: Re: Solr feasibility with terabyte-scale data Hi. I will as well head into a path li…

Re: Solr feasibility with terabyte-scale data

2008-05-09 Thread Marcus Herou
…millisecond query response.

Our environment makes available Apache on blade servers (Dell 1955 dual dual-core 3.x GHz Xeons w/ 8GB RAM) connected to a *large*, high-performance NAS system over a dedicated (out-of-band) GbE switch (Dell PowerConnect 5324) using a 9K MTU (jumbo packets). We are starting with 2 blades and will add as demands require.

While we have a lot of storage, the idea of master/slave Solr Collection Distribution to add more Solr instances clearly means duplicating an immense index. Is it possible to use one instance to update the index on NAS while other instances only read the index and commit to keep their caches warm instead?

Should we expect Solr indexing time to slow significantly as we scale up? What kind of query performance could we expect? Is it totally naive even to consider Solr at this kind of scale?

Given these parameters is it realistic to think that Solr could handle the task?

Any advice/wisdom greatly appreciated,

Phil

--
View this message in context: http://www.nabble.com/Solr-feasibility-with-terabyte-scale-data-tp14963703p17142176.html
Sent from the Solr - User mailing list archive at Nabble.com.

--
Marcus Herou CTO and co-founder Tailsweep AB +46702561312 [EMAIL PROTECTED] http://www.tailsweep.com/ http://blogg.tailsweep.com/

Re: Solr feasibility with terabyte-scale data

2008-05-09 Thread James Brady
…wisdom greatly appreciated, Phil -- View this message in context: http://www.nabble.com/Solr-feasibility-with-terabyte-scale-data-tp14963703p17142176.html Sent from the Solr - User mailing list archive at Nabble.com.

Re: Solr feasibility with terabyte-scale data

2008-05-08 Thread marcusherou
…as we scale up? What kind of query performance could we expect? Is it totally naive even to consider Solr at this kind of scale? Given these parameters is it realistic to think that Solr could handle the task? Any advice/wisdom greatly appreciated, Phil -- View this message in context: http://www.nabble.com/Solr-feasibility-with-terabyte-scale-data-tp14963703p17142176.html Sent from the Solr - User mailing list archive at Nabble.com.

RE: Solr feasibility with terabyte-scale data

2008-01-23 Thread Lance Norskog
…[EMAIL PROTECTED] Sent: Wednesday, January 23, 2008 8:15 AM To: solr-user@lucene.apache.org Subject: Re: Solr feasibility with terabyte-scale data For sure this is a problem. We have considered some strategies. One might be to use a dictionary to clean up the OCR but that gets hard for proper names and…

Re: Solr feasibility with terabyte-scale data

2008-01-23 Thread Phillip Farber
For sure this is a problem. We have considered some strategies. One might be to use a dictionary to clean up the OCR but that gets hard for proper names and technical jargon. Another is to use stop words (which has the unfortunate side effect of making phrase searches like "to be or not to be…"
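The stop-word trade-off Phillip mentions is easy to demonstrate with a tiny sketch. The `STOP_WORDS` set below is an illustrative sample of my own, not Solr's actual stop list:

```python
# A minimal stop-word filter over a token stream, mimicking what an
# analyzer's stop filter does at index and query time.
STOP_WORDS = {"to", "be", "or", "not", "the", "a", "of"}

def strip_stop_words(tokens):
    """Drop every token that appears in the stop list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

# The classic failure case: a phrase made entirely of stop words
# has nothing left to match against.
query = "to be or not to be".split()
print(strip_stop_words(query))  # -> []
```

Every term in the famous phrase is a stop word, so the filtered query is empty; that is exactly the "unfortunate side effect" on phrase searches described above.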

Re: Solr feasibility with terabyte-scale data

2008-01-22 Thread Mike Klaas
On 22-Jan-08, at 4:20 PM, Phillip Farber wrote: We would need all 7M ids scored so we could push them through a filter query to reduce them to a much smaller number on the order of 100-10,000, representing just those that correspond to items in a collection. You could pass the filter to S…
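The filtering step described above, reducing a large scored result list to only the hits that belong to a collection, amounts to a set intersection. A minimal sketch (the function name and data shapes are hypothetical, not Solr's API):

```python
def filter_to_collection(scored_hits, collection_ids):
    """Keep only scored hits whose doc id is in the collection.

    scored_hits: iterable of (doc_id, score) pairs, e.g. search results.
    collection_ids: set of doc ids belonging to the collection.
    """
    # Set membership is O(1), so this is linear in the number of hits
    # regardless of collection size.
    return [(doc, score) for doc, score in scored_hits if doc in collection_ids]

hits = [("d1", 3.2), ("d2", 1.1), ("d3", 0.4)]
collection = {"d1", "d3"}
print(filter_to_collection(hits, collection))  # -> [('d1', 3.2), ('d3', 0.4)]
```

In Solr itself this is what a filter query does server-side: the filter's bitset is intersected with the scored results, so the 7M scored ids never need to leave the server.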

Re: Solr feasibility with terabyte-scale data

2008-01-22 Thread Erick Erickson
Just to add another wrinkle, how clean is your OCR? I've seen it range from very nice (i.e. 99.9% of the words are actually words) to horrible (60%+ of the "words" are nonsense). I saw one attempt to OCR a family tree. As in a stylized tree with the data hand-written along the various branches in e
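One rough way to quantify "how clean is your OCR," along the lines of the percentages Erick quotes, is a dictionary hit rate: the fraction of alphabetic tokens found in a word list. This sketch uses a toy `DICTIONARY` for illustration; a real check would load a full word list:

```python
# Toy dictionary; a real quality check would load e.g. a system word list.
DICTIONARY = {"family", "tree", "born", "married", "died"}

def dictionary_hit_rate(tokens):
    """Fraction of purely alphabetic tokens found in the dictionary."""
    words = [t.lower() for t in tokens if t.isalpha()]
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in DICTIONARY)
    return hits / len(words)

clean = "family tree born married died".split()
noisy = "fam1ly tr3e xqzv born zzk".split()
print(dictionary_hit_rate(clean))  # -> 1.0
print(dictionary_hit_rate(noisy))  # much lower: digits and nonsense drag it down
```

A corpus scoring near 99.9% is in the "very nice" range described above; one where 60%+ of tokens miss the dictionary is firmly in the "horrible" range.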

Re: Solr feasibility with terabyte-scale data

2008-01-22 Thread Phillip Farber
…sematext.com/ -- Lucene - Solr - Nutch - Original Message - From: Phillip Farber <[EMAIL PROTECTED]> To: solr-user@lucene.apache.org Sent: Friday, January 18, 2008 5:26:21 PM Subject: Solr feasibility with terabyte-scale data Hello everyone, We are considering Solr 1.2 to index and search…

Re: Solr feasibility with terabyte-scale data

2008-01-22 Thread Mike Klaas
On 22-Jan-08, at 11:05 AM, Phillip Farber wrote: Currently 1M docs @ ~1.4M/doc. Scaling to 7M docs. This is OCR, so we are talking perhaps 50K words total to index, so as you point out the index might not be too big. It's the *data* that is big, not the *index*, right? So I don't think S…

Re: Solr feasibility with terabyte-scale data

2008-01-22 Thread Ryan McKinley
Obviously as the number of documents increases, the index size must increase to some degree -- I think linearly? But what index size will result for 7M documents over 50K words where we're talking just 2 fields per doc: 1 id field and one OCR field of ~1.4M? Ballpark? Regarding single word qu…
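Ryan's ballpark question can at least be bounded with back-of-the-envelope arithmetic, under the figures stated in the thread (7M docs at ~1.4 MB of OCR text each, a vocabulary of only ~50K distinct words). The sizing conclusions in the comments are my own rough reasoning, not numbers from the thread:

```python
# Back-of-the-envelope sizing under the thread's stated assumptions.
docs = 7_000_000          # target corpus size
bytes_per_doc = 1.4e6     # ~1.4 MB of OCR text per document
vocab = 50_000            # distinct words across the whole corpus

raw_data_tb = docs * bytes_per_doc / 1e12
print(f"raw OCR text: ~{raw_data_tb:.1f} TB")  # -> ~9.8 TB

# Postings grow roughly linearly with total term occurrences, but with
# only ~50K distinct terms the term dictionary itself stays tiny; the
# bulk of the index is compressed frequencies/positions, typically a
# fraction of the raw text size.
```

So the ~10 TB figure is the *data*; the index over it, as Phillip notes above, is a different (and smaller) quantity dominated by postings rather than by the vocabulary.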

Re: Solr feasibility with terabyte-scale data

2008-01-22 Thread Phillip Farber
Ryan McKinley wrote: We are considering Solr 1.2 to index and search a terabyte-scale dataset of OCR. Initially our requirements are simple: basic tokenizing, score sorting only, no faceting. The schema is simple too. A document consists of a numeric id, stored and indexed and a large

Re: Solr feasibility with terabyte-scale data

2008-01-20 Thread Otis Gospodnetic
…Lucene - Solr - Nutch - Original Message - From: Phillip Farber <[EMAIL PROTECTED]> To: solr-user@lucene.apache.org Sent: Friday, January 18, 2008 5:26:21 PM Subject: Solr feasibility with terabyte-scale data Hello everyone, We are considering Solr 1.2 to index and search a terabyte-s…

Re: Solr feasibility with terabyte-scale data

2008-01-19 Thread Ryan McKinley
We are considering Solr 1.2 to index and search a terabyte-scale dataset of OCR. Initially our requirements are simple: basic tokenizing, score sorting only, no faceting. The schema is simple too. A document consists of a numeric id, stored and indexed and a large text field, indexed not

Re: Solr feasibility with terabyte-scale data

2008-01-18 Thread Srikant Jakilinki
Nice description of a use-case. My 2 pennies embedded... Phillip Farber wrote: Hello everyone, We are considering Solr 1.2 to index and search a terabyte-scale dataset of OCR. Initially our requirements are simple: basic tokenizing, score sorting only, no faceting. The schema is simple to

Solr feasibility with terabyte-scale data

2008-01-18 Thread Phillip Farber
Hello everyone, We are considering Solr 1.2 to index and search a terabyte-scale dataset of OCR. Initially our requirements are simple: basic tokenizing, score sorting only, no faceting. The schema is simple too. A document consists of a numeric id, stored and indexed and a large text fiel