> (must not be SPOF!) for shard management.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
Thanks Ken.
I will take a look, be sure of that :)
Kindly
//Marcus
On Fri, May 9, 2008 at 10:26 PM, Ken Krugler <[EMAIL PROTECTED]>
wrote:
> Hi Marcus,
>
> It seems a lot of what you're describing is really similar to MapReduce,
> so I think Otis' suggestion to look at Hadoop is a good one: it might
> prevent a lot of headaches and they've already solved a lot of the
> tricky problems.
Let's see what happens there! :)
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
> From: Ken Krugler <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Friday, May 9, 2008 5:37:19 PM
> Subject: Re: Solr feasibility with terabyte-scale data
> If it ends up replicating core Solr (or Nutch)
> functionality, then it sucks. Not sure what the outcome will be.
>
> -- Ken
A useful schema trick: MD5 or SHA-1 ids. We generate our unique ID with
the MD5 cryptographic checksum algorithm. This takes any number of bytes
of data and produces a 128-bit "random" number, i.e. 128 "random" bits.
Accidental collisions between two different documents have never been
reported, so in practice the ids can be treated as unique.
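A minimal sketch of that ID scheme in Java (class and method names here
are illustrative, not from the thread):

    import java.math.BigInteger;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    public class DocIds {
        /** Returns the 128-bit MD5 of the input as a 32-char hex string. */
        public static String md5Id(byte[] data) {
            try {
                MessageDigest md = MessageDigest.getInstance("MD5");
                byte[] digest = md.digest(data); // 16 bytes = 128 bits
                // BigInteger(1, ...) reads the digest as an unsigned value;
                // %032x zero-pads to a fixed-width id string.
                return String.format("%032x", new BigInteger(1, digest));
            } catch (NoSuchAlgorithmException e) {
                // MD5 ships with every standard JRE.
                throw new RuntimeException(e);
            }
        }
    }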
Hi Marcus,
It seems a lot of what you're describing is really similar to
MapReduce, so I think Otis' suggestion to look at Hadoop is a good
one: it might prevent a lot of headaches and they've already solved
a lot of the tricky problems. There are a number of ridiculously
sized projects using it.
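For readers who haven't seen the model Ken is referring to, here is the
classic WordCount sketch in the Hadoop API of that era
(org.apache.hadoop.mapred); illustrative only, not code from this thread:

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class WordCount {
        // Map phase: emit (word, 1) for every token in the input split.
        public static class Map extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, IntWritable> {
            private final static IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
                    throws IOException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    output.collect(word, ONE);
                }
            }
        }
        // Reduce phase: sum the counts for each word.
        public static class Reduce extends MapReduceBase
                implements Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterator<IntWritable> values,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
                    throws IOException {
                int sum = 0;
                while (values.hasNext()) sum += values.next().get();
                output.collect(key, new IntWritable(sum));
            }
        }
        public static void main(String[] args) throws IOException {
            JobConf conf = new JobConf(WordCount.class);
            conf.setJobName("wordcount");
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);
            conf.setMapperClass(Map.class);
            conf.setReducerClass(Reduce.class);
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));
            JobClient.runJob(conf);
        }
    }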
- Original Message
> From: marcusherou <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Friday, May 9, 2008 2:37:19 AM
> Subject: Re: Solr feasibility with terabyte-scale data
>
>
> Hi.
>
> I will as well head into a path li
>> millisecond query response.
>>>
>>> Our environment makes available Apache on blade servers (Dell 1955 dual
>>> dual-core 3.x GHz Xeons w/ 8GB RAM) connected to a *large*,
>>> high-performance NAS system over a dedicated (out-of-band) GbE switch
>>> (Dell PowerConnect 5324) using a 9K MTU (jumbo packets). We are starting
>>> with 2 blades and will add as demands require.
>>>
>>> While we have a lot of storage, the idea of master/slave Solr Collection
>>> Distribution to add more Solr instances clearly means duplicating an
>>> immense index. Is it possible to use one instance to update the index
>>> on NAS while other instances only read the index and commit to keep
>>> their caches warm instead?
>>>
>>> Should we expect Solr indexing time to slow significantly as we scale
>>> up? What kind of query performance could we expect? Is it totally
>>> naive even to consider Solr at this kind of scale?
>>>
>>> Given these parameters is it realistic to think that Solr could handle
>>> the task?
>>>
>>> Any advice/wisdom greatly appreciated,
>>>
>>> Phil
--
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
[EMAIL PROTECTED]
http://www.tailsweep.com/
http://blogg.tailsweep.com/
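On Phil's shared-NAS question quoted above (one writer, many read-only
searchers): Solr's XML update handler accepts an explicit commit, so a
sketch of the mechanics could look like this (host names hypothetical):

    # The single writer instance updates the index on the NAS, then commits:
    curl 'http://indexer:8983/solr/update' --data-binary '<commit/>' \
         -H 'Content-type:text/xml; charset=utf-8'

    # Each read-only searcher is sent an empty commit too, which makes it
    # reopen its searcher on the new segment files and rewarm its caches:
    curl 'http://searcher1:8983/solr/update' --data-binary '<commit/>' \
         -H 'Content-type:text/xml; charset=utf-8'

Whether concurrent readers on a shared NAS index are safe during writes
is exactly the open question in the thread; this only shows the commit
mechanics.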
- Original Message
From: [EMAIL PROTECTED]
Sent: Wednesday, January 23, 2008 8:15 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr feasibility with terabyte-scale data
For sure this is a problem. We have considered some strategies. One
might be to use a dictionary to clean up the OCR, but that gets hard for
proper names and technical jargon. Another is to use stop words (which
has the unfortunate side effect of making phrase searches like "to be or
not to be" impossible).
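For concreteness, stop words in Solr are configured per field type in
schema.xml; a minimal sketch (the field type name and stop-word file are
illustrative):

    <fieldType name="text_ocr" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
      </analyzer>
    </fieldType>

This is exactly the trade-off mentioned above: every term in "to be or
not to be" is a plausible stop word, so the phrase becomes unsearchable.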
On 22-Jan-08, at 4:20 PM, Phillip Farber wrote:
We would need all 7M ids scored so we could push them through a
filter query to reduce them to a much smaller number, on the order
of 100-10,000, representing just those that correspond to items in a
collection.
You could pass the filter to Solr.
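In Solr that filter is the fq (filter query) parameter, which restricts
the candidate set without changing scores. A sketch, with hypothetical
ocr_text and collection_id fields:

    http://localhost:8983/solr/select?q=ocr_text:whale&fq=collection_id:(12+OR+97+OR+403)&fl=id,score&rows=100

The fq result is cached in Solr's filterCache independently of the main
query, so reusing the same collection filter across queries is cheap.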
Just to add another wrinkle, how clean is your OCR? I've seen it
range from very nice (i.e. 99.9% of the words are actually words) to
horrible (60%+ of the "words" are nonsense). I saw one attempt
to OCR a family tree. As in a stylized tree with the data
hand-written along the various branches in e
On 22-Jan-08, at 11:05 AM, Phillip Farber wrote:
Currently 1M docs @ ~1.4MB/doc. Scaling to 7M docs. This is OCR, so
we are talking perhaps 50K words total to index, so as you point out
the index might not be too big. It's the *data* that is big, not
the *index*, right? So I don't think S
Obviously as the number of documents increases the index size must
increase to some degree -- I think linearly? But what index size will
result for 7M documents over 50K words where we're talking just 2 fields
per doc: 1 id field and one OCR field of ~1.4MB? Ballpark?
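Crude arithmetic on those numbers (an estimate, not a figure from the
thread): 7M docs x ~1.4MB is roughly 10TB of raw OCR text. A ~50K-word
vocabulary keeps the term dictionary trivial; what grows roughly
linearly with total tokens is the postings and position data, which
typically compresses to a modest fraction of the raw text -- so a single
index plausibly lands in the hundreds of gigabytes to low terabytes,
dominated by position data for phrase queries.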
Regarding single word qu
- Original Message
From: Phillip Farber <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Friday, January 18, 2008 5:26:21 PM
Subject: Solr feasibility with terabyte-scale data

Hello everyone,
We are considering Solr 1.2 to index and search a terabyte-scale dataset
of OCR. Initially our requirements are simple: basic tokenizing, score
sorting only, no faceting. The schema is simple too. A document
consists of a numeric id, stored and indexed, and a large text field,
indexed but not stored.
Nice description of a use-case. My 2 pennies embedded...
Phillip Farber wrote:
> Hello everyone,
> We are considering Solr 1.2 to index and search a terabyte-scale
> dataset of OCR.