> (must not be SPOF!) for shard management.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> - Original Message
> > From: marcusherou <[EMAIL PROTECTED]>
> > To: solr-user@lucene.apache.org
> > Sent: Friday, May 9, 2008 2:37:19 AM
> > Subject: Re: Solr feasibility with terabyte-scale data
Thanks Ken.
I will take a look, be sure of that :)
Kindly
//Marcus
On Fri, May 9, 2008 at 10:26 PM, Ken Krugler <[EMAIL PROTECTED]>
wrote:
> Hi Marcus,
>
> It seems a lot of what you're describing is really similar to MapReduce,
> so I think Otis' suggestion to look at Hadoop is a good one: i
w, let's see what happens there! :)
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
> From: Ken Krugler <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Friday, May 9, 2008 5:37:19 PM
> Subject: Re: Solr feasibility with terabyte-scale data
s replicating core Solr (or Nutch)
functionality, then it sucks. Not sure what the outcome will be.
-- Ken
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
> From: Ken Krugler <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Friday, May 9, 2008 4:26:19 PM
> Subject: Re: Solr feasibility with terabyte-scale data
>
> Hi Marcus,
>
> >I
A useful schema trick: MD5 or SHA-1 ids. We generate our unique ID with the
MD5 cryptographic checksumming algorithm. It takes an arbitrary number of
bytes of data and produces a 128-bit "random" number, i.e. 128 "random" bits.
At this point there are no reports of two different datasets that give the
same checksum.
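To make that concrete, here is a minimal sketch of such an id generator,
assuming Java and the standard MessageDigest API; the source string in main()
is a made-up placeholder for whatever uniquely identifies a document:

import java.math.BigInteger;
import java.security.MessageDigest;

public class DocIdGenerator {

    // Hash whatever uniquely identifies the document (URL, path, etc.) into a
    // fixed 128-bit key and use its 32-char hex form as the Solr id field.
    public static String md5Id(String uniqueSource) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] digest = md.digest(uniqueSource.getBytes("UTF-8"));
        // Left-pad to 32 hex chars so every id has the same length.
        return String.format("%032x", new BigInteger(1, digest));
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical document source.
        System.out.println(md5Id("http://example.com/doc/42"));
    }
}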
Hi Marcus,
It seems a lot of what you're describing is really similar to
MapReduce, so I think Otis' suggestion to look at Hadoop is a good
one: it might prevent a lot of headaches and they've already solved
a lot of the tricky problems. There are a number of ridiculously sized
projects using it
So our problem is made easier by having complete index
partitionability by a user_id field. That means at one end of the
spectrum, we could have one monolithic index for everyone, while at
the other end of the spectrum we could have individual cores for each
user_id.
At the moment, we've gone
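A rough sketch of what routing on such a user_id field could look like,
assuming simple hash-mod assignment over a fixed list of cores; the core URLs
below are invented placeholders:

public class UserShardRouter {

    private final String[] coreUrls;

    public UserShardRouter(String[] coreUrls) {
        this.coreUrls = coreUrls;
    }

    // Same user_id always maps to the same core, so a user's documents and
    // queries all hit one partition.
    public String coreFor(String userId) {
        // Mask the sign bit rather than Math.abs (which fails on MIN_VALUE).
        int bucket = (userId.hashCode() & 0x7fffffff) % coreUrls.length;
        return coreUrls[bucket];
    }

    public static void main(String[] args) {
        UserShardRouter router = new UserShardRouter(new String[] {
            "http://solr1:8983/solr/users0",
            "http://solr1:8983/solr/users1",
            "http://solr2:8983/solr/users2",
            "http://solr2:8983/solr/users3"
        });
        System.out.println(router.coreFor("user-12345"));
    }
}

Sizing the list to one entry gives the monolithic-index end of the spectrum;
one core per user is the other extreme.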
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
> From: marcusherou <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Friday, May 9, 2008 2:37:19 AM
> Subject: Re: Solr feasibility with terabyte-scale data
>
>
> Hi.
>
> I will as well head into a path li
Cool.
Since you must certainly already have a good partitioning scheme, could you
elaborate at a high level on how you set it up?
I'm certain that I will shoot myself in the foot once or twice before
getting it right, but that is what I'm good at: never stop trying :)
However it is nice to
Hi, we have an index of ~300GB, which is at least approaching the
ballpark you're in.
Lucky for us, to coin a phrase, we have an 'embarrassingly
partitionable' index so we can just scale out horizontally across
commodity hardware with no problems at all. We're also using the
multicore featu
Hi.
I will as well head into a path like yours within some months from now.
Currently I have an index of ~10M docs and only store id's in the index for
performance and distribution reasons. When we enter a new market I'm
assuming we will soon hit 100M and quite soon after that 1G documents. Each
From: [EMAIL PROTECTED]
Sent: Wednesday, January 23, 2008 8:15 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr feasibility with terabyte-scale data
For sure this is a problem. We have considered some strategies. One might
be to use a dictionary to clean up the OCR but that gets hard for proper
names and
For sure this is a problem. We have considered some strategies. One
might be to use a dictionary to clean up the OCR but that gets hard for
proper names and technical jargon. Another is to use stop words (which
has the unfortunate side effect of making phrase searches like "to be or
not to be
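To make the dictionary idea concrete, here is a small illustrative sketch
(not from the thread) that keeps only OCR tokens found in a word list, with a
second list standing in for the proper names and jargon a plain dictionary
misses; the file paths are placeholders:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class OcrTokenCleaner {

    private final Set<String> known = new HashSet<String>();

    // Load one or more word lists, e.g. a dictionary plus a jargon whitelist.
    public OcrTokenCleaner(String... wordListFiles) throws IOException {
        for (String file : wordListFiles) {
            BufferedReader in = new BufferedReader(new FileReader(file));
            String line;
            while ((line = in.readLine()) != null) {
                known.add(line.trim().toLowerCase());
            }
            in.close();
        }
    }

    // Keep only tokens that appear in the loaded lists; likely OCR noise
    // (and, unfortunately, anything else missing from the lists) is dropped.
    public List<String> clean(String[] ocrTokens) {
        List<String> kept = new ArrayList<String>();
        for (String token : ocrTokens) {
            if (known.contains(token.toLowerCase())) {
                kept.add(token);
            }
        }
        return kept;
    }
}

The trade-off mentioned above shows up directly in clean(): proper names and
technical terms silently disappear unless they are added to the word lists.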
On 22-Jan-08, at 4:20 PM, Phillip Farber wrote:
We would need all 7M ids scored so we could push them through a
filter query to reduce them to a much smaller number on the order
of 100-10,000 representing just those that correspond to items in a
collection.
You could pass the filter to S
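One concrete way to "pass the filter to Solr" is to send the collection
restriction as an fq (filter query) parameter instead of pulling all 7M
scored ids back and filtering client-side. A sketch only; the collection_id
field name and host are assumptions:

import java.net.URLEncoder;

public class CollectionFilterQuery {

    public static String buildUrl(String userQuery, String collectionId)
            throws Exception {
        String q = URLEncoder.encode(userQuery, "UTF-8");
        // fq restricts (and is cached in the filterCache) without affecting
        // scoring, so repeated searches within a collection reuse the filter.
        String fq = URLEncoder.encode("collection_id:" + collectionId, "UTF-8");
        return "http://localhost:8983/solr/select?q=" + q
                + "&fq=" + fq + "&fl=id,score&rows=100";
    }

    public static void main(String[] args) throws Exception {
        System.out.println(buildUrl("dickens \"great expectations\"", "42"));
    }
}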
Just to add another wrinkle, how clean is your OCR? I've seen it
range from very nice (i.e. 99.9% of the words are actually words) to
horrible (60%+ of the "words" are nonsense). I saw one attempt
to OCR a family tree. As in a stylized tree with the data
hand-written along the various branches in e
Otis Gospodnetic wrote:
Hi,
Some quick notes, since it's late here.
- You'll need to wait for SOLR-303 - there is no way even a big machine will be
able to search such a large index in a reasonable amount of time, plus you may
simply not have enough RAM for such a large index.
Are you bas
On 22-Jan-08, at 11:05 AM, Phillip Farber wrote:
Currently 1M docs @ ~1.4MB/doc, scaling to 7M docs. This is OCR, so we are
talking perhaps 50K words total to index, so as you point out the index
might not be too big. It's the *data* that is big, not the *index*, right?
So I don't think S
Obviously as the number of documents increases the index size must
increase to some degree -- I think linearly? But what index size will
result for 7M documents over 50K words where we're talking just 2 fields
per doc: 1 id field and one OCR field of ~1.4MB? Ballpark?
Regarding single word qu
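For the ballpark question above, a back-of-envelope estimate; the 20-35%
index-to-text ratio is only a commonly quoted rule of thumb for a Lucene
index with positions and no stored fields, not a measurement of this corpus:

public class IndexSizeEstimate {
    public static void main(String[] args) {
        long docs = 7000000L;            // 7M documents
        double mbPerDoc = 1.4;           // ~1.4 MB of OCR text per document
        double rawTb = docs * mbPerDoc / 1000000.0;      // raw text, in TB
        double low = rawTb * 0.20, high = rawTb * 0.35;  // assumed index ratio
        System.out.printf("raw text ~%.1f TB, index roughly %.1f-%.1f TB%n",
                rawTb, low, high);
    }
}

Even with a small vocabulary, the position data needed for phrase queries
keeps the index in the terabyte range under these assumptions.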
Ryan McKinley wrote:
We are considering Solr 1.2 to index and search a terabyte-scale
dataset of OCR. Initially our requirements are simple: basic
tokenizing, score sorting only, no faceting. The schema is simple
too. A document consists of a numeric id, stored and indexed and a
large
Hi,
Some quick notes, since it's late here.
- You'll need to wait for SOLR-303 - there is no way even a big machine will be
able to search such a large index in a reasonable amount of time, plus you may
simply not have enough RAM for such a large index.
- I'd suggest you wait for Solr 1.3 (or s
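For context on the SOLR-303 reference: that work became the distributed
search support in Solr 1.3, where one request is fanned out to several shards
via a shards parameter and the results are merged. A rough sketch, with
placeholder hosts and an assumed ocr field:

public class ShardedQuery {
    public static void main(String[] args) {
        // Each shard holds a slice of the corpus, so no single box needs the
        // whole index (or enough RAM for it).
        String shards = "solr1:8983/solr,solr2:8983/solr,solr3:8983/solr";
        String url = "http://solr1:8983/solr/select"
                + "?q=ocr:whale"
                + "&shards=" + shards
                + "&rows=10&fl=id,score";
        // Send with any HTTP client; the receiving node queries every shard
        // listed and merges the top hits.
        System.out.println(url);
    }
}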
We are considering Solr 1.2 to index and search a terabyte-scale dataset
of OCR. Initially our requirements are simple: basic tokenizing, score
sorting only, no faceting. The schema is simple too. A document
consists of a numeric id, stored and indexed and a large text field,
indexed not
Nice description of a use-case. My 2 pennies embedded...
Phillip Farber wrote:
Hello everyone,
We are considering Solr 1.2 to index and search a terabyte-scale
dataset of OCR. Initially our requirements are simple: basic
tokenizing, score sorting only, no faceting. The schema is simple
to