ae...@dot.wi.gov]
> Sent: Monday, August 15, 2011 14:54
> To: solr-user@lucene.apache.org
> Subject: RE: ideas for indexing large amount of pdf docs
Note on i: Solr replication provides pretty good clustering support
out-of-the-box, including replication of multiple cores. Read the Wiki on
replication (Google +solr +replication if you don't know where it is).
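For reference, a minimal sketch of the solrconfig.xml pieces involved (the
hostname, core name, and conf file list here are placeholders, not from this
thread):

    <!-- on the master (indexing) node: publish the index after each commit -->
    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="master">
        <str name="replicateAfter">commit</str>
        <str name="confFiles">schema.xml,stopwords.txt</str>
      </lst>
    </requestHandler>

    <!-- on each slave (search) node: poll the master for new index versions -->
    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="slave">
        <str name="masterUrl">http://master-host:8983/solr/core0/replication</str>
        <str name="pollInterval">00:00:60</str>
      </lst>
    </requestHandler>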
In my experience, the problem with indexing PDFs is that it takes a lot of CPU
on the machine doing the text extraction.
Date: Sat, 13 Aug 2011 15:34:19 -0400
Subject: Re: ideas for indexing large amount of pdf docs
Ahhh, ok, my reply was irrelevant ...
Here's a good write-up on this problem:
http://www.lucidimagination.com/content/scaling-lucene-and-solr
> idea to minimize this time as much as possible when we go into production.
>
> Best,
>
> Rode.
>
> -----Original Message-----
> From: Erick Erickson
> To: solr-user@lucene.apache.org
> Date: Sat, 13 Aug 2011 12:13:27 -0400
> Subject: Re: ideas for indexing large amount of pdf docs
You could send the PDFs for processing using a queue solution like Amazon SQS,
and kick off Amazon instances to work through the queue (see the sketch below).
Once you have extracted the text with Tika, just send the update to Solr.
Bill Bell
Sent from mobile
On Aug 13, 2011, at 10:13 AM, Erick Erickson wrote:
> Yeah, parsing PDF files
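A minimal sketch of that worker loop with the AWS SDK for Java. The queue URL,
the credentials, and the convention that each message body holds one PDF
location are all assumptions; indexPdf is a hypothetical hook for the
Tika/SolrJ step sketched further down the thread:

    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.services.sqs.AmazonSQS;
    import com.amazonaws.services.sqs.AmazonSQSClient;
    import com.amazonaws.services.sqs.model.DeleteMessageRequest;
    import com.amazonaws.services.sqs.model.Message;
    import com.amazonaws.services.sqs.model.ReceiveMessageRequest;

    public class PdfQueueWorker {

        public static void main(String[] args) throws Exception {
            AmazonSQS sqs = new AmazonSQSClient(
                    new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));
            String queueUrl = "https://queue.amazonaws.com/123456789012/pdf-queue";

            while (true) {
                // Each message body is assumed to carry the location of one PDF
                for (Message m : sqs.receiveMessage(
                        new ReceiveMessageRequest(queueUrl)).getMessages()) {
                    indexPdf(m.getBody());   // Tika extraction + Solr update
                    // Delete only after the document has made it into Solr
                    sqs.deleteMessage(new DeleteMessageRequest(
                            queueUrl, m.getReceiptHandle()));
                }
            }
        }

        private static void indexPdf(String location) {
            // hypothetical hook: see the Tika/SolrJ sketch later in the thread
        }
    }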
idea to minimize this time as much as possible when we go into production.
Best,
Rode.
-----Original Message-----
From: Erick Erickson
To: solr-user@lucene.apache.org
Date: Sat, 13 Aug 2011 12:13:27 -0400
Subject: Re: ideas for indexing large amount of pdf docs
Yeah, parsing PDF files can be pretty resource-intensive, so one solution
is to offload it somewhere else. You can use the Tika libraries in SolrJ
to parse the PDFs on as many clients as you want, just transmitting the
results to Solr for indexing.
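In SolrJ terms, that client-side extraction might look roughly like this (a
sketch against SolrJ 3.x and the Tika facade; the Solr URL and the "id"/"text"
field names are assumptions about the schema, not from this thread):

    import java.io.File;

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.tika.Tika;

    public class PdfClientIndexer {

        public static void main(String[] args) throws Exception {
            // Pay the Tika CPU cost on this client, not on the Solr box
            SolrServer solr = new CommonsHttpSolrServer("http://solr-host:8983/solr");
            Tika tika = new Tika();

            for (String path : args) {
                String text = tika.parseToString(new File(path));  // expensive part

                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", path);     // assumes an "id" field in schema.xml
                doc.addField("text", text);   // assumes a "text" field in schema.xml
                solr.add(doc);
            }
            solr.commit();
        }
    }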
How are all these docs being submitted? Is this s