RE: ideas for indexing large amount of pdf docs

2011-08-16 Thread Rode González
ae...@dot.wi.gov] > Sent: Monday, 15 August 2011 14:54 > To: solr-user@lucene.apache.org > Subject: RE: ideas for indexing large amount of pdf docs > > Note on i: Solr replication provides pretty good clustering support out-of-the-box, including replication of m
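A minimal SolrJ sketch of the query side of such a replication-based setup, assuming a master plus two slave cores at hypothetical URLs (class names follow SolrJ 3.x/4.x; exact names vary by version):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.LBHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class ReplicaQueryClient {
        public static void main(String[] args) throws Exception {
            // Round-robin queries over two replication slaves (hypothetical hosts).
            LBHttpSolrServer lb = new LBHttpSolrServer(
                    "http://solr-slave1:8983/solr",
                    "http://solr-slave2:8983/solr");

            QueryResponse rsp = lb.query(new SolrQuery("text:pdf"));
            System.out.println("hits: " + rsp.getResults().getNumFound());
        }
    }

Indexing would still go to the master; the slaves pull the index through the replication handler.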

RE: ideas for indexing large amount of pdf docs

2011-08-15 Thread Jaeger, Jay - DOT
  {
      print "Query: lnamesyn:$lname AND fnamesyn:$fname$fuzzy";
      print $response->content();
  }
  print "POST for $fname $lname completed, HTTP status=" . $response->code . "\n";
}
$elapsed = time() - $starttime;
$average

Re: ideas for indexing large amount of pdf docs

2011-08-13 Thread Rode Gonzalez (libnova)
Sat, 13 Aug 2011 15:34:19 -0400 Subject: Re: ideas for indexing large amount of pdf docs Ahhh, ok, my reply was irrelevant ... Here's a good write-up on this problem: http://www.lucidimagination.com/content/scaling-lucene-and-solr

Re: ideas for indexing large amount of pdf docs

2011-08-13 Thread Erick Erickson

Re: ideas for indexing large amount of pdf docs

2011-08-13 Thread Bill Bell
You could send the PDFs for processing using a queue solution like Amazon SQS. Kick off Amazon instances to process the queue. Once you have processed them to text with Tika, just send the update to Solr. Bill Bell Sent from mobile
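A rough sketch of that queue-worker idea, assuming AWS SDK for Java v1 class names, that each SQS message body carries a path to a PDF the worker can read, and placeholder queue/Solr URLs and field names (not the poster's actual setup):

    import java.io.File;
    import java.util.List;
    import com.amazonaws.services.sqs.AmazonSQSClient;
    import com.amazonaws.services.sqs.model.Message;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.tika.Tika;

    public class PdfQueueWorker {
        public static void main(String[] args) throws Exception {
            String queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/pdf-to-index"; // placeholder
            AmazonSQSClient sqs = new AmazonSQSClient();   // credentials from the default provider chain
            SolrServer solr = new HttpSolrServer("http://solr-master:8983/solr");
            Tika tika = new Tika();

            while (true) {
                List<Message> messages = sqs.receiveMessage(queueUrl).getMessages();
                for (Message m : messages) {
                    String pdfPath = m.getBody();                         // assumed: message body is a file path
                    String text = tika.parseToString(new File(pdfPath));  // heavy PDF parsing happens here, not in Solr

                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", pdfPath);
                    doc.addField("text", text);
                    solr.add(doc);

                    sqs.deleteMessage(queueUrl, m.getReceiptHandle());    // ack only after the add succeeded
                }
                if (!messages.isEmpty()) {
                    solr.commit();
                }
            }
        }
    }

Each EC2 worker just runs this loop, so parsing capacity scales with the number of instances while Solr only receives plain text.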

Re: ideas for indexing large amount of pdf docs

2011-08-13 Thread Rode Gonzalez (libnova)
...idea to minimize this time as much as possible when we enter production. Best, Rode.

Re: ideas for indexing large amount of pdf docs

2011-08-13 Thread Erick Erickson
Yeah, parsing PDF files can be pretty resource-intensive, so one solution is to offload it somewhere else. You can use the Tika libraries in SolrJ to parse the PDFs on as many clients as you want, just transmitting the results to Solr for indexing. How are all these docs being submitted? Is this s
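A minimal sketch of that client-side approach, parsing each PDF with Tika's facade API and sending only the extracted text to Solr through SolrJ; the host, id, and field names are placeholders, and the class names are from later SolrJ releases (3.6+):

    import java.io.File;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.tika.Tika;

    public class ClientSideExtractor {
        public static void main(String[] args) throws Exception {
            Tika tika = new Tika();
            SolrServer solr = new HttpSolrServer("http://solr-master:8983/solr");

            for (String path : args) {
                String text = tika.parseToString(new File(path)); // PDF parsing stays on the client

                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", path);      // hypothetical schema fields
                doc.addField("text", text);
                solr.add(doc);
            }
            solr.commit();
        }
    }

Run as many of these clients in parallel as the hardware allows; Solr itself only sees lightweight text updates.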

ideas for indexing large amount of pdf docs

2011-08-13 Thread Rode Gonzalez (libnova)
Hi all, I want to ask about the best way to implement a solution for indexing a large amount of pdf documents, between 10 and 60 MB each, with 100 to 1000 users connected simultaneously. I currently have 1 core of Solr 3.3.0 and it works fine for a small number of pdf docs, but I'm afraid about the mome
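For comparison, the single-core baseline is usually to post the raw PDF straight to Solr's ExtractingRequestHandler (Solr Cell) and let the server do the parsing. A hedged sketch, assuming the default /update/extract handler is enabled, placeholder URLs and ids, and an addFile signature from SolrJ 4.x:

    import java.io.File;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
    import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

    public class SimplePdfPost {
        public static void main(String[] args) throws Exception {
            HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");

            ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
            req.addFile(new File("sample.pdf"), "application/pdf"); // server-side Tika does the parsing
            req.setParam("literal.id", "doc-1");                    // our own unique key
            req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
            solr.request(req);
        }
    }

With 10-60 MB PDFs this puts all the parsing load on the single Solr node, which is exactly what the replies above try to move elsewhere.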