Re: ideas for indexing large amount of pdf docs

Rode Gonzalez (libnova) Sat, 13 Aug 2011 11:14:02 -0700

Hi Erick, 

Our app inserts the PDFs from a back-office site, and people can 
search/consult through a front-end site. Both are written in PHP. I've 
installed a Tomcat exclusively for Solr.

The PDF docs are indexed and not stored, using the standard 
solr.extraction.ExtractingRequestHandler (solr-cell.jar and the other jars 
included in the contrib/extraction dir, you know) in an offline mode. 
Summarizing: the internal users submit the docs; these docs are saved on 
the server; there is a task that takes the docs and puts them into the indexer 
through a curl utility; when the task finishes, the doc is available to the 
frontend; once more, we use curl utilities to make queries to Solr.

The problem isn't the process of indexing. The max ingestion rate can be 
1-60 docs at a time. The number of PDF docs can be 1000, 2000, 10,000,... I 
don't know exactly... but a lot of them, as many as the books in a library.

But there's no problem with this; this part of the process runs offline: take a 
doc, index a doc; take another doc, index another doc, ...

The problem is the response time when the number of PDFs grows and grows... 
What is the better manner, the best way, the fantastic idea to minimize this 
time as much as possible when we go into production?

Best,

Rode.

-----Original Message-----
From: Erick Erickson <erickerick...@gmail.com>
To: solr-user@lucene.apache.org
Date: Sat, 13 Aug 2011 12:13:27 -0400
Subject: Re: ideas for indexing large amount of pdf docs

Yeah, parsing PDF files can be pretty resource-intensive, so one solution
is to offload it somewhere else. You can use the Tika libraries in SolrJ
to parse the PDFs on as many clients as you want, just transmitting the
results to Solr for indexing.
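
A minimal sketch of the "transmit the results" half of that idea, in Python rather than SolrJ: once the text has been extracted on a client (with Tika or any other tool, not shown here), it can be wrapped in a plain XML add message for Solr's update handler. The helper name build_add_xml and the field names "id" and "text" are assumptions and have to match the real schema.

```python
import xml.etree.ElementTree as ET


def build_add_xml(doc_id, text):
    """Return a Solr XML <add> message for one already-extracted document."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    # Field names are assumptions; use the ones defined in schema.xml.
    for name, value in (("id", doc_id), ("text", text)):
        field = ET.SubElement(doc, "field", name=name)
        field.text = value  # ElementTree escapes &, <, > on serialization
    return ET.tostring(add, encoding="unicode")
```

The resulting string would be POSTed to the /update handler with a text/xml content type, so Solr itself never has to open the PDF.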

How are all these docs being submitted? Is this some kind of
on-the-fly indexing/searching or what? I'm mostly curious what
your projected max ingestion rate is...

Best

Erick

On Sat, Aug 13, 2011 at 4:49 AM, Rode Gonzalez (libnova)
<r...@libnova.es> wrote:

> Hi all,
>
> I want to ask about the best way to implement a solution for indexing a
> large number of PDF documents of 10-60 MB each, with 100 to 1000 users
> connected simultaneously.
>
> I currently have 1 core of Solr 3.3.0 and it works fine for a small number of
> PDF docs, but I'm worried about the moment when we go into production.
>
> Some possibilities:
>
> i. Clustering. I have no experience in this, so it would be a bad idea to
> venture into this.
>
> ii. Multicore solution. Make some kind of hash to choose one core for each
> query (exact queries), and thus reduce the size of the individual indexes to
> consult, or consult all the cores at the same time (complex queries).
>
> iii. Do nothing more and wait for the catastrophe in the response times :P
>
> Can someone with experience help a bit to decide?
>
> Thanks a lot in advance.
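
Option ii in the quoted message (hash to one core for exact lookups, all cores for complex queries) could be sketched roughly as follows, in Python for illustration. The core names and locations are hypothetical, and core_for and shards_param are made-up helper names; the only Solr feature relied on is the shards request parameter of distributed search, which takes a comma-separated list of host:port/path entries.

```python
import zlib

# Hypothetical core locations, written the way the `shards` parameter
# expects them (no "http://" prefix).
CORES = [
    "localhost:8080/solr/core0",
    "localhost:8080/solr/core1",
    "localhost:8080/solr/core2",
]


def core_for(doc_id, cores=CORES):
    """Pick one core from a stable hash of the document id."""
    # crc32 is stable across runs, unlike Python's built-in hash() for str.
    return cores[zlib.crc32(doc_id.encode("utf-8")) % len(cores)]


def shards_param(cores=CORES):
    """Value for `shards` when a query has to consult every core."""
    return ",".join(cores)
```

The same core_for function has to be used at index time and at lookup time, otherwise an exact query lands on a core that never received the document. Changing the number of cores also changes the mapping, so a reindex would be needed.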
