Hi Erick,

Our app inserts the PDFs from a back-office site, and people can search and consult them through a front-end site. Both are written in PHP. I've installed a Tomcat exclusively for Solr.
The PDF docs are indexed, not stored, using the standard solr.extraction.ExtractingRequestHandler (solr-cell.jar and the other jars included in the contrib/extraction dir, you know) in an offline mode. Summarizing: the internal users submit the docs; these docs are saved on the server; a task takes the docs and puts them into the indexer through a curl utility; when the task finishes, the doc is available to the frontend; once more, we use curl utilities to make queries to Solr.

The problem isn't the indexing process. The max ingestion rate can be 1-60 docs at a time. The number of PDF docs can be 1000, 2000, 10,000... I don't know exactly, but a lot of them, as many as the books in a library. But no problem about this; this part of the process runs offline: take a doc, index a doc; take another doc, index another doc...

The problem is the response time when the number of PDFs grows and grows. What is the better manner, the best way, the fantastic idea to minimize this time as much as possible when we go into production?

Best,
Rode.

-----Original Message-----
From: Erick Erickson <erickerick...@gmail.com>
To: solr-user@lucene.apache.org
Date: Sat, 13 Aug 2011 12:13:27 -0400
Subject: Re: ideas for indexing large amount of pdf docs

Yeah, parsing PDF files can be pretty resource-intensive, so one solution is to offload it somewhere else. You can use the Tika libraries in SolrJ to parse the PDFs on as many clients as you want, just transmitting the results to Solr for indexing.

How are all these docs being submitted? Is this some kind of on-the-fly indexing/searching or what? I'm mostly curious what your projected max ingestion rate is...

Best
Erick

On Sat, Aug 13, 2011 at 4:49 AM, Rode Gonzalez (libnova) <r...@libnova.es> wrote:
> Hi all,
>
> I want to ask about the best way to implement a solution for indexing a
> large amount of pdf documents between 10-60 MB each one. 100 to 1000 users
> connected simultaneously.
>
> I actually have 1 core of solr 3.3.0 and it works fine for a few number of
> pdf docs but I'm afraid about the moment when we enter in production time.
>
> some possibilities:
>
> i. clustering. I have no experience in this, so it will be a bad idea to
> venture into this.
>
> ii. multicore solution. make some kind of hash to choose one core at each
> query (exact queries) and thus reduce the size of the individual indexes to
> consult or to consult all the cores at same time (complex queries).
>
> iii. do nothing more and wait for the catastrophe in the response times :P
>
>
> Someone with experience can help a bit to decide?
>
> Thanks a lot in advance.
>
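
For reference, the multicore idea in option ii could be sketched roughly as below. This is only an illustration under assumptions, not a tested setup: the host, the core names and the `id` field are hypothetical, and the function names are made up for the example. Exact lookups go to the single core chosen by a stable hash of the document id; complex queries go to all cores at once through Solr's distributed-search `shards` parameter. The same hash would have to be used at indexing time so that each doc lands in the core where it will later be looked up.

```python
import hashlib
from urllib.parse import urlencode

# Hypothetical values, for illustration only.
SOLR_HOST = "localhost:8080"
CORES = ["core0", "core1", "core2", "core3"]


def core_for(doc_id: str) -> str:
    """Pick a core deterministically from a stable hash of the document id."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return CORES[int(digest, 16) % len(CORES)]


def exact_query_url(doc_id: str) -> str:
    """Build a query URL that hits only the core owning doc_id.

    Assumes the unique key field is called `id`.
    """
    params = urlencode({"q": f'id:"{doc_id}"'})
    return f"http://{SOLR_HOST}/solr/{core_for(doc_id)}/select?{params}"


def fanout_query_url(query: str) -> str:
    """Build a query URL that searches every core via the `shards` parameter."""
    shards = ",".join(f"{SOLR_HOST}/solr/{core}" for core in CORES)
    params = urlencode({"q": query, "shards": shards})
    # Any core can receive the request; it merges results from all shards.
    return f"http://{SOLR_HOST}/solr/{CORES[0]}/select?{params}"
```

The URLs produced this way can be passed to the same curl utilities already used for querying. Note that adding or removing a core changes the modulo result, so the docs would need to be reindexed if the number of cores changes.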