Solr would work fine for this. Your PDF files would have to be interpreted by Tika; see the Data Import Handler, FileListEntityProcessor and TikaEntityProcessor. I don't think Nutch is the right tool here.
You'll be wanting to do highlighting and a couple of other things.... You'll spend some time tweaking results to be what you want, but this is certainly do-able.

Best
Erick

On Tue, Jul 19, 2011 at 1:29 PM, Matthew Twomey <mtwo...@beakstar.com> wrote:
> Greetings,
>
> I'm interesting in having a server based personal document library with a
> few specific features and I'm trying to determine what the most appropriate
> tools are to build it.
>
> I have the following content which I wish to include in the archive:
>
> 1. A smallish collection of technical books in PDF format (around 100)
> 2. Many years of several different magazine subscriptions in PDF format
> (probably another 100 - 200 PDFs)
> 3. Several years of personal documents which were scanned in and converted
> to searchable PDF format (300 - 500 documents)
> 4. I also have local mirrors of several HTML based reference sites
>
> I'd like to have the ability to index all of this content and search it from
> a web form (so that I and a few other can reach it from multiple locations).
> Here are two examples of the functionality I'm looking for:
>
> Scenario 1. "What was that software that has all the nutritional data and
> hooks up to some USDA database? I know I read about it in one of my Linux
> Journals last year....."
>
> Now I'd like to be able to pull up the webform and search for "nutrition
> USDA". I'd like to restrict the search to the Linux Journal magazine PDFs
> (or refine the results). I'd like results to contain context snippets with
> each search result. Finally most importantly, I'd like multiple results per
> PDF (or all occurrences). The last one is important so that I can actually
> quickly find the right issue (in case there is some advertisement in every
> issue for the last year that contains those terms). When I click on the
> desired result, the PDF is downloaded by my browser.
>
> Scenario 2. "How much have I been paying for property taxes for the last
> five years again?" (the bills are all scanned in)
>
> In this case I'd like to search for my property identification number (which
> is on the bills) and the results should show all the documents that have it,
> with context. Clicking on results downloads the documents. I assume this
> example is simple to achieve if example 1 can be done.
>
> So in general, my question is - can this be done in a fairly straight
> forward manner with Solr? Is there a more appropriate tool to be using (e.g.
> Nutch?). Also, I have looked high and low for a free, already baked solution
> which can do scenario 1 but haven't been able to find something - so if
> someone knows of such a thing, please let me know.
>
> Thanks!
>
> -Matt
>
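
As a rough illustration of the highlighting and filtering mentioned in the reply, here is a minimal sketch of how a web form might build the Solr request for Scenario 1. The parameters `q`, `fq`, `hl`, `hl.fl`, `hl.snippets` and `wt` are standard Solr request parameters; the base URL, the field names `text` and `magazine`, and the helper `build_search_url` are assumptions made for the example and would depend on your own schema and setup.

```python
from urllib.parse import urlencode


def build_search_url(base_url, terms, magazine=None, snippets=5):
    """Build a Solr select URL that asks for highlighted context snippets.

    `text` and `magazine` are hypothetical field names; substitute whatever
    your schema actually defines.
    """
    params = [
        ("q", terms),                    # the user's search terms
        ("hl", "true"),                  # turn highlighting on
        ("hl.fl", "text"),               # field to take snippets from (assumed name)
        ("hl.snippets", str(snippets)),  # max snippets per matching document
        ("wt", "json"),                  # response format
    ]
    if magazine is not None:
        # Filter query restricting results to one magazine (assumed field name)
        params.append(("fq", f'magazine:"{magazine}"'))
    return base_url + "?" + urlencode(params)


# Assumed local Solr instance; adjust host, port and path to your install.
url = build_search_url(
    "http://localhost:8983/solr/select",
    "nutrition USDA",
    magazine="Linux Journal",
)
print(url)
```

Note that `hl.snippets` returns several snippets per matching document, not several separate hits. If you really want each occurrence as its own result, one option is to index each PDF page as its own document, but that is a design choice to weigh, not something the setup above requires.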