Solr would work fine for this. Your PDF files would have to be interpreted
by Tika; see the Data Import Handler, FileListEntityProcessor and
TikaEntityProcessor. I don't think Nutch is the right tool here.

You'll be wanting to do highlighting and a couple of other things....
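
As a rough sketch of what the highlighting request could look like, here is
a small Python helper that builds the query URL. The hl, hl.fl, hl.snippets,
hl.fragsize, fq, and df parameters are standard Solr request parameters;
everything else is an assumption for illustration: the field names "content"
and "source", the localhost URL, and the helper name build_search_url are all
hypothetical and depend on how you set up your schema. Setting hl.snippets
above 1 is what should give you several context fragments per PDF.

```python
from urllib.parse import urlencode


def build_search_url(base_url, terms, source=None, snippets=5):
    """Build a Solr select URL that asks for highlighted context snippets."""
    params = [
        ("q", terms),
        ("df", "content"),               # assumed field holding the Tika-extracted text
        ("hl", "true"),                  # turn highlighting on
        ("hl.fl", "content"),            # highlight in the same (assumed) field
        ("hl.snippets", str(snippets)),  # up to this many fragments per document
        ("hl.fragsize", "200"),          # approximate fragment length in characters
        ("wt", "json"),
    ]
    if source:
        # Restrict results to one collection, e.g. a single magazine.
        # "source" is a hypothetical field you would populate at index time.
        params.append(("fq", 'source:"%s"' % source))
    return base_url.rstrip("/") + "/select?" + urlencode(params)


url = build_search_url(
    "http://localhost:8983/solr",  # assumed local Solr instance
    "nutrition USDA",
    source="Linux Journal",
)
print(url)
```

The filter query (fq) is how I'd handle "restrict the search to the Linux
Journal PDFs"; it narrows the result set without affecting relevance scoring.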

You'll spend some time tweaking results to be what you want, but this
is certainly do-able.

Best
Erick

On Tue, Jul 19, 2011 at 1:29 PM, Matthew Twomey <mtwo...@beakstar.com> wrote:
> Greetings,
>
> I'm interested in having a server based personal document library with a
> few specific features and I'm trying to determine what the most appropriate
> tools are to build it.
>
> I have the following content which I wish to include in the archive:
>
> 1. A smallish collection of technical books in PDF format (around 100)
> 2. Many years of several different magazine subscriptions in PDF format
> (probably another 100 - 200 PDFs)
> 3. Several years of personal documents which were scanned in and converted
> to searchable PDF format (300 - 500 documents)
> 4. I also have local mirrors of several HTML based reference sites
>
> I'd like to have the ability to index all of this content and search it from
> a web form (so that I and a few others can reach it from multiple locations).
> Here are two examples of the functionality I'm looking for:
>
> Scenario 1. "What was that software that has all the nutritional data and
> hooks up to some USDA database? I know I read about it in one of my Linux
> Journals last year....."
>
> Now I'd like to be able to pull up the webform and search for "nutrition
> USDA". I'd like to restrict the search to the Linux Journal magazine PDFs
> (or refine the results). I'd like results to contain context snippets with
> each search result. Finally most importantly, I'd like multiple results per
> PDF (or all occurrences). The last one is important so that I can actually
> quickly find the right issue (in case there is some advertisement in every
> issue for the last year that contains those terms). When I click on the
> desired result, the PDF is downloaded by my browser.
>
> Scenario 2. "How much have I been paying for property taxes for the last
> five years again?" (the bills are all scanned in)
>
> In this case I'd like to search for my property identification number (which
> is on the bills) and the results should show all the documents that have it,
> with context. Clicking on results downloads the documents. I assume this
> example is simple to achieve if example 1 can be done.
>
> So in general, my question is - can this be done in a fairly
> straightforward manner with Solr? Is there a more appropriate tool to be using (e.g.
> Nutch?). Also, I have looked high and low for a free, already baked solution
> which can do scenario 1 but haven't been able to find something - so if
> someone knows of such a thing, please let me know.
>
> Thanks!
>
> -Matt
>
