Question on the appropriate software

2011-07-19 Thread Matthew Twomey

Greetings,

I'm interested in having a server-based personal document library with 
a few specific features, and I'm trying to determine which tools are 
most appropriate for building it.


I have the following content which I wish to include in the archive:

1. A smallish collection of technical books in PDF format (around 100)
2. Many years of several different magazine subscriptions in PDF format 
(probably another 100 - 200 PDFs)
3. Several years of personal documents which were scanned in and 
converted to searchable PDF format (300 - 500 documents)

4. I also have local mirrors of several HTML based reference sites

I'd like to be able to index all of this content and search it from a 
web form (so that I and a few others can reach it from multiple 
locations). Here are two examples of the functionality I'm looking for:


Scenario 1. "What was that software that has all the nutritional data 
and hooks up to some USDA database? I know I read about it in one of my 
Linux Journals last year."


Now I'd like to be able to pull up the web form and search for 
"nutrition USDA". I'd like to restrict the search to the Linux Journal 
magazine PDFs (or refine the results). I'd like each search result to 
come with context snippets. Finally, and most importantly, I'd like 
multiple results per PDF (or all occurrences). The last one is important 
so that I can quickly find the right issue (in case there is some 
advertisement in every issue for the last year that contains those 
terms). When I click on the desired result, the PDF is downloaded by my 
browser.


Scenario 2. "How much have I been paying for property taxes for the last 
five years again?" (the bills are all scanned in)


In this case I'd like to search for my property identification number 
(which is on the bills) and the results should show all the documents 
that have it, with context. Clicking on results downloads the documents. 
I assume this scenario is simple to achieve if scenario 1 can be done.


So in general, my question is: can this be done in a fairly 
straightforward manner with Solr? Is there a more appropriate tool to 
use (e.g. Nutch)? Also, I have looked high and low for a free, 
ready-made solution which can do scenario 1 but haven't been able to 
find one, so if someone knows of such a thing, please let me know.


Thanks!

-Matt


Solr not returning results for some key words

2011-07-20 Thread Matthew Twomey

Greetings,

I'm having trouble getting Solr to return results for keywords that I 
know for sure are in the index. As a test, I've indexed a PDF of a book 
on Java. I'm trying to search the index for 
"UnsupportedOperationException" but I get no results. I can "see" it in 
the index though:


#
[root@myhost apache-solr-1.4.1]# strings example/solr/data/index/_0.fdt|grep UnsupportedOperationException

UnsupportedOperationException if the iterator returned by this collec-
throw new UnsupportedOperationException();
UnsupportedOperationException Object does not support method
CHAPTER 9 EXCEPTIONS

UnsupportedOperationException, 87,
[root@myhost apache-solr-1.4.1]#
#

On the other hand, if I search the index for the word "support" (which 
is also contained in the grep above), I get a hit on this document. 
Furthermore, if I search on "support" and include highlighted snippets, 
I can see the word "UnsupportedOperationException" right in there in the 
highlight results!


#
of an object has
been detected where it is prohibited
UnsupportedOperationException Object does not support
#

So why do I get no hits when I search for it?

This happens with many different keywords. Any thoughts on how I can 
troubleshoot this, or ideas on why it's not working properly?


Thanks,

-Matt


Re: Solr not returning results for some key words

2011-07-20 Thread Matthew Twomey
OK, apparently I'm not the first to have fallen prey to the 
maxFieldLength gotcha:


http://lucene.472066.n3.nabble.com/Solr-ignoring-maxFieldLength-td473263.html

All fixed now.

-Matt
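
For readers who hit the same symptom, here is a rough toy model of what a 
maxFieldLength limit does. It is only a sketch and not Solr code: the 
function names and the tiny limit of 10 are made up for illustration (the 
example solrconfig.xml shipped with Solr 1.4 uses 10000, if memory 
serves). The idea is that only the first N tokens of a field are indexed, 
while the stored copy of the field is kept in full, which would explain 
why "strings" on the .fdt file finds a word that a search does not.

```python
def index_document(text, max_field_length):
    """Toy model: return (stored_text, indexed_terms) for one field.

    The stored text is kept whole; only the first max_field_length
    tokens are added to the searchable term set.
    """
    tokens = text.lower().split()
    indexed_terms = set(tokens[:max_field_length])  # later tokens are dropped
    return text, indexed_terms


def search(term, indexed_terms):
    """Return True if the term was indexed (case-insensitive)."""
    return term.lower() in indexed_terms


# A pretend "book": an early word, some filler, and a late keyword.
book = "support " + "filler " * 20 + "UnsupportedOperationException"

# Hypothetical small limit: the late keyword falls outside the indexed tokens.
stored, terms = index_document(book, max_field_length=10)
print("UnsupportedOperationException" in stored)        # still in stored text
print(search("UnsupportedOperationException", terms))   # not searchable
print(search("support", terms))                         # early word is found

# Raising the limit and reindexing makes the late keyword searchable.
stored, terms = index_document(book, max_field_length=1000)
print(search("UnsupportedOperationException", terms))
```

In a real setup the equivalent step would be raising maxFieldLength in 
solrconfig.xml and then reindexing the documents, since tokens that were 
dropped at index time are not recovered by changing the setting alone.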


Re: Question on the appropriate software

2011-07-20 Thread Matthew Twomey
Excellent, thanks for the confirmation, Erick. I've started working with 
Solr (just getting my feet wet at this point).


-Matt

On 07/20/2011 05:38 PM, Erick Erickson wrote:

Solr would work fine for this. Your PDF files would have to be interpreted
by Tika, but see DataImportHandler, FileListEntityProcessor and
TikaEntityProcessor. I don't think Nutch is quite the tool here.

You'll be wanting to do highlighting and a couple of other things.

You'll spend some time tweaking results to be what you want, but this
is certainly doable.

Best
Erick
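
As a rough illustration of the highlighting part, the sketch below builds 
a query URL for scenario 1 from the original question. The request 
parameters (q, fq, hl, hl.fl, hl.snippets, hl.fragsize, wt) are standard 
Solr parameters, but everything else is an assumption made for the 
example: the host and port, the "text" field holding the extracted PDF 
content, and the "collection" field used to tag each document with its 
magazine title. The function name is hypothetical too.

```python
from urllib.parse import urlencode


def build_search_url(terms, collection=None, snippets=5,
                     base="http://localhost:8983/solr/select"):
    """Build a Solr select URL with highlighting enabled.

    `base`, the `text` field and the `collection` field are assumptions
    for this sketch; adjust them to match the actual schema.
    """
    params = [
        ("q", terms),
        ("hl", "true"),                  # turn highlighting on
        ("hl.fl", "text"),               # field to take snippets from
        ("hl.snippets", str(snippets)),  # up to this many snippets per document
        ("hl.fragsize", "150"),          # approximate snippet size in characters
        ("wt", "json"),
    ]
    if collection:
        # Filter query: restrict results to one magazine or collection.
        params.append(("fq", 'collection:"%s"' % collection))
    return base + "?" + urlencode(params)


url = build_search_url("nutrition USDA", collection="Linux Journal")
print(url)
```

Raising hl.snippets is what gives several context snippets per PDF. 
Getting every occurrence listed as its own result would probably need 
something more, such as indexing each page as a separate document, and 
that is a design choice this sketch does not cover.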
