RE: Indexing Doc, PDF, ... from filesystem (Newbie Question)

2007-08-21 Thread Teruhiko Kurosaka
Christian, This is interesting. I have always thought that Solr shouldn't be in the business of parsing; that is the responsibility of the Solr client. But what Peter suggested, adding a parsing capability to Solr as a request handler, does make sense. One thing that I noticed this approach ca

Re: Indexing Doc, PDF, ... from filesystem (Newbie Question)

2007-08-21 Thread Peter Manis
I can't find the documentation, but I believe Apache's max URL length is 8192, so I would assume a lot of other apps like Tomcat and Jetty would be similar. I haven't run into any problems yet. Maybe shoot Eric an email and see if he would be interested in adapting the code to take XML as well so that you
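
A minimal sketch of how a client could guard against the URL limit mentioned above, assuming the field values travel as query parameters. The 8192 figure is Peter's recollection and is not verified here; the base URL and the parameter names in the usage note are hypothetical placeholders.

```python
from urllib.parse import urlencode

# Assumption: 8192 is the figure recalled in the message above, not a verified limit.
ASSUMED_MAX_URL_LENGTH = 8192


def build_url(base, params):
    """Append URL-encoded query parameters to a base URL."""
    return base + "?" + urlencode(params)


def fits_in_url(base, params, limit=ASSUMED_MAX_URL_LENGTH):
    """Return True if the fully encoded URL stays within the assumed limit."""
    return len(build_url(base, params)) <= limit
```

For example, `fits_in_url("http://localhost:8983/solr/update", {"id": "42", "title": "Q3 report"})` is true, while a single field value of several thousand characters pushes the URL over the assumed limit. In that case the data would have to go in the request body instead.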

Re: Indexing Doc, PDF, ... from filesystem (Newbie Question)

2007-08-21 Thread Vish D.
On 8/21/07, Vish D. <[EMAIL PROTECTED]> wrote: > > On 8/21/07, Peter Manis <[EMAIL PROTECTED]> wrote: > > > > I am a little confused how you have things setup, so these meta data > > files contain certain information and there may or may not be a pdf, > > xls, doc that it is associated with? > > >

Re: Indexing Doc, PDF, ... from filesystem (Newbie Question)

2007-08-21 Thread Vish D.
On 8/21/07, Peter Manis <[EMAIL PROTECTED]> wrote: > > I am a little confused how you have things setup, so these meta data > files contain certain information and there may or may not be a pdf, > xls, doc that it is associated with? Yes, you have it right. If that is the case, if it were me I w

Re: Indexing Doc, PDF, ... from filesystem (Newbie Question)

2007-08-21 Thread Peter Manis
I am a little confused about how you have things set up. So these metadata files contain certain information and there may or may not be a pdf, xls, doc that it is associated with? If that is the case, if it were me I would write something to parse the metadata files, and if there is a binary file asso
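
The approach described above (parse each metadata file, then check for an associated binary file) could be sketched roughly as follows. Everything specific here is an assumption for illustration: the `key=value` metadata layout, the convention that the binary shares the metadata file's base name, and the `text` field name. Text extraction from the binary is left to the caller. Only the `<add><doc><field name="...">` shape is Solr's standard XML update format.

```python
from pathlib import Path
from xml.etree import ElementTree as ET

# Assumption: associated binaries share the metadata file's base name.
BINARY_EXTENSIONS = (".pdf", ".xls", ".doc")


def parse_meta(meta_path):
    """Parse a hypothetical key=value metadata file into a dict."""
    fields = {}
    for line in Path(meta_path).read_text().splitlines():
        if "=" in line:
            key, value = line.split("=", 1)
            fields[key.strip()] = value.strip()
    return fields


def find_binary(meta_path):
    """Return the sibling binary file for a metadata file, or None."""
    for ext in BINARY_EXTENSIONS:
        candidate = Path(meta_path).with_suffix(ext)
        if candidate.exists():
            return candidate
    return None


def build_add_xml(fields, body_text=None):
    """Build a Solr <add><doc> update message from the metadata fields.

    body_text is the already-extracted text of the binary, if any;
    it is stored in a field named "text" (a hypothetical schema field).
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        ET.SubElement(doc, "field", name=name).text = value
    if body_text is not None:
        ET.SubElement(doc, "field", name="text").text = body_text
    return ET.tostring(add, encoding="unicode")
```

Keeping extraction outside these helpers means the same metadata path works whether or not a document has a binary attached.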

Re: Indexing Doc, PDF, ... from filesystem (Newbie Question)

2007-08-21 Thread Vish D.
Pete, Thanks for the great explanation. Thinking it through for my process, I am not sure how to use it: I have a bunch of docs that pretty much contain a lot of meta-data, some of which include full-text files (.pdf, .ppt, etc...). I use these docs correctly to index/update into Solr. The next step no

Re: Indexing Doc, PDF, ... from filesystem (Newbie Question)

2007-08-21 Thread Peter Manis
Installing the patch requires downloading the latest Solr via Subversion and applying the patch to the source. Eric has updated his patch with various revisions of Subversion. To make sure it will compile I suggest getting the revision he lists. As for using the features of this patch. This is

Re: Indexing Doc, PDF, ... from filesystem (Newbie Question)

2007-08-21 Thread Vish D.
There seems to be some code out for Tika now (not packaged/announced yet, but...). Could someone please take a look at it and see if that could fit in? I am eagerly waiting for a reply back from tika-dev, but no luck yet. http://svn.apache.org/repos/asf/incubator/tika/trunk/src/main/java/org/apach

Re: Indexing Doc, PDF, ... from filesystem (Newbie Question)

2007-08-21 Thread Peter Manis
Christian, Eric Pugh implemented this functionality for a project we were doing and has released the code on JIRA. We have had very good results with it. If I can be of any help using it beyond the Java code itself let me know. The last revision I used with it was 552853, so if the build

Indexing Doc, PDF, ... from filesystem (Newbie Question)

2007-08-21 Thread Christian Klinger
Hi Solr Users, I have set up a Solr server with a custom schema. I have updated the index with some content from XML files. Now I am trying to index the contents of a folder. The folder consists of various document types (pdf, doc, xls, ...). Is there a how-to anywhere on how I can parse the documents,