Pete,

Thanks for the great explanation.
Thinking it through for my process, I am not sure how to use it:

I have a bunch of docs that mostly contain meta-data, some of which
include full-text files (.pdf, .ppt, etc.). I can index/update these
docs into Solr correctly. The next step is to somehow index the text
from the full-text files. One way to think about it: I could have a
placeholder field 'data', keep it empty for the first pass, and then
run update/rich to index the actual full text, using the same unique
doc id. But this would actually overwrite the doc in the index,
wouldn't it? And there really isn't a 'merge' operation, right?

There might be a better way to use this full-text indexing option,
schema-wise, say:

<richData source="FIELDNAME" dest="FIELDNAME" />

- have a new option richData that takes a source field name,
- validate its value (valid filename/file),
- recognize the file type,
- and put the 'data' into another field

What do you think? I am not a true Java developer, so I am not sure I
could do it myself; I can only hope that someone else on the project
could ;-)...

Rao

On 8/21/07, Peter Manis <[EMAIL PROTECTED]> wrote:
>
> Installing the patch requires downloading the latest Solr via
> Subversion and applying the patch to the source. Eric has updated his
> patch for various Subversion revisions. To make sure it will
> compile I suggest getting the revision he lists.
>
> As for using the features of this patch, this is the URL that would be
> called:
>
> /solr/update/rich?stream.file=filename&stream.type=filetype&id=id&stream.fieldname=storagefield&fieldnames=cat,desc,type,name&type=filetype&cat=category&name=name&desc=description
>
> Breaking this down:
>
> You have stream.file, which is the absolute path to the file you
> want to index. You then have stream.type, which specifies the type of
> file; pdf, xls, doc and ppt are currently supported. The next field is
> the id, which is where you specify the unique value for the id in your
> schema.
> For example, we had a document reference in a database, and
> that id was 103, so we would specify the value 103 to identify which
> document it was in the index. stream.fieldname is the name of the
> field in your index that will actually store the text from the
> document. We had the field 'data', so it would be
> stream.fieldname=data in the URL.
>
> The parameter fieldnames lists any additional fields in your index that
> need to be filled. We were passing a category, a description for the
> document, a name, and the type. So you just need to specify the names
> of the fields. Solr will then look for corresponding parameters with
> those names, which you can see at the end of my URL. The values
> passed for the additional parameters need to be sent URL-encoded.
>
> I'm not a Java programmer, so if you have questions about the internals
> of the code, definitely direct those to Eric, as I cannot help. I have
> only implemented it in web applications. If you have any other
> questions about the use of the patch, I can answer those.
>
> Enjoy!
>
> - Pete
>
> On 8/21/07, Vish D. <[EMAIL PROTECTED]> wrote:
> > There seems to be some code out for Tika now (not packaged/announced
> > yet, but...). Could someone please take a look at it and see if that
> > could fit in? I am eagerly waiting for a reply back from tika-dev,
> > but no luck yet.
> >
> > http://svn.apache.org/repos/asf/incubator/tika/trunk/src/main/java/org/apache/tika/
> >
> > I see that Eric's patch uses POI (for most of it)...so that's great!
> > I have seen too many duplicated efforts, even in Apache projects
> > alone, and this is one step closer to fixing it (other than Tika,
> > which isn't 'complete' yet).
> > Are there any plans on releasing this patch with Solr dist? Or, any
> > instructions on using/installing the patch itself?
> >
> > Thanks
> > Vish
> >
> >
> > On 8/21/07, Peter Manis <[EMAIL PROTECTED]> wrote:
> > >
> > > Christian,
> > >
> > > Eric Pugh implemented this functionality for a project we were
> > > doing and has released the code on JIRA. We have had very good
> > > results with it. If I can be of any help using it beyond the Java
> > > code itself, let me know. The last revision I used with it was
> > > 552853, so if the build happens to fail you can roll back to that
> > > and it will work.
> > >
> > > https://issues.apache.org/jira/browse/SOLR-284
> > >
> > > - Pete
> > >
> > > On 8/21/07, Christian Klinger <[EMAIL PROTECTED]> wrote:
> > > > Hi Solr Users,
> > > >
> > > > I have set up a Solr server with a custom schema.
> > > > Now I have updated the index with some content from
> > > > XML files.
> > > >
> > > > Now I am trying to update the contents of a folder.
> > > > The folder consists of various document types
> > > > (pdf, doc, xls, ...).
> > > >
> > > > Is there a howto anywhere on how I can parse the
> > > > documents, make an XML of the parsed content,
> > > > and post it to the Solr server?
> > > >
> > > > Thanks in advance.
> > > >
> > > > Christian
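
For anyone following the thread, here is a minimal Python sketch of how the update/rich URL Pete describes could be assembled with the values URL-encoded. It only builds the string and does not contact Solr. The parameter names (stream.file, stream.type, id, stream.fieldname, fieldnames) and the id 103 / field 'data' come from Pete's message; the function name, the base URL, the file path and the extra field values are made up for illustration, and the handler itself only exists once the SOLR-284 patch is applied.

```python
from urllib.parse import urlencode, parse_qs, urlsplit


def build_rich_update_url(base, path, file_type, doc_id, text_field, extra_fields):
    """Assemble an update/rich URL from the parameters named in the thread.

    build_rich_update_url is a hypothetical helper, not part of Solr.
    extra_fields maps additional index field names to their values.
    """
    params = {
        "stream.file": path,               # absolute path of the file to index
        "stream.type": file_type,          # pdf, xls, doc or ppt per the thread
        "id": doc_id,                      # unique id value from the schema
        "stream.fieldname": text_field,    # field that stores the extracted text
        "fieldnames": ",".join(extra_fields),  # names of the additional fields
    }
    params.update(extra_fields)            # one parameter per additional field
    # urlencode percent-encodes every value, as the thread says is required.
    return base + "/update/rich?" + urlencode(params)


# Assumed values: a local Solr base URL and an invented document.
url = build_rich_update_url(
    "http://localhost:8983/solr",
    "/data/docs/report 2007.pdf",
    "pdf",
    "103",
    "data",
    {"cat": "reports", "desc": "Q2 summary & notes", "type": "pdf", "name": "report"},
)
print(url)
```

Characters such as the space and the '&' in the description are escaped, so they cannot be mistaken for parameter separators when the URL is parsed.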