Re: Indexing Doc, PDF, ... from filesystem (Newbie Question)

Vish D. Tue, 21 Aug 2007 09:44:53 -0700

Pete,

Thanks for the great explanation.


Thinking it through for my own process, I am not sure how to use it:

I have a bunch of docs that mostly contain meta-data, some of which include
full-text files (.pdf, .ppt, etc.). I already index/update these docs into
Solr correctly. The next step is to somehow index the text from the
full-text files. One way to think about it: I could have a placeholder field
'data', keep it empty for the first pass, and then run update/rich to index
the actual full text using the same unique doc id. But this would actually
overwrite the doc in the index, wouldn't it? And there really isn't a 'merge'
operation, right?
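
If I read the explanation below correctly, one way around the overwrite would be to send the meta-data and the file together in a single update/rich call. Here is a minimal Python sketch of building that URL from the parameters Pete describes. The helper name, host, port, and sample values are my own assumptions, and I have not checked it against the patch itself:

```python
from urllib.parse import urlencode

def build_rich_update_url(base, path, filetype, doc_id, text_field, metadata):
    """Build an update/rich URL that carries the file and its metadata together.

    `metadata` maps extra field names to values; its keys are also listed in
    the `fieldnames` parameter, as in the URL quoted below.
    """
    params = [
        ("stream.file", path),            # absolute path to the file
        ("stream.type", filetype),        # pdf, xls, doc or ppt
        ("id", doc_id),                   # unique id from the schema
        ("stream.fieldname", text_field), # field that stores the extracted text
        ("fieldnames", ",".join(metadata)),
    ]
    params.extend(metadata.items())
    # urlencode takes care of URL-encoding every value
    return base + "/update/rich?" + urlencode(params)

# Hypothetical host, path and field values, for illustration only
url = build_rich_update_url(
    "http://localhost:8983/solr", "/docs/report.pdf", "pdf", "103", "data",
    {"cat": "reports", "name": "Q2 report"},
)
print(url)
```

That way the whole doc is written in one pass and there is nothing to merge afterwards.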

There might be a better way to use this full-text indexing option,
schema-wise, say:
<richData source="FIELDNAME" dest="FIELDNAME" />
- have a new option richData that will take in a source field name,
- validate its value (valid filename/file),
- recognize the file type,
- and put the 'data' into another field
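
The steps above could be sketched roughly like this in Python. Everything here is hypothetical: richData does not exist, the function and field names are made up, and the type detection is just a lookup on the file extension, limited to the types the patch is said to support. The text extraction itself is left to update/rich:

```python
import os

# File types the patch is said to support, keyed by extension
SUPPORTED_TYPES = {".pdf": "pdf", ".xls": "xls", ".doc": "doc", ".ppt": "ppt"}

def resolve_rich_source(doc, source, dest):
    """Hypothetical richData step: validate the source field and detect the type.

    Returns the stream.* parameters for update/rich, or None when the source
    field is missing, is not an existing file, or has an unsupported type.
    """
    path = doc.get(source)
    if not path or not os.path.isfile(path):
        return None
    ext = os.path.splitext(path)[1].lower()
    filetype = SUPPORTED_TYPES.get(ext)
    if filetype is None:
        return None
    return {
        "stream.file": os.path.abspath(path),
        "stream.type": filetype,
        "stream.fieldname": dest,
    }
```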

What do you think?  I am not a true Java developer, so not sure if I could
do it myself, but only hope that someone else on the project could ;-)...

Rao

On 8/21/07, Peter Manis <[EMAIL PROTECTED]> wrote:
>
> Installing the patch requires downloading the latest solr via
> subversion and applying the patch to the source.  Eric has updated his
> patch with various revisions of subversion.  To make sure it will
> compile I suggest getting the revision he lists.
>
> As for using the features of this patch, this is the URL that would be
> called:
>
>
> /solr/update/rich?stream.file=filename&stream.type=filetype&id=id&stream.fieldname=storagefield&fieldnames=cat,desc,type,name&type=filetype&cat=category&name=name&desc=description
>
> Breaking this down
>
> You have stream.file, which will be the absolute path to the file you
> want to index.  You then have stream.type, which specifies the type of
> file; the supported types are currently pdf, xls, doc, and ppt.  The
> next field is the id, which is where you specify the unique value for
> the id in your schema.  For example, we had a document reference in a
> database with the id 103, so we would specify the value 103 to identify
> which document it was in the index.  stream.fieldname is the name of the
> field in your index that will actually be storing the text from the
> document.  We had the field 'data', so it would be
> stream.fieldname=data in the URL.
>
> The parameter fieldnames lists any additional fields in your index that
> need to be filled.  We were passing a category, a description for the
> document, a name, and the type.  So you just need to specify the names
> of the fields.  Solr will then look for corresponding parameters with
> those names, which you can see at the end of my URL.  The values
> passed for the additional parameters need to be sent URL-encoded.
>
> I'm not a Java programmer so if you have questions about the internals
> of the code, definitely direct those to Eric as I cannot help.  I have
> only implemented it in web applications.  If you have any other
> questions about the use of the patch I can answer those questions.
>
> Enjoy!
>
> - Pete
>
> On 8/21/07, Vish D. <[EMAIL PROTECTED]> wrote:
> > There seems to be some code out for Tika now (not packaged/announced yet,
> > but...). Could someone please take a look at it and see if that could fit
> > in? I am eagerly waiting for a reply back from tika-dev, but no luck yet.
> >
> > http://svn.apache.org/repos/asf/incubator/tika/trunk/src/main/java/org/apache/tika/
> >
> > I see that Eric's patch uses POI (for most of it)...so that's great! I have
> > seen too many duplicated efforts, even in Apache projects alone, and this is
> > one step closer to fixing it (other than Tika, which isn't 'complete' yet).
> > Are there any plans on releasing this patch with Solr dist? Or, any
> > instructions on using/installing the patch itself?
> >
> > Thanks
> > Vish
> >
> >
> > On 8/21/07, Peter Manis <[EMAIL PROTECTED]> wrote:
> > >
> > > Christian,
> > >
> > > Eric Pugh implemented this functionality for a project we were
> > > doing and has released the code on JIRA.  We have had very good results
> > > with it.  If I can be of any help using it beyond the Java code itself
> > > let me know.  The last revision I used with it was 552853, so if the
> > > build happens to fail you can roll back to that and it will work.
> > >
> > > https://issues.apache.org/jira/browse/SOLR-284
> > >
> > > - Pete
> > >
> > > On 8/21/07, Christian Klinger <[EMAIL PROTECTED]> wrote:
> > > > Hi Solr Users,
> > > >
> > > > I have set up a Solr server with a custom schema.
> > > > Now I have updated the index with some content from
> > > > XML files.
> > > >
> > > > Now I am trying to update the contents of a folder.
> > > > The folder consists of various document types
> > > > (pdf, doc, xls, ...).
> > > >
> > > > Is there a howto anywhere on how I can parse the
> > > > documents, make an XML of the parsed content,
> > > > and post it to the Solr server?
> > > >
> > > > Thanks in advance.
> > > >
> > > > Christian
> > > >
> > > >
> > >
> >
>
