Re: How to Index Custom XML structure

Anupam Bhattacharya Tue, 06 Mar 2012 09:25:24 -0800

Thanks, Erick, for the prompt response.

Both suggestions would be useful for a one-time indexing activity. Since
DIH would be a one-time process of indexing the repository, it is of no
use in my case. Writing a standalone Java program using SolrJ would again
be a one-time indexing process.


I want to write a separate handler that will be called by the ManifoldCF
job to create indexes in SOLR. In my case the repository is a Documentum
Content Server. I found a relevant example at
https://community.emc.com/docs/DOC-6520, which is quite similar to my
requirement.

I modified the code to parse the XML and add the values to the document
properties. This works fine when I test it with my curl program with
parameters, but when the same handler is called from a ManifoldCF job, the
job gets terminated within a few minutes. I am not sure of the reason for
that. The handler is written along the lines of /update/extract, which is
ExtractingRequestHandler.
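
For reference, the parsing step described above can be sketched roughly as
follows. This is only a minimal illustration using the JDK's built-in DOM
parser, not the actual handler code: the class name RecordFieldExtractor and
the method extractFields are hypothetical, and it assumes each wanted tag
(subject, abstract, doc_id) occurs at most once per RECORD.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: pulls selected element values out of the custom
// RECORD XML so they can be added to the document properties.
final class RecordFieldExtractor {

    // Returns tag name -> trimmed text for each requested tag that is
    // present and non-empty. Only the first occurrence of a tag is used.
    static Map<String, String> extractFields(String xml, String... tagNames)
            throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Refuse DOCTYPE declarations, since the content comes from a crawl.
        factory.setFeature(
                "http://apache.org/xml/features/disallow-doctype-decl", true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));

        Map<String, String> fields = new LinkedHashMap<>();
        for (String tag : tagNames) {
            NodeList nodes = doc.getElementsByTagName(tag);
            if (nodes.getLength() == 0) {
                continue; // tag not present in this record
            }
            String text = nodes.item(0).getTextContent().trim();
            if (!text.isEmpty()) {
                fields.put(tag, text); // skip empty elements like <log_entry />
            }
        }
        return fields;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<RECORD>"
                + "<doc_id>1</doc_id>"
                + "<abstract>This is an abstract.</abstract>"
                + "<subject>Text Subject</subject>"
                + "<availability />"
                + "</RECORD>";
        // Print the extracted name/value pairs.
        System.out.println(extractFields(xml, "doc_id", "subject", "abstract"));
    }
}
```

Each entry of the returned map could then be copied into the document being
indexed, for example with SolrInputDocument.addField(name, value) if SolrJ
is used on the indexing side.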

Is ExtractingRequestHandler capable of extracting tag names and values,
using some of its defined parameters such as capture, captureAttr,
extractOnly, etc., so that they can be added to the document indexes?


On Tue, Feb 28, 2012 at 8:26 AM, Erick Erickson <erickerick...@gmail.com> wrote:

> You might be able to do something with the XSL Transformer step in DIH.
>
> It might also be easier to just write a SolrJ program to parse the XML and
> construct a SolrInputDocument to send to Solr. It's really pretty
> straightforward.
>
> Best
> Erick
>
> On Sun, Feb 26, 2012 at 11:31 PM, Anupam Bhattacharya
> <anupam...@gmail.com> wrote:
> > Hi,
> >
> > I am using ManifoldCF to Crawl data from Documentum repository. I am able
> > to successfully read the metadata/properties for the defined document
> > types in Documentum using the out-of-the box Documentum Connector in
> > ManifoldCF.
> > Unfortunately, there is one XML file also present which consists of a
> > custom XML structure which I need to read and fetch the element values
> > and add it for indexing in lucene through SOLR.
> >
> > Is there any mechanism to index any XML structure document in SOLR ?
> >
> > I checked the SOLR CELL framework, which supports the structure below:
> >
> > <add>
> >  <doc>
> >    <field name="id">9885A004</field>
> >    <field name="name">Canon PowerShot SD500</field>
> >    <field name="category">camera</field>
> >    <field name="features">3x optical zoom</field>
> >    <field name="features">aluminum case</field>
> >    <field name="weight">6.4</field>
> >    <field name="price">329.95</field>
> >  </doc>
> >  <doc>
> >    <field name="id">9885A003</field>
> >    <field name="name">Canon PowerShot SD504</field>
> >    <field name="category">camera1</field>
> >    <field name="features">3x optical zoom1</field>
> >    <field name="features">aluminum case1</field>
> >    <field name="weight">6.41</field>
> >    <field name="price">329.956</field>
> >  </doc>
> > </add>
> >
> > & my Custom XML structure is of the following format.. from which I need
> > to read *subject* & *abstract* field for indexing. I checked TIKA project
> > but I couldn't find any useful stuff.
> >
> > <?xml version="1.0" encoding="UTF-8"?>
> > <RECORD>
> > <doc_id>1</doc_id>
> > <abstract>This is an abstract.</abstract>
> > <subject>Text Subject</subject>
> > <availability />
> > <indexing>
> > <index_group></index_group>
> > <keyterms></keyterms>
> > <keyterms></keyterms>
> > </indexing>
> > <publication_date></publication_date>
> > <physical_storage />
> > <log_entry />
> > <legal_category />
> > <legal_category_notes />
> > <citation_only></citation_only>
> > <citation_only_desc />
> > <export_control />
> > <export_control_desc />
> > </RECORD>
> >
> > Appreciate any help on this.
> >
> > Regards
> > Anupam
>



-- 
Thanks & Regards
Anupam Bhattacharya
