Well, I'm ManifoldCF ignorant, so I'll have to defer on this one...

Best
Erick

On Tue, Mar 6, 2012 at 12:24 PM, Anupam Bhattacharya <anupam...@gmail.com> wrote:
> Thanks Erick, for the prompt response,
>
> Both suggestions would be useful for a one-time indexing activity. Since
> DIH is a one-time process of indexing the repository, it is of no use in
> my case. Writing a standalone Java program utilizing SolrJ would again be
> a one-time indexing process.
>
> I want to write a separate Handler which will be called by the ManifoldCF
> Job to create indexes in SOLR. In my case the repository is a Documentum
> Content server. I found a relevant link at this url:
> https://community.emc.com/docs/DOC-6520 which is quite similar to my
> requirement.
>
> I modified the code to parse the XML and add the values to the document
> properties. This works fine when I test it with my CURL program with
> parameters, but when the same handler is called from the ManifoldCF job,
> the job gets terminated within a few minutes. I am not sure of the reason
> for that. The handler is written similar to /update/extract, which is
> ExtractingRequestHandler.
>
> Is ExtractingRequestHandler capable of extracting tag names and values
> using some of its defined attributes like capture, captureAttr,
> extractOnly etc., which can then be added to the document indexes?
>
>
> On Tue, Feb 28, 2012 at 8:26 AM, Erick Erickson
> <erickerick...@gmail.com> wrote:
>
>> You might be able to do something with the XSL Transformer step in DIH.
>>
>> It might also be easier to just write a SolrJ program to parse the XML and
>> construct a SolrInputDocument to send to Solr. It's really pretty
>> straightforward.
>>
>> Best
>> Erick
>>
>> On Sun, Feb 26, 2012 at 11:31 PM, Anupam Bhattacharya
>> <anupam...@gmail.com> wrote:
>> > Hi,
>> >
>> > I am using ManifoldCF to crawl data from a Documentum repository. I am
>> > able to successfully read the metadata/properties for the defined
>> > document types in Documentum using the out-of-the-box Documentum
>> > Connector in ManifoldCF.
>> > Unfortunately, there is also one XML file present which consists of a
>> > custom XML structure; I need to read it, fetch the element values and
>> > add them for indexing in Lucene through SOLR.
>> >
>> > Is there any mechanism to index an arbitrary XML document in SOLR?
>> >
>> > I checked the SOLR CELL framework, which supports the structure below:
>> >
>> > <add>
>> >   <doc>
>> >     <field name="id">9885A004</field>
>> >     <field name="name">Canon PowerShot SD500</field>
>> >     <field name="category">camera</field>
>> >     <field name="features">3x optical zoom</field>
>> >     <field name="features">aluminum case</field>
>> >     <field name="weight">6.4</field>
>> >     <field name="price">329.95</field>
>> >   </doc>
>> >   <doc>
>> >     <field name="id">9885A003</field>
>> >     <field name="name">Canon PowerShot SD504</field>
>> >     <field name="category">camera1</field>
>> >     <field name="features">3x optical zoom1</field>
>> >     <field name="features">aluminum case1</field>
>> >     <field name="weight">6.41</field>
>> >     <field name="price">329.956</field>
>> >   </doc>
>> > </add>
>> >
>> > My custom XML structure has the following format, from which I need to
>> > read the *subject* & *abstract* fields for indexing. I checked the TIKA
>> > project but I couldn't find anything useful.
>> >
>> > <?xml version="1.0" encoding="UTF-8"?>
>> > <RECORD>
>> >   <doc_id>1</doc_id>
>> >   <abstract>This is an abstract.</abstract>
>> >   <subject>Text Subject</subject>
>> >   <availability />
>> >   <indexing>
>> >     <index_group></index_group>
>> >     <keyterms></keyterms>
>> >     <keyterms></keyterms>
>> >   </indexing>
>> >   <publication_date></publication_date>
>> >   <physical_storage />
>> >   <log_entry />
>> >   <legal_category />
>> >   <legal_category_notes />
>> >   <citation_only></citation_only>
>> >   <citation_only_desc />
>> >   <export_control />
>> >   <export_control_desc />
>> > </RECORD>
>> >
>> > Appreciate any help on this.
>> >
>> > Regards
>> > Anupam
>> >
>
>
>
> --
> Thanks & Regards
> Anupam Bhattacharya
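
For reference, the parsing half of the "parse the XML and construct a SolrInputDocument" suggestion quoted above could look roughly like the sketch below. It uses only the JDK's DOM parser and pulls doc_id, subject and abstract out of a RECORD document into a map. The class name RecordFieldExtractor and the target field names "id", "subject" and "abstract" are made up for illustration and would have to match whatever the Solr schema actually defines. The SolrJ step is left as a comment so the example runs without extra jars; this is not the code from the thread, only one possible shape for it.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: extracts a few elements from the custom RECORD XML.
class RecordFieldExtractor {

    // Returns the trimmed text of the first element with this tag, or "" if absent.
    static String firstText(Document doc, String tag) {
        NodeList nodes = doc.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }

    // Parses the XML and maps element values to (assumed) Solr field names.
    static Map<String, String> extractFields(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Refuse DOCTYPE declarations so external entities cannot be expanded.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));

        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("id", firstText(doc, "doc_id"));        // assumed unique key field
        fields.put("subject", firstText(doc, "subject"));
        fields.put("abstract", firstText(doc, "abstract"));
        return fields;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<RECORD>"
                + "<doc_id>1</doc_id>"
                + "<abstract>This is an abstract.</abstract>"
                + "<subject>Text Subject</subject>"
                + "<availability />"
                + "</RECORD>";

        Map<String, String> fields = extractFields(xml);

        // With SolrJ on the classpath, each entry would be copied into a
        // SolrInputDocument via addField(name, value) and then sent to Solr
        // with the client class of the SolrJ version in use.
        fields.forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

The same extraction logic could in principle sit inside a custom request handler instead of a standalone program; only the way the resulting fields are handed to Solr would differ.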