You could set up a ManifoldCF job to fetch the XML files, and then set up a new SolrOutputConnection pointing at /solr/update/xslt?tr=myStyleSheet.xsl, where myStyleSheet.xsl is the stylesheet to use for that kind of XML. See http://wiki.apache.org/solr/XsltUpdateRequestHandler
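
To give a feel for what myStyleSheet.xsl would have to produce, here is a minimal sketch, written in Python rather than XSLT, of the mapping from one of the custom <RECORD> files to Solr's <add><doc> update format. The function name record_to_solr_add and the Solr field names id, subject and abstract are assumptions for illustration only; the real field names come from your schema.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from <RECORD> child elements to Solr field names.
# The Solr field names are assumptions; use whatever your schema.xml defines.
FIELD_MAP = {"doc_id": "id", "subject": "subject", "abstract": "abstract"}


def record_to_solr_add(record_xml):
    """Convert one custom <RECORD> document into a Solr <add><doc> message."""
    record = ET.fromstring(record_xml)
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for source_tag, solr_field in FIELD_MAP.items():
        for element in record.iter(source_tag):
            text = (element.text or "").strip()
            if text:  # skip empty elements such as <availability />
                field = ET.SubElement(doc, "field", name=solr_field)
                field.text = text
    return ET.tostring(add, encoding="unicode")


# Shortened version of the sample record from the original question.
SAMPLE = """<RECORD>
  <doc_id>1</doc_id>
  <abstract>This is an abstract.</abstract>
  <subject>Text Subject</subject>
  <availability />
</RECORD>"""

if __name__ == "__main__":
    print(record_to_solr_add(SAMPLE))
```

As far as I understand the XSLT handler, the stylesheet's job is the same: turn the incoming XML into this <add><doc> form, so the field mapping above should carry over to myStyleSheet.xsl more or less one to one.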

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 7 March 2012, at 14:04, Erick Erickson wrote:

> Well, I'm ManifoldCF ignorant, so I'll have to defer on this one....
>
> Best
> Erick
>
> On Tue, Mar 6, 2012 at 12:24 PM, Anupam Bhattacharya
> <[email protected]> wrote:
>> Thanks Erick, for the prompt response,
>>
>> Both suggestions would be useful for a one-time indexing activity. Since
>> DIH would be a one-time process of indexing the repository, it is of no
>> use in my case. Writing a standalone Java program using SolrJ would again
>> be a one-time indexing process.
>>
>> I want to write a separate handler which will be called by the ManifoldCF
>> job to create indexes in SOLR. In my case the repository is a Documentum
>> Content Server. I found a relevant link at this URL:
>> https://community.emc.com/docs/DOC-6520 which is quite similar to my
>> requirement.
>>
>> I modified the code to parse the XML and added that into the document
>> properties. This works fine when I test it with my curl program with
>> parameters, but when the same handler is called from a ManifoldCF job,
>> the job gets terminated within a few minutes. I am not sure of the reason
>> for that. The handler is written similarly to /update/extract, which is
>> ExtractingRequestHandler.
>>
>> Is ExtractingRequestHandler capable of extracting tag names and values
>> using some of its defined attributes like capture, captureAttr,
>> extractOnly etc., which can then be added to the document indexes?
>>
>>
>> On Tue, Feb 28, 2012 at 8:26 AM, Erick Erickson
>> <[email protected]> wrote:
>>
>>> You might be able to do something with the XSL Transformer step in DIH.
>>>
>>> It might also be easier to just write a SolrJ program to parse the XML and
>>> construct a SolrInputDocument to send to Solr. It's really pretty
>>> straightforward.
>>>
>>> Best
>>> Erick
>>>
>>> On Sun, Feb 26, 2012 at 11:31 PM, Anupam Bhattacharya
>>> <[email protected]> wrote:
>>>> Hi,
>>>>
>>>> I am using ManifoldCF to crawl data from a Documentum repository. I am
>>>> able to successfully read the metadata/properties for the defined
>>>> document types in Documentum using the out-of-the-box Documentum
>>>> Connector in ManifoldCF. Unfortunately, there is also one XML file
>>>> present which consists of a custom XML structure, which I need to read
>>>> to fetch the element values and add them for indexing in Lucene
>>>> through SOLR.
>>>>
>>>> Is there any mechanism to index any XML structure document in SOLR?
>>>>
>>>> I checked the SOLR CELL framework, which supports the structure below:
>>>>
>>>> <add>
>>>>   <doc>
>>>>     <field name="id">9885A004</field>
>>>>     <field name="name">Canon PowerShot SD500</field>
>>>>     <field name="category">camera</field>
>>>>     <field name="features">3x optical zoom</field>
>>>>     <field name="features">aluminum case</field>
>>>>     <field name="weight">6.4</field>
>>>>     <field name="price">329.95</field>
>>>>   </doc>
>>>>   <doc>
>>>>     <field name="id">9885A003</field>
>>>>     <field name="name">Canon PowerShot SD504</field>
>>>>     <field name="category">camera1</field>
>>>>     <field name="features">3x optical zoom1</field>
>>>>     <field name="features">aluminum case1</field>
>>>>     <field name="weight">6.41</field>
>>>>     <field name="price">329.956</field>
>>>>   </doc>
>>>> </add>
>>>>
>>>> & my custom XML structure is of the following format, from which I need
>>>> to read the *subject* & *abstract* fields for indexing. I checked the
>>>> TIKA project but I couldn't find anything useful.
>>>>
>>>> <?xml version="1.0" encoding="UTF-8"?>
>>>> <RECORD>
>>>>   <doc_id>1</doc_id>
>>>>   <abstract>This is an abstract.</abstract>
>>>>   <subject>Text Subject</subject>
>>>>   <availability />
>>>>   <indexing>
>>>>     <index_group></index_group>
>>>>     <keyterms></keyterms>
>>>>     <keyterms></keyterms>
>>>>   </indexing>
>>>>   <publication_date></publication_date>
>>>>   <physical_storage />
>>>>   <log_entry />
>>>>   <legal_category />
>>>>   <legal_category_notes />
>>>>   <citation_only></citation_only>
>>>>   <citation_only_desc />
>>>>   <export_control />
>>>>   <export_control_desc />
>>>> </RECORD>
>>>>
>>>> Appreciate any help on this.
>>>>
>>>> Regards
>>>> Anupam
>>>
>>
>>
>>
>> --
>> Thanks & Regards
>> Anupam Bhattacharya
