Re: How to Index Custom XML structure

Jan Høydahl Fri, 09 Mar 2012 16:18:47 -0800

You could set up a ManifoldCF job to fetch the XMLs and then set up a new 
SolrOutputConnection for /solr/update/xslt?tr=myStyleSheet.xsl, where 
myStyleSheet.xsl is the stylesheet to use for that kind of XML. See 
http://wiki.apache.org/solr/XsltUpdateRequestHandler
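
To try out such a stylesheet before wiring it into Solr, something like the
sketch below may help. It applies a hypothetical stylesheet to a RECORD
document using only the JDK's javax.xml.transform classes and prints the
resulting <add><doc> XML. The Solr field names (id, subject, abstract) and the
class name are assumptions for illustration; they would have to match your
schema. As far as I know, Solr expects the stylesheet file itself under
conf/xslt/ of the core, so only the stylesheet text would be deployed there.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class RecordToSolrXml {

    // Hypothetical stylesheet mapping <RECORD> onto Solr's <add><doc> format.
    // The target field names are assumptions, not taken from a real schema.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' indent='no'/>"
      + "<xsl:template match='/RECORD'>"
      + "<add><doc>"
      + "<field name='id'><xsl:value-of select='doc_id'/></field>"
      + "<field name='subject'><xsl:value-of select='subject'/></field>"
      + "<field name='abstract'><xsl:value-of select='abstract'/></field>"
      + "</doc></add>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the stylesheet to one RECORD document and returns the Solr XML.
    static String toSolrAdd(String recordXml) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        // Leave out the <?xml ...?> declaration so the output is just <add>...</add>.
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(
            new StreamSource(new StringReader(recordXml)),
            new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        // Trimmed-down version of the custom XML from the original question.
        String record =
            "<RECORD><doc_id>1</doc_id>"
          + "<abstract>This is an abstract.</abstract>"
          + "<subject>Text Subject</subject></RECORD>";
        System.out.println(toSolrAdd(record));
    }
}
```

Elements that the stylesheet does not select (availability, indexing, and so
on) are simply dropped, so only the fields you actually want end up in the
index.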


--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 7 March 2012, at 14:04, Erick Erickson wrote:

> Well, I'm ManifoldCF ignorant, so I'll have to defer on this one....
> 
> Best
> Erick
> 
> On Tue, Mar 6, 2012 at 12:24 PM, Anupam Bhattacharya
> <[email protected]> wrote:
>> Thanks Erick, for the prompt response,
>> 
>> Both suggestions would be useful for a one-time indexing activity. Since
>> DIH would be a one-time process of indexing the repository, it is of no
>> use in my case. Writing a standalone Java program using SolrJ would again
>> be a one-time indexing process.
>> 
>> I want to write a separate handler that will be called by the ManifoldCF job
>> to create indexes in Solr. In my case the repository is a Documentum Content
>> Server. I found a relevant document at
>> https://community.emc.com/docs/DOC-6520 which is quite similar to my
>> requirement.
>> 
>> I modified the code to parse the XML and add the values to the document
>> properties. This works fine when I test it with a curl command and
>> parameters, but when the same handler is called from the ManifoldCF job,
>> the job terminates within a few minutes. I am not sure of the reason. The
>> handler is written along the lines of /update/extract, which is
>> ExtractingRequestHandler.
>> 
>> Is ExtractingRequestHandler capable of extracting tag names and values using
>> some of its parameters, such as capture, captureAttr, extractOnly, etc.,
>> so that they can be added to the indexed document?
>> 
>> 
>> On Tue, Feb 28, 2012 at 8:26 AM, Erick Erickson
>> <[email protected]> wrote:
>> 
>>> You might be able to do something with the XSL Transformer step in DIH.
>>> 
>>> It might also be easier to just write a SolrJ program to parse the XML and
>>> construct a SolrInputDocument to send to Solr. It's really pretty
>>> straightforward.
>>> 
>>> Best
>>> Erick
>>> 
>>> On Sun, Feb 26, 2012 at 11:31 PM, Anupam Bhattacharya
>>> <[email protected]> wrote:
>>>> Hi,
>>>> 
>>>> I am using ManifoldCF to crawl data from a Documentum repository. I am able
>>>> to successfully read the metadata/properties for the defined document types
>>>> in Documentum using the out-of-the-box Documentum connector in ManifoldCF.
>>>> Unfortunately, there is also an XML file with a custom XML structure, from
>>>> which I need to read the element values and add them for indexing in
>>>> Lucene through Solr.
>>>> 
>>>> Is there any mechanism to index a document with an arbitrary XML structure
>>>> in Solr?
>>>> 
>>>> I checked the Solr Cell framework, which supports the structure below:
>>>> 
>>>> <add>
>>>>  <doc>
>>>>    <field name="id">9885A004</field>
>>>>    <field name="name">Canon PowerShot SD500</field>
>>>>    <field name="category">camera</field>
>>>>    <field name="features">3x optical zoom</field>
>>>>    <field name="features">aluminum case</field>
>>>>    <field name="weight">6.4</field>
>>>>    <field name="price">329.95</field>
>>>>  </doc>
>>>>  <doc>
>>>>    <field name="id">9885A003</field>
>>>>    <field name="name">Canon PowerShot SD504</field>
>>>>    <field name="category">camera1</field>
>>>>    <field name="features">3x optical zoom1</field>
>>>>    <field name="features">aluminum case1</field>
>>>>    <field name="weight">6.41</field>
>>>>    <field name="price">329.956</field>
>>>>  </doc>
>>>> </add>
>>>> 
>>>> My custom XML structure has the following format, from which I need to
>>>> read the *subject* and *abstract* fields for indexing. I checked the Tika
>>>> project but couldn't find anything useful.
>>>> 
>>>> <?xml version="1.0" encoding="UTF-8"?>
>>>> <RECORD>
>>>> <doc_id>1</doc_id>
>>>> <abstract>This is an abstract.</abstract>
>>>> <subject>Text Subject</subject>
>>>> <availability />
>>>> <indexing>
>>>> <index_group></index_group>
>>>> <keyterms></keyterms>
>>>> <keyterms></keyterms>
>>>> </indexing>
>>>> <publication_date></publication_date>
>>>> <physical_storage />
>>>> <log_entry />
>>>> <legal_category />
>>>> <legal_category_notes />
>>>> <citation_only></citation_only>
>>>> <citation_only_desc />
>>>> <export_control />
>>>> <export_control_desc />
>>>> </RECORD>
>>>> 
>>>> Appreciate any help on this.
>>>> 
>>>> Regards
>>>> Anupam
>>> 
>> 
>> 
>> 
>> --
>> Thanks & Regards
>> Anupam Bhattacharya
