Re: How to Index Custom XML structure

Erick Erickson Wed, 07 Mar 2012 05:05:16 -0800

Well, I'm ManifoldCF ignorant, so I'll have to defer on this one....

Best
Erick


On Tue, Mar 6, 2012 at 12:24 PM, Anupam Bhattacharya
<anupam...@gmail.com> wrote:
> Thanks Erick, for the prompt response,
>
> Both suggestions would be useful for a one-time indexing activity. Since
> DIH would be a one-time process of indexing the repository, it is of no
> use in my case. Writing a standalone Java program using SolrJ would again
> be a one-time indexing process.
>
> I want to write a separate handler that will be called by the ManifoldCF
> job to create indexes in Solr. In my case the repository is a Documentum
> Content Server. I found a relevant link at
> https://community.emc.com/docs/DOC-6520, which is quite similar to my
> requirement.
>
> I modified the code to parse the XML and add the values to the document
> properties. This works fine when I test it with my curl program with
> parameters, but when the same handler is called from the ManifoldCF job,
> the job terminates within a few minutes. I am not sure of the reason for
> that. The handler is written similarly to /update/extract, which is
> ExtractingRequestHandler.
>
> Is ExtractingRequestHandler capable of extracting tag names and values
> using some of its defined parameters, such as capture, captureAttr, and
> extractOnly, so that they can be added to the document index?
>
>
> On Tue, Feb 28, 2012 at 8:26 AM, Erick Erickson
> <erickerick...@gmail.com> wrote:
>
>> You might be able to do something with the XSL Transformer step in DIH.
>>
>> It might also be easier to just write a SolrJ program to parse the XML and
>> construct a SolrInputDocument to send to Solr. It's really pretty
>> straightforward.
>>
>> Best
>> Erick
>>
>> On Sun, Feb 26, 2012 at 11:31 PM, Anupam Bhattacharya
>> <anupam...@gmail.com> wrote:
>> > Hi,
>> >
>> > I am using ManifoldCF to crawl data from a Documentum repository. I am
>> > able to successfully read the metadata/properties for the defined
>> > document types in Documentum using the out-of-the-box Documentum
>> > connector in ManifoldCF. Unfortunately, there is also one XML file with
>> > a custom XML structure, from which I need to read the element values
>> > and add them for indexing in Lucene through Solr.
>> >
>> > Is there any mechanism to index a document with an arbitrary XML
>> > structure in Solr?
>> >
>> > I checked the Solr Cell framework, which supports the structure below:
>> >
>> > <add>
>> >  <doc>
>> >    <field name="id">9885A004</field>
>> >    <field name="name">Canon PowerShot SD500</field>
>> >    <field name="category">camera</field>
>> >    <field name="features">3x optical zoom</field>
>> >    <field name="features">aluminum case</field>
>> >    <field name="weight">6.4</field>
>> >    <field name="price">329.95</field>
>> >  </doc>
>> >  <doc>
>> >    <field name="id">9885A003</field>
>> >    <field name="name">Canon PowerShot SD504</field>
>> >    <field name="category">camera1</field>
>> >    <field name="features">3x optical zoom1</field>
>> >    <field name="features">aluminum case1</field>
>> >    <field name="weight">6.41</field>
>> >    <field name="price">329.956</field>
>> >  </doc>
>> > </add>
>> >
>> > My custom XML structure has the following format, from which I need to
>> > read the *subject* and *abstract* fields for indexing. I checked the
>> > Tika project but couldn't find anything useful.
>> >
>> > <?xml version="1.0" encoding="UTF-8"?>
>> > <RECORD>
>> > <doc_id>1</doc_id>
>> > <abstract>This is an abstract.</abstract>
>> > <subject>Text Subject</subject>
>> > <availability />
>> > <indexing>
>> > <index_group></index_group>
>> > <keyterms></keyterms>
>> > <keyterms></keyterms>
>> > </indexing>
>> > <publication_date></publication_date>
>> > <physical_storage />
>> > <log_entry />
>> > <legal_category />
>> > <legal_category_notes />
>> > <citation_only></citation_only>
>> > <citation_only_desc />
>> > <export_control />
>> > <export_control_desc />
>> > </RECORD>
>> >
>> > Appreciate any help on this.
>> >
>> > Regards
>> > Anupam
>>
>
>
>
> --
> Thanks & Regards
> Anupam Bhattacharya
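
For the SolrJ route suggested earlier in the thread, the parsing step might
look roughly like the sketch below. It uses only the JDK's DOM parser to pull
doc_id, subject and abstract out of a RECORD document like the one quoted
above. The class name RecordFieldExtractor and its method names are
hypothetical, and the assumption that the Solr schema has fields with these
same names is mine, not something stated in the thread. Sending the values to
Solr is left out; with SolrJ on the classpath, each map entry would typically
be copied into a SolrInputDocument via addField before the document is added
through a Solr client.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: extracts a few elements from the custom RECORD XML.
public class RecordFieldExtractor {

    // Element names to pull out. Assumption: the Solr schema uses the same field names.
    private static final String[] TAGS = {"doc_id", "subject", "abstract"};

    // Returns the trimmed text of the first element named `tag`,
    // or null if the element is missing or empty.
    static String firstText(Document doc, String tag) {
        NodeList nodes = doc.getElementsByTagName(tag);
        if (nodes.getLength() == 0) {
            return null;
        }
        String text = nodes.item(0).getTextContent().trim();
        return text.isEmpty() ? null : text;
    }

    // Parses the XML string and returns tag -> value for the non-empty elements in TAGS.
    public static Map<String, String> extractFields(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            Map<String, String> fields = new LinkedHashMap<>();
            for (String tag : TAGS) {
                String value = firstText(doc, tag);
                if (value != null) {
                    fields.put(tag, value);
                }
            }
            return fields;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse record XML", e);
        }
    }

    public static void main(String[] args) {
        // Shortened version of the RECORD sample from the original question.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
            + "<RECORD>"
            + "<doc_id>1</doc_id>"
            + "<abstract>This is an abstract.</abstract>"
            + "<subject>Text Subject</subject>"
            + "<availability />"
            + "</RECORD>";
        // Each entry would become one field of the document sent to Solr.
        System.out.println(extractFields(xml));
    }
}
```

Empty elements such as availability are skipped rather than indexed as blank
values. Repeated elements like keyterms would need a multi-valued variant of
firstText, which this sketch does not attempt.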
