Re: DataImportHandler : How to mix XPathEntityProcessor and TemplateTransformer

Nicolas Pastorino Tue, 10 Jun 2008 08:39:32 -0700

Thanks a lot, it works fine now, fetching subelements properly.

The only issue left is that the XPath syntax passed in data-config.xml does not seem to work properly. As an example, processing the following entity :


<root>
        <contenido id="10097" idioma="cat">
        <antetitulo></antetitulo>
        <titulo>
                This is my title
        </titulo>
        <resumen>
                This is my summary
        </resumen>
        <texto>
                This is the body of my text
        </texto>
        </contenido>
</root>

and trying to fill a Solr field with the 'id' attribute of the 'contenido' tag with the following config :

<field column="m_guid" xpath="/root/contenido/@id" />

does not seem to work properly.
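
For reference, this is roughly where that field sits in the inner entity. It is only a sketch: the entity attributes are the ones from the data-config.xml quoted further down, and the only change is the xpath of the m_guid field.

```xml
<entity name="oldsearchcontent"
        pk="m_guid"
        url="${oldsearchcontentlist.elementurl}"
        processor="XPathEntityProcessor"
        forEach="/root/contenido"
        transformer="TemplateTransformer">
  <!-- Intended result: m_guid holds the id attribute ("10097" for the sample above) -->
  <field column="m_guid" xpath="/root/contenido/@id" />
</entity>
```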

Thanks a lot for your time already !

Regards,
Nicolas Pastorino

On Jun 10, 2008, at 14:55 , Noble Paul നോബിള്‍नोब्ळ् wrote:

The configuration is fine but for one detail.
The documents are to be created for the entity 'oldsearchcontent', not
for the root entity. So add an attribute rootEntity="false" to the
entity 'oldsearchcontentlist' as follows.

   <entity name="oldsearchcontentlist"
           url="http://localhost/psc/trunk/ezfiles/list_old_content.php?limit=10&amp;urlsonly=1"
           processor="XPathEntityProcessor"
           forEach="/root/entries/entry"
           rootEntity="false">

This means that the entity directly under this one
('oldsearchcontent') will be treated as the root, and documents will be
created for that.
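
Applied to the full configuration quoted below, the change would look roughly like this. It is only a sketch: everything except rootEntity="false" is copied from the original data-config.xml, and the `&amp;` in the url is an assumption that the file escapes the ampersand, as well-formed XML requires.

```xml
<dataConfig>
  <dataSource type="HttpDataSource"/>
  <document>
    <!-- Outer entity: only lists the URLs, so no documents are created for it -->
    <entity name="oldsearchcontentlist"
            pk="m_guid"
            url="http://localhost/psc/trunk/ezfiles/list_old_content.php?limit=10&amp;urlsonly=1"
            processor="XPathEntityProcessor"
            forEach="/root/entries/entry"
            rootEntity="false">
      <field column="elementurl" xpath="/root/entries/entry/source/" />

      <!-- Inner entity: treated as the root, one document per /root/contenido -->
      <entity name="oldsearchcontent"
              pk="m_guid"
              url="${oldsearchcontentlist.elementurl}"
              processor="XPathEntityProcessor"
              forEach="/root/contenido"
              transformer="TemplateTransformer">
        <field column="m_guid" xpath="/root/contenido/titulo" />
      </entity>
    </entity>
  </document>
</dataConfig>
```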
--Noble

On Tue, Jun 10, 2008 at 6:15 PM, Nicolas Pastorino <[EMAIL PROTECTED]> wrote:

Hello fellow Solr users !

I am in the process of trying to index XML documents in Solr. I went for the DataImportHandler approach, which seemed to suit this need perfectly. Due to the large amount of XML documents to be indexed ( ~60MB ), I thought it would hardly be possible to feed Solr with the concatenation of all these docs at once. Hence this small PHP script I wrote, serving over HTTP the list of these documents in the following form ( available from a local URL replicated in data-config.xml ) :


<?xml version="1.0" encoding="UTF-8"?>
<root>
<entries>
       <entry>
               <realm>old_search_content</realm>
               <source>http://localhost/psc/trunk/ezfiles/extension/psc/doc/xml/10098.xml</source>
       </entry>
       <entry>
               <realm>old_search_content</realm>
               <source>http://localhost/psc/trunk/ezfiles/extension/psc/doc/xml/10099.xml</source>
       </entry>
       <entry>
               <realm>old_search_content</realm>
               <source>http://localhost/psc/trunk/ezfiles/extension/psc/doc/xml/all_in_one.xml</source>
       </entry>
</entries>
</root>

The idea would be to have one single data-config.xml configuration file for the DataImportHandler, which would read the listing presented above, then request every single subitem and index it. Every subitem has the following structure :

<?xml version="1.0" encoding="ISO-8859-1" ?>
<root>
       <contenido id="10099" idioma="cat">
               <antetitulo><![CDATA[This is an introduction text]]></antetitulo>
               <titulo><![CDATA[This is a title]]></titulo>
               <resumen><![CDATA[ This a a summary ]]></resumen>
               <texto><![CDATA[This is the body of my article<br><br>]]>
               </texto>
               <autor><![CDATA[John Doe]]></autor>
               <fecha><![CDATA[31/10/2001]]></fecha>
               <fuente><![CDATA[]]></fuente>
               <webexterna><![CDATA[]]></webexterna>
               <recursos></recursos>
               <ambitos></ambitos>
       </contenido>
</root>



After struggling for a ( long ) while with different configuration
scenarios, here is the data-config.xml I ended up with :


<dataConfig>
       <dataSource type="HttpDataSource"/>
       <document>
               <entity name="oldsearchcontentlist"
                               pk="m_guid"
                               url="http://localhost/psc/trunk/ezfiles/list_old_content.php?limit=10&amp;urlsonly=1"
                               processor="XPathEntityProcessor"
                               forEach="/root/entries/entry">

                       <field column="elementurl" xpath="/root/entries/entry/source/" />

                       <entity name="oldsearchcontent"
                               pk="m_guid"
                               url="${oldsearchcontentlist.elementurl}"
                               processor="XPathEntityProcessor"
                               forEach="/root/contenido"
                               transformer="TemplateTransformer">
                               <field column="m_guid" xpath="/root/contenido/titulo" />
                       </entity>
               </entity>
       </document>
</dataConfig>


As a note, I had to check out Solr's trunk, patch it with SOLR-469
( https://issues.apache.org/jira/browse/SOLR-469 ,
https://issues.apache.org/jira/secure/attachment/12380679/SOLR-469.patch ),
and recompile.
Running the following command :

http://localhost:8983/solr/dataimport?command=full-import&verbose=on&debug=on

tells me that no document was created at all, and does not throw any
error. Here is the full output :


<response>
       <lst name="responseHeader">
               <int name="status">0</int>
               <int name="QTime">39</int>
       </lst>
       <lst name="initArgs">
               <lst name="defaults">
                       <str name="config">data-config.xml</str>
                       <lst name="datasource">
                               <str name="type">HttpDataSource</str>
                       </lst>
               </lst>
       </lst>
       <str name="command">full-import</str>
       <str name="mode">debug</str>
       <null name="documents"/>
       <lst name="verbose-output">
               <lst name="entity:oldsearchcontentlist">
                       <lst name="document#1">
                               <str name="query">
                                       http://localhost/psc/trunk/ezfiles/list_old_content.php?limit=10&urlsonly=1
                               </str>
                               <str name="time-taken">0:0:0.23</str>
                       </lst>
               </lst>
       </lst>
       <str name="status">idle</str>
       <str name="importResponse">Configuration Re-loaded sucessfully</str>
       <lst name="statusMessages">
               <str name="Total Requests made to DataSource">1</str>
               <str name="Total Rows Fetched">0</str>
               <str name="Total Documents Skipped">0</str>
               <str name="Full Dump Started">2008-06-10 14:38:56</str>
               <str name="">
                       Indexing completed. Added/Updated: 0 documents. Deleted 0 documents.
               </str>
               <str name="Committed">2008-06-10 14:38:56</str>
               <str name="Time taken ">0:0:0.32</str>
       </lst>
       <str name="WARNING">
               This response format is experimental. It is likely to change in the future.
       </str>
</response>

I am sure I am doing something wrong, but cannot figure out what. I read
through all the online documentation several times, plus the full examples
( Slashdot RSS feed ).
I would gladly have feedback from anyone who tried to index HTTP/XML
sources, and got it to work smoothly.

Thanks a million in advance,

Regards,
Nicolas
--
Nicolas Pastorino
eZ Systems ( Western Europe )  |  http://ez.no




--
--Noble Paul


--
Nicolas Pastorino
Consultant - Trainer - System Developer
Phone :  +33 (0)4.78.37.01.34
eZ Systems ( Western Europe )  |  http://ez.no
