Hello fellow Solr users!

I am trying to index XML documents in Solr. I went for the DataImportHandler approach, which seemed to suit this need perfectly. Because of the large amount of XML to be indexed (~60MB), I thought it would hardly be possible to feed Solr the concatenation of all these documents at once. Hence a small PHP script I wrote, which serves the list of these documents over HTTP in the following form (available from a local URL, which is replicated in data-config.xml):


<?xml version="1.0" encoding="UTF-8"?>
<root>
<entries>
        <entry>
                <realm>old_search_content</realm>
                <source>http://localhost/psc/trunk/ezfiles/extension/psc/doc/xml/10098.xml</source>
        </entry>
        <entry>
                <realm>old_search_content</realm>
                <source>http://localhost/psc/trunk/ezfiles/extension/psc/doc/xml/10099.xml</source>
        </entry>
        <entry>
                <realm>old_search_content</realm>
                <source>http://localhost/psc/trunk/ezfiles/extension/psc/doc/xml/all_in_one.xml</source>
        </entry>  
</entries>
</root>
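
For readers who want to reproduce the listing without the PHP script (which is not shown here), the following is a minimal Python sketch that builds a document of the same shape. The function name build_listing and the default realm value are illustrative assumptions, not part of the original script.

```python
import xml.etree.ElementTree as ET


def build_listing(urls, realm="old_search_content"):
    """Build a <root><entries><entry>...</entry></entries></root> listing.

    Hypothetical helper: it only mirrors the structure of the listing shown
    above and returns it as UTF-8 encoded bytes (requires Python 3.8+ for
    the xml_declaration argument).
    """
    root = ET.Element("root")
    entries = ET.SubElement(root, "entries")
    for url in urls:
        entry = ET.SubElement(entries, "entry")
        ET.SubElement(entry, "realm").text = realm
        ET.SubElement(entry, "source").text = url
    return ET.tostring(root, encoding="UTF-8", xml_declaration=True)


if __name__ == "__main__":
    base = "http://localhost/psc/trunk/ezfiles/extension/psc/doc/xml/"
    print(build_listing([base + "10098.xml", base + "10099.xml"]).decode("utf-8"))
```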


The idea is to have a single data-config.xml configuration file for the DataImportHandler, which reads the listing presented above, requests every sub-item and indexes it. Every sub-item has the following structure:
<?xml version="1.0" encoding="ISO-8859-1" ?>
<root>
        <contenido id="10099" idioma="cat">
                <antetitulo><![CDATA[This is an introduction text]]></antetitulo>
                <titulo><![CDATA[This is a title]]></titulo>
                <resumen><![CDATA[ This is a summary ]]></resumen>
                <texto><![CDATA[This is the body of my article<br><br>]]>
                </texto>
                <autor><![CDATA[John Doe]]></autor>
                <fecha><![CDATA[31/10/2001]]></fecha>
                <fuente><![CDATA[]]></fuente>
                <webexterna><![CDATA[]]></webexterna>
                <recursos></recursos>
                <ambitos></ambitos>
        </contenido>
</root>
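
To make the intended two-level traversal concrete, here is a minimal Python sketch of it: read the listing, follow each source URL, and pull fields out of every contenido element. It describes the behaviour being aimed for, not how the DataImportHandler works internally. The fetch function is a stand-in that serves in-memory samples instead of doing real HTTP requests, and all names in it are illustrative. Note that ElementTree paths are relative to the root element, so "./entries/entry" here plays the role of an absolute path such as /root/entries/entry.

```python
import xml.etree.ElementTree as ET

# In-memory samples with the same shape as the listing and a sub-item.
SAMPLE_URL = "http://localhost/psc/trunk/ezfiles/extension/psc/doc/xml/10099.xml"

LISTING = """<root>
<entries>
    <entry>
        <realm>old_search_content</realm>
        <source>%s</source>
    </entry>
</entries>
</root>""" % SAMPLE_URL

SUBITEMS = {
    SAMPLE_URL: """<root>
    <contenido id="10099" idioma="cat">
        <titulo><![CDATA[This is a title]]></titulo>
        <autor><![CDATA[John Doe]]></autor>
    </contenido>
</root>""",
}


def fetch(url):
    """Stand-in for an HTTP GET: returns the XML text stored for a URL."""
    return SUBITEMS[url]


def collect_documents(listing_xml, fetch_func):
    """Walk the listing, load each sub-item, and return one dict per contenido."""
    docs = []
    listing_root = ET.fromstring(listing_xml)
    for entry in listing_root.findall("./entries/entry"):
        url = entry.findtext("source").strip()
        item_root = ET.fromstring(fetch_func(url))
        for contenido in item_root.findall("./contenido"):
            docs.append({
                "id": contenido.get("id"),
                "titulo": contenido.findtext("titulo"),
                "autor": contenido.findtext("autor"),
            })
    return docs


if __name__ == "__main__":
    for doc in collect_documents(LISTING, fetch):
        print(doc)
```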



After struggling for a (long) while with different configuration scenarios, here is the data-config.xml I ended up with:


<dataConfig>
        <dataSource type="HttpDataSource"/>
        <document>
                <entity name="oldsearchcontentlist"
                                pk="m_guid"
                                url="http://localhost/psc/trunk/ezfiles/list_old_content.php?limit=10&amp;urlsonly=1"
                                processor="XPathEntityProcessor"
                                forEach="/root/entries/entry">
                                
                        <field column="elementurl" xpath="/root/entries/entry/source/" />
                                                
                        <entity name="oldsearchcontent"
                                pk="m_guid"
                                url="${oldsearchcontentlist.elementurl}"
                                processor="XPathEntityProcessor"
                                forEach="/root/contenido"
                                transformer="TemplateTransformer">
                                <field column="m_guid" xpath="/root/contenido/titulo" />
                        </entity>
                </entity>
        </document>
</dataConfig>


As a note, I had to check out Solr's trunk, patch it with the following: https://issues.apache.org/jira/browse/SOLR-469 (https://issues.apache.org/jira/secure/attachment/12380679/SOLR-469.patch), and recompile.
Running the following command:
http://localhost:8983/solr/dataimport?command=full-import&verbose=on&debug=on
tells me that no document was created at all, and it does not throw any error. Here is the full output:


<response>
        <lst name="responseHeader">
                <int name="status">0</int>
                <int name="QTime">39</int>
        </lst>
        <lst name="initArgs">
                <lst name="defaults">
                        <str name="config">data-config.xml</str>
                        <lst name="datasource">
                                <str name="type">HttpDataSource</str>
                        </lst>
                </lst>
        </lst>
        <str name="command">full-import</str>
        <str name="mode">debug</str>
        <null name="documents"/>
                <lst name="verbose-output">
                <lst name="entity:oldsearchcontentlist">
                <lst name="document#1">
                        <str name="query">
                                http://localhost/psc/trunk/ezfiles/list_old_content.php?limit=10&urlsonly=1
                        </str>
                        <str name="time-taken">0:0:0.23</str>
                </lst>
                </lst>
                </lst>
        <str name="status">idle</str>
        <str name="importResponse">Configuration Re-loaded sucessfully</str>
        <lst name="statusMessages">
                <str name="Total Requests made to DataSource">1</str>
                <str name="Total Rows Fetched">0</str>
                <str name="Total Documents Skipped">0</str>
                <str name="Full Dump Started">2008-06-10 14:38:56</str>
                <str name="">
                        Indexing completed. Added/Updated: 0 documents. Deleted 0 documents.
                </str>
                <str name="Committed">2008-06-10 14:38:56</str>
                <str name="Time taken ">0:0:0.32</str>
        </lst>
        <str name="WARNING">
This response format is experimental. It is likely to change in the future.
        </str>
</response>
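
The counters that stand out in this output are "Total Requests made to DataSource" (1) and "Total Rows Fetched" (0): the listing URL was requested once, but no row came out of it. As a convenience when re-running the import, here is a small Python sketch that extracts those counters from a response of this shape. The RESPONSE sample is an abridged copy of the output above, and the function name status_messages is an illustrative assumption.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the dataimport response shown above.
RESPONSE = """<response>
    <str name="status">idle</str>
    <lst name="statusMessages">
        <str name="Total Requests made to DataSource">1</str>
        <str name="Total Rows Fetched">0</str>
        <str name="Total Documents Skipped">0</str>
    </lst>
</response>"""


def status_messages(response_xml):
    """Return the statusMessages entries as a {name: value} dict of strings."""
    root = ET.fromstring(response_xml)
    messages = root.find("./lst[@name='statusMessages']")
    if messages is None:
        return {}
    return {s.get("name"): (s.text or "").strip() for s in messages.findall("str")}


if __name__ == "__main__":
    for name, value in status_messages(RESPONSE).items():
        print(name, "=", value)
```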


I am sure I am doing something wrong, but cannot figure out what. I have read through all the online documentation several times, plus the full examples (Slashdot RSS feed). I would gladly have feedback from anyone who has tried to index HTTP/XML sources and got it to work smoothly.

Thanks a million in advance,

Regards,
Nicolas
--
Nicolas Pastorino
eZ Systems ( Western Europe )  |  http://ez.no



