Hello fellow Solr users!
I am trying to index XML documents in Solr. I went for the
DataImportHandler approach, which seemed to suit this need perfectly.
Because of the large amount of XML to be indexed (~60MB), I thought it
would hardly be possible to feed Solr the concatenation of all these
documents at once. Hence this small PHP script I wrote, which serves the
list of these documents over HTTP in the following form (available from
a local URL that is also referenced in data-config.xml):
<?xml version="1.0" encoding="UTF-8"?>
<root>
  <entries>
    <entry>
      <realm>old_search_content</realm>
      <source>http://localhost/psc/trunk/ezfiles/extension/psc/doc/xml/10098.xml</source>
    </entry>
    <entry>
      <realm>old_search_content</realm>
      <source>http://localhost/psc/trunk/ezfiles/extension/psc/doc/xml/10099.xml</source>
    </entry>
    <entry>
      <realm>old_search_content</realm>
      <source>http://localhost/psc/trunk/ezfiles/extension/psc/doc/xml/all_in_one.xml</source>
    </entry>
  </entries>
</root>
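For illustration only, a listing of this shape could be produced by something like the following Python sketch. This is not the PHP script mentioned above; the function name `build_listing` and its parameters are hypothetical, and only the element names (root, entries, entry, realm, source) come from the listing itself.

```python
import xml.etree.ElementTree as ET


def build_listing(urls, realm="old_search_content"):
    """Return a root/entries/entry listing for the given document URLs."""
    root = ET.Element("root")
    entries = ET.SubElement(root, "entries")
    for url in urls:
        entry = ET.SubElement(entries, "entry")
        ET.SubElement(entry, "realm").text = realm
        # Each URL goes into <source> on a single line, with no stray whitespace.
        ET.SubElement(entry, "source").text = url
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    base = "http://localhost/psc/trunk/ezfiles/extension/psc/doc/xml/"
    print(build_listing([base + "10098.xml", base + "10099.xml"]))
```

Building the tree with ElementTree rather than string concatenation means any special characters in the URLs are escaped automatically.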
The idea is to have a single data-config.xml configuration file for the
DataImportHandler, which would read the listing presented above, request
every sub-item, and index it. Every sub-item has the following
structure:
<?xml version="1.0" encoding="ISO-8859-1" ?>
<root>
  <contenido id="10099" idioma="cat">
    <antetitulo><![CDATA[This is an introduction text]]></antetitulo>
    <titulo><![CDATA[This is a title]]></titulo>
    <resumen><![CDATA[ This is a summary ]]></resumen>
    <texto><![CDATA[This is the body of my article<br><br>]]></texto>
    <autor><![CDATA[John Doe]]></autor>
    <fecha><![CDATA[31/10/2001]]></fecha>
    <fuente><![CDATA[]]></fuente>
    <webexterna><![CDATA[]]></webexterna>
    <recursos></recursos>
    <ambitos></ambitos>
  </contenido>
</root>
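To make the intended two-step traversal concrete (read the listing, then request each sub-item and pull fields out of it), here is a minimal Python sketch. It says nothing about how the DataImportHandler works internally. The names `iter_documents`, `http_fetch` and the `fetch` parameter are hypothetical. Using the title as the identifier is only an assumption made for this example.

```python
import urllib.request
import xml.etree.ElementTree as ET


def http_fetch(url):
    """Hypothetical helper: return the raw bytes served at `url`."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def iter_documents(listing_xml, fetch=http_fetch):
    """Yield one dict per <contenido> in every document the listing points to.

    `fetch` is any callable mapping a URL to the XML bytes found there.
    """
    listing = ET.fromstring(listing_xml)
    for entry in listing.findall("./entries/entry"):
        # Drop whitespace that line wrapping may have left inside <source>.
        url = "".join((entry.findtext("source") or "").split())
        sub_item = ET.fromstring(fetch(url))
        for contenido in sub_item.findall("./contenido"):
            yield {
                "elementurl": url,
                # Assumption for this sketch: the title serves as identifier.
                "m_guid": (contenido.findtext("titulo") or "").strip(),
                "id": contenido.get("id"),
            }
```

Passing the raw bytes to `ET.fromstring` lets the parser honour the encoding declared in each sub-item (ISO-8859-1 in the example above). Running `list(iter_documents(http_fetch(listing_url)))` against the local listing URL would show whether the listing and the sub-items parse as expected outside of Solr.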
After struggling for a (long) while with different configuration
scenarios, here is the data-config.xml I ended up with:
<dataConfig>
  <dataSource type="HttpDataSource"/>
  <document>
    <entity name="oldsearchcontentlist"
            pk="m_guid"
            url="http://localhost/psc/trunk/ezfiles/list_old_content.php?limit=10&amp;urlsonly=1"
            processor="XPathEntityProcessor"
            forEach="/root/entries/entry">
      <field column="elementurl" xpath="/root/entries/entry/source/" />
      <entity name="oldsearchcontent"
              pk="m_guid"
              url="${oldsearchcontentlist.elementurl}"
              processor="XPathEntityProcessor"
              forEach="/root/contenido"
              transformer="TemplateTransformer">
        <field column="m_guid" xpath="/root/contenido/titulo" />
      </entity>
    </entity>
  </document>
</dataConfig>
As a note, I had to check out Solr's trunk, patch it with the following:
https://issues.apache.org/jira/browse/SOLR-469
(https://issues.apache.org/jira/secure/attachment/12380679/SOLR-469.patch),
and recompile.
Running the following command:
http://localhost:8983/solr/dataimport?command=full-import&verbose=on&debug=on
tells me that no document was created at all, and does not throw any
error. Here is the full output:
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">39</int>
  </lst>
  <lst name="initArgs">
    <lst name="defaults">
      <str name="config">data-config.xml</str>
      <lst name="datasource">
        <str name="type">HttpDataSource</str>
      </lst>
    </lst>
  </lst>
  <str name="command">full-import</str>
  <str name="mode">debug</str>
  <null name="documents"/>
  <lst name="verbose-output">
    <lst name="entity:oldsearchcontentlist">
      <lst name="document#1">
        <str name="query">http://localhost/psc/trunk/ezfiles/list_old_content.php?limit=10&amp;urlsonly=1</str>
        <str name="time-taken">0:0:0.23</str>
      </lst>
    </lst>
  </lst>
  <str name="status">idle</str>
  <str name="importResponse">Configuration Re-loaded sucessfully</str>
  <lst name="statusMessages">
    <str name="Total Requests made to DataSource">1</str>
    <str name="Total Rows Fetched">0</str>
    <str name="Total Documents Skipped">0</str>
    <str name="Full Dump Started">2008-06-10 14:38:56</str>
    <str name="">Indexing completed. Added/Updated: 0 documents. Deleted 0 documents.</str>
    <str name="Committed">2008-06-10 14:38:56</str>
    <str name="Time taken ">0:0:0.32</str>
  </lst>
  <str name="WARNING">This response format is experimental. It is likely to change in the future.</str>
</response>
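When repeating the import with different configurations, the counters in the statusMessages block are the quickest thing to compare. The following Python sketch reads them from a saved response. The function names are hypothetical, and it assumes the response layout shown above, which the handler itself labels experimental.

```python
import xml.etree.ElementTree as ET


def status_messages(response_xml):
    """Return the statusMessages entries of a dataimport response as a dict."""
    root = ET.fromstring(response_xml)
    block = root.find("./lst[@name='statusMessages']")
    if block is None:
        return {}
    return {
        # Normalise names and values: the raw output contains stray whitespace.
        (item.get("name") or "").strip(): " ".join((item.text or "").split())
        for item in block.findall("str")
    }


def rows_fetched(response_xml):
    """Return the reported 'Total Rows Fetched' count, or None if absent."""
    value = status_messages(response_xml).get("Total Rows Fetched")
    return int(value) if value and value.isdigit() else None
```

For the output above, `rows_fetched` would report 0 even though one request was made to the data source, which is the symptom described here.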
I am sure I am doing something wrong, but cannot figure out what. I read
through all the online documentation several times, plus the full
examples (Slashdot RSS feed).
I would be glad to get feedback from anyone who has tried to index
HTTP/XML sources and got it to work smoothly.
Thanks a million in advance,
Regards,
Nicolas
--
Nicolas Pastorino
eZ Systems ( Western Europe ) | http://ez.no