Re: DataImportHandler : How to mix XPathEntityProcessor and TemplateTransformer

Nicolas Pastorino Wed, 11 Jun 2008 03:20:43 -0700

Thanks a million for your time and help.
It indeed works smoothly now.

I also, by the way, had to apply the "patch" attached to the following message :
http://www.nabble.com/Re%3A-How-to-describe-2-entities-in-dataConfig-for-the-DataImporter--p17577610.html
in order to keep the TemplateTransformer from throwing NullPointerExceptions :)


Cheers !
--
Nicolas Pastorino

On Jun 10, 2008, at 18:05 , Noble Paul നോബിള്‍नोब्ळ् wrote:

It is a bug, nice catch.
There needs to be a null check in the method.
Can you just try replacing the method with the following?

private Node getMatchingChild(XMLStreamReader parser) {
  // A node without children has no list to iterate over
  if (childNodes == null) return null;
  String localName = parser.getLocalName();
  for (Node n : childNodes) {
    if (n.name.equals(localName)) {
      // No attribute constraints on this node: the name match is enough
      if (n.attribAndValues == null)
        return n;
      if (checkForAttributes(parser, n.attribAndValues))
        return n;
    }
  }
  return null;
}
I tried with that code and it is working. We shall add it in the next patch.
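
For readers who want to see the guard in isolation, here is a minimal, self-contained sketch of the same idea. The NullSafeLookup class and its nested Node are hypothetical stand-ins written for this illustration; they are not the XPathRecordReader.Node class from Solr, and the lookup takes a plain String instead of an XMLStreamReader.

```java
import java.util.ArrayList;
import java.util.List;

class NullSafeLookup {
    /** Hypothetical stand-in for a tree node; NOT Solr's XPathRecordReader.Node. */
    static final class Node {
        final String name;
        // Stays null until the first child is added, so a leaf has no list at all
        List<Node> childNodes;

        Node(String name) {
            this.name = name;
        }

        void addChild(Node child) {
            if (childNodes == null) {
                childNodes = new ArrayList<>();
            }
            childNodes.add(child);
        }

        Node getMatchingChild(String localName) {
            // The guard: without it, iterating over a leaf's null list throws
            if (childNodes == null) {
                return null;
            }
            for (Node n : childNodes) {
                if (n.name.equals(localName)) {
                    return n;
                }
            }
            return null;
        }
    }

    public static void main(String[] args) {
        Node root = new Node("root");
        Node leaf = new Node("titulo");
        root.addChild(leaf);
        System.out.println(root.getMatchingChild("titulo") == leaf); // prints true
        System.out.println(leaf.getMatchingChild("anything"));       // prints null
    }
}
```

In this sketch the leaf node is the case that matters: it never received a child, so its list is still null when the lookup runs.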
--Noble
On Tue, Jun 10, 2008 at 9:11 PM, Nicolas Pastorino <[EMAIL PROTECTED]> wrote:
I just forgot to mention the error related to the description below. I get
the following when running a full-import ( sorry for the noise .. ) :

SEVERE: Full Import failed
java.lang.RuntimeException: java.lang.NullPointerException
       at
org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:85)
       at
org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:207)
       at
org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:161)
       at
org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:144)
       at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:280)
       at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:302)
       at
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:173)
       at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:134)
       at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:323)
       at
org.apache.solr.handler.dataimport.DataImporter.rumCmd(DataImporter.java:374)
       at
org.apache.solr.handler.dataimport.DataImportHandler.handleRequestBody(DataImportHandler.java:179)
       at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:125)
       at org.apache.solr.core.SolrCore.execute(SolrCore.java:965)
       at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
       at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:272)
       at
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
       at
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
       at
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
       at
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
       at
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
       at
org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
       at
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
       at
org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
       at
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
       at org.mortbay.jetty.Server.handle(Server.java:285)
       at
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
       at
org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:821)
       at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:513)
       at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208)
       at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
       at
org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
       at
org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
Caused by: java.lang.NullPointerException
       at
org.apache.solr.handler.dataimport.XPathRecordReader$Node.getMatchingChild(XPathRecordReader.java:198)
       at
org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:171)
       at
org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:174)
       at
org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:174)
       at
org.apache.solr.handler.dataimport.XPathRecordReader$Node.access$000(XPathRecordReader.java:89)
       at
org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:82)
       ... 31 more

Regards,
Nicolas Pastorino

On Jun 10, 2008, at 17:38 , Nicolas Pastorino wrote:
Thanks a lot, it works fine now, fetching subelements properly.
The only issue left is that the XPath syntax passed in the data-config.xml
does not seem to work properly. As an example, processing the following
entity :

<root>
       <contenido id="10097" idioma="cat">
       <antetitulo></antetitulo>
       <titulo>
               This is my title
       </titulo>
       <resumen>
               This is my summary
       </resumen>
       <texto>
               This is the body of my text
       </texto>
       </contenido>
</root>
and trying to fill a Solr field with the 'id' attribute of the 'contenido'
tag with the following config :
<field column="m_guid" xpath="/root/contenido/@id" />

does not seem to work properly.
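
As a sanity check on the expression itself, the sketch below evaluates the same path with the JDK's standard javax.xml.xpath API. This is an assumption-laden side test, not the DataImportHandler code path: XPathEntityProcessor uses its own streaming XPathRecordReader, which may support only part of XPath, so a result here says nothing about what the importer accepts. The class name XPathAttributeCheck is made up for the example.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class XPathAttributeCheck {
    /** Evaluates an XPath expression against an XML string with the JDK engine. */
    static String evaluate(String xml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Shortened copy of the sample entity from the message above
        String xml = "<root>"
                + "<contenido id=\"10097\" idioma=\"cat\">"
                + "<titulo>This is my title</titulo>"
                + "</contenido>"
                + "</root>";
        System.out.println(evaluate(xml, "/root/contenido/@id")); // prints 10097
    }
}
```

Since the standard engine resolves the attribute, the expression is valid XPath, which suggests the problem lies in how the importer handles it rather than in the path.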

Thanks a lot for your time already !

Regards,
Nicolas Pastorino
On Jun 10, 2008, at 14:55 , Noble Paul നോബിള്‍नोब्ळ् wrote:
The configuration is fine but for one detail.
The documents are to be created for the entity 'oldsearchcontent', not for
the root entity. So add an attribute rootEntity="false" for the
entity 'oldsearchcontentlist' as follows.

  <entity name="oldsearchcontentlist"
url="http://localhost/psc/trunk/ezfiles/list_old_content.php?limit=10&urlsonly=1"
                              processor="XPathEntityProcessor"
                              forEach="/root/entries/entry"
                              rootEntity="false">

This means that the entity directly under this one ('oldsearchcontent')
will be treated as the root, and documents will be created for it.
--Noble
On Tue, Jun 10, 2008 at 6:15 PM, Nicolas Pastorino <[EMAIL PROTECTED]> wrote:
Hello fellow Solr users !
I am in the process of trying to index XML documents in Solr. I went for
the DataImportHandler approach, which seemed to suit this need perfectly.
Due to the large amount of XML documents to be indexed ( ~60MB ), I thought
it would hardly be possible to feed Solr with the concatenation of all
these docs at once. Hence this small PHP script I wrote, serving over HTTP
the list of these documents in the following form ( available from a local
URL replicated in data-config.xml ) :


<?xml version="1.0" encoding="UTF-8"?>
<root>
<entries>
      <entry>
              <realm>old_search_content</realm>
<source>http://localhost/psc/trunk/ezfiles/extension/psc/doc/xml/10098.xml</source>
      </entry>
      <entry>
              <realm>old_search_content</realm>
<source>http://localhost/psc/trunk/ezfiles/extension/psc/doc/xml/10099.xml</source>
      </entry>
      <entry>
              <realm>old_search_content</realm>
<source>http://localhost/psc/trunk/ezfiles/extension/psc/doc/xml/all_in_one.xml</source>
      </entry>
</entries>
</root>
The idea would be to have one single data-config.xml configuration file
for the DataImportHandler, which would read the listing presented above,
then request every single subitem and index it. Every subitem has the
following structure :
<?xml version="1.0" encoding="ISO-8859-1" ?>
<root>
      <contenido id="10099" idioma="cat">
              <antetitulo><![CDATA[This is an introduction
text]]></antetitulo>
              <titulo><![CDATA[This is a title]]></titulo>
              <resumen><![CDATA[ This a a summary ]]></resumen>
              <texto><![CDATA[This is the body of my article<br><br>]]>
              </texto>
              <autor><![CDATA[John Doe]]></autor>
              <fecha><![CDATA[31/10/2001]]></fecha>
              <fuente><![CDATA[]]></fuente>
              <webexterna><![CDATA[]]></webexterna>
              <recursos></recursos>
              <ambitos></ambitos>
      </contenido>
</root>
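
To make the two-level idea concrete outside Solr, here is a rough sketch of the first step only: reading the listing and collecting the URL of every subitem. It uses the JDK's javax.xml.xpath API on an in-memory copy of the listing; the class name ListingReader is hypothetical, and fetching and indexing each URL is deliberately left out.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ListingReader {
    /** Returns the text of every /root/entries/entry/source element, in document order. */
    static List<String> sourceUrls(String listingXml) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath().evaluate(
                    "/root/entries/entry/source",
                    new InputSource(new StringReader(listingXml)),
                    XPathConstants.NODESET);
            List<String> urls = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                urls.add(nodes.item(i).getTextContent().trim());
            }
            return urls;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Could not read the listing", e);
        }
    }

    public static void main(String[] args) {
        // Shortened copy of the listing shown above
        String listing = "<root><entries>"
                + "<entry><realm>old_search_content</realm>"
                + "<source>http://localhost/psc/trunk/ezfiles/extension/psc/doc/xml/10098.xml</source></entry>"
                + "<entry><realm>old_search_content</realm>"
                + "<source>http://localhost/psc/trunk/ezfiles/extension/psc/doc/xml/10099.xml</source></entry>"
                + "</entries></root>";
        // Each URL would then be requested and parsed as one subitem
        for (String url : sourceUrls(listing)) {
            System.out.println(url);
        }
    }
}
```

In the DataImportHandler setup the outer entity plays the role of sourceUrls, and the inner entity would handle each returned URL.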



After struggling for a ( long ) while with different configuration
scenarios, here is a data-config.xml i ended up with :


<dataConfig>
      <dataSource type="HttpDataSource"/>
      <document>
              <entity name="oldsearchcontentlist"
                              pk="m_guid"
url="http://localhost/psc/trunk/ezfiles/list_old_content.php?limit=10&urlsonly=1"
                              processor="XPathEntityProcessor"
                              forEach="/root/entries/entry">

                      <field column="elementurl"
xpath="/root/entries/entry/source/" />

                      <entity name="oldsearchcontent"
                              pk="m_guid"
url="${oldsearchcontentlist.elementurl}"
                              processor="XPathEntityProcessor"
                              forEach="/root/contenido"
                              transformer="TemplateTransformer">
                              <field column="m_guid"
xpath="/root/contenido/titulo" />
                      </entity>
              </entity>
      </document>
</dataConfig>
As a note, I had to check out Solr's trunk, patch it with the
following : https://issues.apache.org/jira/browse/SOLR-469 (
https://issues.apache.org/jira/secure/attachment/12380679/SOLR-469.patch
), and recompile.
Running the following command :
http://localhost:8983/solr/dataimport?command=full-import&verbose=on&debug=on
tells me that no Document was created at all, and does not throw any
error. Here is the full output :


<response>
      <lst name="responseHeader">
              <int name="status">0</int>
              <int name="QTime">39</int>
      </lst>
      <lst name="initArgs">
              <lst name="defaults">
                      <str name="config">data-config.xml</str>
                      <lst name="datasource">
<str name="type">HttpDataSource</str>
                      </lst>
              </lst>
      </lst>
      <str name="command">full-import</str>
      <str name="mode">debug</str>
      <null name="documents"/>
              <lst name="verbose-output">
              <lst name="entity:oldsearchcontentlist">
              <lst name="document#1">
                      <str name="query">
http://localhost/psc/trunk/ezfiles/list_old_content.php?limit=10&urlsonly=1
                      </str>
                      <str name="time-taken">0:0:0.23</str>
              </lst>
              </lst>
              </lst>
      <str name="status">idle</str>
      <str name="importResponse">Configuration Re-loaded
sucessfully</str>
      <lst name="statusMessages">
<str name="Total Requests made to DataSource">1</str>
              <str name="Total Rows Fetched">0</str>
              <str name="Total Documents Skipped">0</str>
              <str name="Full Dump Started">2008-06-10 14:38:56</str>
              <str name="">
Indexing completed. Added/Updated: 0 documents.
Deleted 0 documents.
              </str>
              <str name="Committed">2008-06-10 14:38:56</str>
              <str name="Time taken ">0:0:0.32</str>
      </lst>
      <str name="WARNING">
This response format is experimental. It is likely to
change in the future.
      </str>
</response>
I am sure I am doing something wrong, but cannot figure out what. I read
through all the online documentation several times, plus the full examples
( Slashdot RSS feed ).
I would gladly have feedback from anyone who tried to index HTTP/XML
sources and got it to work smoothly.

Thanks a million in advance,

Regards,
Nicolas
--
Nicolas Pastorino
eZ Systems ( Western Europe )  |  http://ez.no
--
--Noble Paul
--
Nicolas Pastorino
Consultant - Trainer - System Developer
Phone :  +33 (0)4.78.37.01.34
eZ Systems ( Western Europe )  |  http://ez.no
