It is a bug, nice catch! There needs to be a null check in that method. Can you try replacing the method with the following?
private Node getMatchingChild(XMLStreamReader parser) {
  // The current element may have no mapped child elements at all (e.g. when
  // only one of its attributes is mapped), in which case childNodes is null.
  // Bail out instead of iterating over it.
  if (childNodes == null) return null;
  String localName = parser.getLocalName();
  for (Node n : childNodes) {
    if (n.name.equals(localName)) {
      // No attribute constraints: the name match alone is enough.
      if (n.attribAndValues == null) return n;
      if (checkForAttributes(parser, n.attribAndValues)) return n;
    }
  }
  return null;
}

I tried with that code and it is working. We shall add it in the next patch.
--Noble
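If you want to verify the fix locally, here is a minimal sketch of a reproduction. It assumes the XPathRecordReader API as it stands in the DataImportHandler sources (a public constructor taking the forEach xpath, plus addField(name, xpath, multiValued) and getAllRecords(Reader)); the class name XPathNpeRepro and the sample values are only illustrative:

import java.io.StringReader;
import java.util.List;
import java.util.Map;
import org.apache.solr.handler.dataimport.XPathRecordReader;

public class XPathNpeRepro {
  public static void main(String[] args) {
    // Only the 'id' attribute of <contenido> is mapped, so the internal node
    // for 'contenido' gets attribAndValues set but its childNodes stays null.
    // Hitting the child element <titulo> during parsing then triggered the
    // NullPointerException in getMatchingChild shown in the trace below.
    String xml = "<root><contenido id=\"10097\" idioma=\"cat\">"
               + "<titulo>This is my title</titulo></contenido></root>";
    // Constructor argument and addField signature assumed from the DIH sources.
    XPathRecordReader rr = new XPathRecordReader("/root/contenido");
    rr.addField("m_guid", "/root/contenido/@id", false);
    List<Map<String, Object>> records = rr.getAllRecords(new StringReader(xml));
    // With the null check in place this should print one record containing
    // m_guid=10097 instead of throwing.
    System.out.println(records);
  }
}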
On Tue, Jun 10, 2008 at 9:11 PM, Nicolas Pastorino <[EMAIL PROTECTED]> wrote:
> I just forgot to mention the error related to the description below. I get
> the following when running a full-import ( sorry for the noise .. ):
>
> SEVERE: Full Import failed
> java.lang.RuntimeException: java.lang.NullPointerException
>     at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:85)
>     at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:207)
>     at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:161)
>     at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:144)
>     at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:280)
>     at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:302)
>     at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:173)
>     at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:134)
>     at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:323)
>     at org.apache.solr.handler.dataimport.DataImporter.rumCmd(DataImporter.java:374)
>     at org.apache.solr.handler.dataimport.DataImportHandler.handleRequestBody(DataImportHandler.java:179)
>     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:125)
>     at org.apache.solr.core.SolrCore.execute(SolrCore.java:965)
>     at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:272)
>     at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
>     at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
>     at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>     at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
>     at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
>     at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
>     at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
>     at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
>     at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
>     at org.mortbay.jetty.Server.handle(Server.java:285)
>     at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
>     at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:821)
>     at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:513)
>     at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208)
>     at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
>     at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
>     at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
> Caused by: java.lang.NullPointerException
>     at org.apache.solr.handler.dataimport.XPathRecordReader$Node.getMatchingChild(XPathRecordReader.java:198)
>     at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:171)
>     at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:174)
>     at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:174)
>     at org.apache.solr.handler.dataimport.XPathRecordReader$Node.access$000(XPathRecordReader.java:89)
>     at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:82)
>     ... 31 more
>
> Regards,
> Nicolas Pastorino
>
> On Jun 10, 2008, at 17:38 , Nicolas Pastorino wrote:
>
>> Thanks a lot, it works fine now, fetching subelements properly.
>> The only issue left is that the XPath syntax passed in the data-config.xml
>> does not seem to work properly. As an example, processing the following
>> entity:
>>
>> <root>
>>   <contenido id="10097" idioma="cat">
>>     <antetitulo></antetitulo>
>>     <titulo>
>>       This is my title
>>     </titulo>
>>     <resumen>
>>       This is my summary
>>     </resumen>
>>     <texto>
>>       This is the body of my text
>>     </texto>
>>   </contenido>
>> </root>
>>
>> and trying to fill a Solr field with the 'id' attribute of the 'contenido'
>> tag with the following config:
>>
>> <field column="m_guid" xpath="/root/contenido/@id" />
>>
>> does not seem to work properly.
>>
>> Thanks a lot for your time already!
>>
>> Regards,
>> Nicolas Pastorino
>>
>> On Jun 10, 2008, at 14:55 , Noble Paul നോബിള് नोब्ळ् wrote:
>>
>>> The configuration is fine but for one detail.
>>> The documents are to be created for the entity 'oldsearchcontent', not
>>> for the root entity. So add an attribute rootEntity="false" to the
>>> entity 'oldsearchcontentlist' as follows:
>>>
>>> <entity name="oldsearchcontentlist"
>>>         url="http://localhost/psc/trunk/ezfiles/list_old_content.php?limit=10&urlsonly=1"
>>>         processor="XPathEntityProcessor"
>>>         forEach="/root/entries/entry"
>>>         rootEntity="false">
>>>
>>> This means that the entity directly under this one ('oldsearchcontent')
>>> will be treated as the root, and documents will be created for it.
>>> --Noble
>>>
>>> On Tue, Jun 10, 2008 at 6:15 PM, Nicolas Pastorino <[EMAIL PROTECTED]> wrote:
>>>>
>>>> Hello fellow Solr users!
>>>>
>>>> I am in the process of trying to index XML documents in Solr. I went
>>>> for the DataImportHandler approach, which seemed to perfectly suit this
>>>> need. Due to the large amount of XML documents to be indexed ( ~60MB ),
>>>> I thought it would hardly be possible to feed Solr with the
>>>> concatenation of all these docs at once.
>>>> Hence the small PHP script I wrote, serving over HTTP the list of these
>>>> documents, in the following form ( available from a local URL referenced
>>>> in data-config.xml ):
>>>>
>>>> <?xml version="1.0" encoding="UTF-8"?>
>>>> <root>
>>>>   <entries>
>>>>     <entry>
>>>>       <realm>old_search_content</realm>
>>>>       <source>http://localhost/psc/trunk/ezfiles/extension/psc/doc/xml/10098.xml</source>
>>>>     </entry>
>>>>     <entry>
>>>>       <realm>old_search_content</realm>
>>>>       <source>http://localhost/psc/trunk/ezfiles/extension/psc/doc/xml/10099.xml</source>
>>>>     </entry>
>>>>     <entry>
>>>>       <realm>old_search_content</realm>
>>>>       <source>http://localhost/psc/trunk/ezfiles/extension/psc/doc/xml/all_in_one.xml</source>
>>>>     </entry>
>>>>   </entries>
>>>> </root>
>>>>
>>>> The idea would be to have one single data-config.xml configuration file
>>>> for the DataImportHandler, which would read the listing presented above,
>>>> then request every single subitem and index it. Every subitem has the
>>>> following structure:
>>>>
>>>> <?xml version="1.0" encoding="ISO-8859-1" ?>
>>>> <root>
>>>>   <contenido id="10099" idioma="cat">
>>>>     <antetitulo><![CDATA[This is an introduction text]]></antetitulo>
>>>>     <titulo><![CDATA[This is a title]]></titulo>
>>>>     <resumen><![CDATA[ This is a summary ]]></resumen>
>>>>     <texto><![CDATA[This is the body of my article<br><br>]]></texto>
>>>>     <autor><![CDATA[John Doe]]></autor>
>>>>     <fecha><![CDATA[31/10/2001]]></fecha>
>>>>     <fuente><![CDATA[]]></fuente>
>>>>     <webexterna><![CDATA[]]></webexterna>
>>>>     <recursos></recursos>
>>>>     <ambitos></ambitos>
>>>>   </contenido>
>>>> </root>
>>>>
>>>> After struggling for a ( long ) while with different configuration
>>>> scenarios, here is the data-config.xml I ended up with:
>>>>
>>>> <dataConfig>
>>>>   <dataSource type="HttpDataSource"/>
>>>>   <document>
>>>>     <entity name="oldsearchcontentlist"
>>>>             pk="m_guid"
>>>>             url="http://localhost/psc/trunk/ezfiles/list_old_content.php?limit=10&urlsonly=1"
>>>>             processor="XPathEntityProcessor"
>>>>             forEach="/root/entries/entry">
>>>>       <field column="elementurl" xpath="/root/entries/entry/source/" />
>>>>       <entity name="oldsearchcontent"
>>>>               pk="m_guid"
>>>>               url="${oldsearchcontentlist.elementurl}"
>>>>               processor="XPathEntityProcessor"
>>>>               forEach="/root/contenido"
>>>>               transformer="TemplateTransformer">
>>>>         <field column="m_guid" xpath="/root/contenido/titulo" />
>>>>       </entity>
>>>>     </entity>
>>>>   </document>
>>>> </dataConfig>
>>>>
>>>> As a note, I had to check out Solr's trunk and patch it with the
>>>> following: https://issues.apache.org/jira/browse/SOLR-469
>>>> ( https://issues.apache.org/jira/secure/attachment/12380679/SOLR-469.patch ),
>>>> then recompile.
>>>> Running the following command:
>>>>
>>>> http://localhost:8983/solr/dataimport?command=full-import&verbose=on&debug=on
>>>>
>>>> tells me that no document was created at all, and does not throw any
>>>> error... Here is the full output:
>>>>
>>>> <response>
>>>>   <lst name="responseHeader">
>>>>     <int name="status">0</int>
>>>>     <int name="QTime">39</int>
>>>>   </lst>
>>>>   <lst name="initArgs">
>>>>     <lst name="defaults">
>>>>       <str name="config">data-config.xml</str>
>>>>       <lst name="datasource">
>>>>         <str name="type">HttpDataSource</str>
>>>>       </lst>
>>>>     </lst>
>>>>   </lst>
>>>>   <str name="command">full-import</str>
>>>>   <str name="mode">debug</str>
>>>>   <null name="documents"/>
>>>>   <lst name="verbose-output">
>>>>     <lst name="entity:oldsearchcontentlist">
>>>>       <lst name="document#1">
>>>>         <str name="query">
>>>>           http://localhost/psc/trunk/ezfiles/list_old_content.php?limit=10&urlsonly=1
>>>>         </str>
>>>>         <str name="time-taken">0:0:0.23</str>
>>>>       </lst>
>>>>     </lst>
>>>>   </lst>
>>>>   <str name="status">idle</str>
>>>>   <str name="importResponse">Configuration Re-loaded sucessfully</str>
>>>>   <lst name="statusMessages">
>>>>     <str name="Total Requests made to DataSource">1</str>
>>>>     <str name="Total Rows Fetched">0</str>
>>>>     <str name="Total Documents Skipped">0</str>
>>>>     <str name="Full Dump Started">2008-06-10 14:38:56</str>
>>>>     <str name="">
>>>>       Indexing completed. Added/Updated: 0 documents.
>>>>       Deleted 0 documents.
>>>>     </str>
>>>>     <str name="Committed">2008-06-10 14:38:56</str>
>>>>     <str name="Time taken ">0:0:0.32</str>
>>>>   </lst>
>>>>   <str name="WARNING">
>>>>     This response format is experimental. It is likely to change
>>>>     in the future.
>>>>   </str>
>>>> </response>
>>>>
>>>> I am sure I am doing something wrong, but I cannot figure out what. I
>>>> read through all the online documentation several times, plus the full
>>>> examples ( the Slashdot RSS feed ).
>>>> I would gladly hear from anyone who has tried to index HTTP/XML sources
>>>> and got it to work smoothly.
>>>>
>>>> Thanks a million in advance,
>>>>
>>>> Regards,
>>>> Nicolas
>>>> --
>>>> Nicolas Pastorino
>>>> eZ Systems ( Western Europe ) | http://ez.no
>>>
>>> --
>>> --Noble Paul
>>
>> --
>> Nicolas Pastorino
>> Consultant - Trainer - System Developer
>> Phone : +33 (0)4.78.37.01.34
>> eZ Systems ( Western Europe ) | http://ez.no
>
> --
> Nicolas Pastorino
> Consultant - Trainer - System Developer
> Phone : +33 (0)4.78.37.01.34
> eZ Systems ( Western Europe ) | http://ez.no

--
--Noble Paul