We are cutting a patch which incorporates all the recent bug fixes, so that you guys do not have to apply patches over patches.
--Noble

On Wed, Jun 11, 2008 at 3:49 PM, Nicolas Pastorino <[EMAIL PROTECTED]> wrote:
> Thanks a million for your time and help.
> It indeed works smoothly now.
>
> I also, by the way, had to apply the "patch" attached to the following message :
> http://www.nabble.com/Re%3A-How-to-describe-2-entities-in-dataConfig-for-the-DataImporter--p17577610.html
> in order to have the TemplateTransformer not throw Null Pointer exceptions :)
>
> Cheers !
> --
> Nicolas Pastorino
>
> On Jun 10, 2008, at 18:05 , Noble Paul നോബിള് नोब्ळ् wrote:
>
>> It is a bug, nice catch.
>> There needs to be a null check there in the method.
>> Can you just try replacing the method with the following?
>>
>> private Node getMatchingChild(XMLStreamReader parser) {
>>   // new guard: a leaf node has no configured children to match
>>   if (childNodes == null) return null;
>>   String localName = parser.getLocalName();
>>   for (Node n : childNodes) {
>>     if (n.name.equals(localName)) {
>>       if (n.attribAndValues == null)
>>         return n;
>>       if (checkForAttributes(parser, n.attribAndValues))
>>         return n;
>>     }
>>   }
>>   return null;
>> }
>>
>> I tried with that code and it is working. We shall add it in the next patch.
>>
>> --Noble
>>
>> On Tue, Jun 10, 2008 at 9:11 PM, Nicolas Pastorino <[EMAIL PROTECTED]> wrote:
>>>
>>> I just forgot to mention the error related to the description below. I get
>>> the following when running a full-import ( sorry for the noise .. ) :
>>>
>>> SEVERE: Full Import failed
>>> java.lang.RuntimeException: java.lang.NullPointerException
>>>         at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:85)
>>>         at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:207)
>>>         at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:161)
>>>         at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:144)
>>>         at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:280)
>>>         at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:302)
>>>         at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:173)
>>>         at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:134)
>>>         at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:323)
>>>         at org.apache.solr.handler.dataimport.DataImporter.rumCmd(DataImporter.java:374)
>>>         at org.apache.solr.handler.dataimport.DataImportHandler.handleRequestBody(DataImportHandler.java:179)
>>>         at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:125)
>>>         at org.apache.solr.core.SolrCore.execute(SolrCore.java:965)
>>>         at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
>>>         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:272)
>>>         at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
>>>         at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
>>>         at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>>>         at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
>>>         at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
>>>         at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
>>>         at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
>>>         at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
>>>         at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
>>>         at org.mortbay.jetty.Server.handle(Server.java:285)
>>>         at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
>>>         at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:821)
>>>         at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:513)
>>>         at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208)
>>>         at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
>>>         at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
>>>         at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
>>> Caused by: java.lang.NullPointerException
>>>         at org.apache.solr.handler.dataimport.XPathRecordReader$Node.getMatchingChild(XPathRecordReader.java:198)
>>>         at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:171)
>>>         at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:174)
>>>         at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:174)
>>>         at org.apache.solr.handler.dataimport.XPathRecordReader$Node.access$000(XPathRecordReader.java:89)
>>>         at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:82)
>>>         ... 31 more
>>>
>>> Regards,
>>> Nicolas Pastorino
>>>
>>> On Jun 10, 2008, at 17:38 , Nicolas Pastorino wrote:
>>>
>>>> Thanks a lot, it works fine now, fetching subelements properly.
>>>> The only issue left is that the XPath syntax passed in the data-config.xml
>>>> does not seem to work properly. As an example, processing the following entity :
>>>>
>>>> <root>
>>>>   <contenido id="10097" idioma="cat">
>>>>     <antetitulo></antetitulo>
>>>>     <titulo>
>>>>       This is my title
>>>>     </titulo>
>>>>     <resumen>
>>>>       This is my summary
>>>>     </resumen>
>>>>     <texto>
>>>>       This is the body of my text
>>>>     </texto>
>>>>   </contenido>
>>>> </root>
>>>>
>>>> and trying to fill a Solr field with the 'id' attribute of the 'contenido'
>>>> tag using the following config :
>>>>
>>>> <field column="m_guid" xpath="/root/contenido/@id" />
>>>>
>>>> does not seem to work.
>>>>
>>>> Thanks a lot for your time already !
>>>>
>>>> Regards,
>>>> Nicolas Pastorino
>>>>
>>>> On Jun 10, 2008, at 14:55 , Noble Paul നോബിള് नोब्ळ् wrote:
>>>>
>>>>> The configuration is fine but for one detail.
>>>>> The documents are to be created for the entity 'oldsearchcontent', not
>>>>> for the root entity. So add an attribute rootEntity="false" to the
>>>>> entity 'oldsearchcontentlist' as follows :
>>>>>
>>>>> <entity name="oldsearchcontentlist"
>>>>>         url="http://localhost/psc/trunk/ezfiles/list_old_content.php?limit=10&urlsonly=1"
>>>>>         processor="XPathEntityProcessor"
>>>>>         forEach="/root/entries/entry"
>>>>>         rootEntity="false">
>>>>>
>>>>> This means that the entity directly under this one ('oldsearchcontent')
>>>>> will be treated as the root, and documents will be created for that.
>>>>>
>>>>> --Noble
>>>>>
>>>>> On Tue, Jun 10, 2008 at 6:15 PM, Nicolas Pastorino <[EMAIL PROTECTED]> wrote:
>>>>>>
>>>>>> Hello fellow Solr users !
>>>>>>
>>>>>> I am in the process of trying to index XML documents in Solr. I went for
>>>>>> the DataImportHandler approach, which seemed to perfectly suit this need.
>>>>>> Due to the large amount of XML documents to be indexed ( ~60MB ), I thought
>>>>>> it would hardly be possible to feed Solr the concatenation of all these
>>>>>> docs at once. Hence the small PHP script I wrote, serving over HTTP the
>>>>>> list of these documents in the following form ( available from a local URL
>>>>>> referenced in data-config.xml ) :
>>>>>>
>>>>>> <?xml version="1.0" encoding="UTF-8"?>
>>>>>> <root>
>>>>>>   <entries>
>>>>>>     <entry>
>>>>>>       <realm>old_search_content</realm>
>>>>>>       <source>http://localhost/psc/trunk/ezfiles/extension/psc/doc/xml/10098.xml</source>
>>>>>>     </entry>
>>>>>>     <entry>
>>>>>>       <realm>old_search_content</realm>
>>>>>>       <source>http://localhost/psc/trunk/ezfiles/extension/psc/doc/xml/10099.xml</source>
>>>>>>     </entry>
>>>>>>     <entry>
>>>>>>       <realm>old_search_content</realm>
>>>>>>       <source>http://localhost/psc/trunk/ezfiles/extension/psc/doc/xml/all_in_one.xml</source>
>>>>>>     </entry>
>>>>>>   </entries>
>>>>>> </root>
>>>>>>
>>>>>> The idea would be to have one single data-config.xml configuration file for
>>>>>> the DataImportHandler, which would read the listing presented above, then
>>>>>> request every single subitem and index it. Every subitem has the following
>>>>>> structure :
>>>>>>
>>>>>> <?xml version="1.0" encoding="ISO-8859-1" ?>
>>>>>> <root>
>>>>>>   <contenido id="10099" idioma="cat">
>>>>>>     <antetitulo><![CDATA[This is an introduction text]]></antetitulo>
>>>>>>     <titulo><![CDATA[This is a title]]></titulo>
>>>>>>     <resumen><![CDATA[ This is a summary ]]></resumen>
>>>>>>     <texto><![CDATA[This is the body of my article<br><br>]]></texto>
>>>>>>     <autor><![CDATA[John Doe]]></autor>
>>>>>>     <fecha><![CDATA[31/10/2001]]></fecha>
>>>>>>     <fuente><![CDATA[]]></fuente>
>>>>>>     <webexterna><![CDATA[]]></webexterna>
>>>>>>     <recursos></recursos>
>>>>>>     <ambitos></ambitos>
>>>>>>   </contenido>
>>>>>> </root>
>>>>>>
>>>>>> After struggling for a ( long ) while with different configuration
>>>>>> scenarios, here is the data-config.xml I ended up with :
>>>>>>
>>>>>> <dataConfig>
>>>>>>   <dataSource type="HttpDataSource"/>
>>>>>>   <document>
>>>>>>     <entity name="oldsearchcontentlist"
>>>>>>             pk="m_guid"
>>>>>>             url="http://localhost/psc/trunk/ezfiles/list_old_content.php?limit=10&urlsonly=1"
>>>>>>             processor="XPathEntityProcessor"
>>>>>>             forEach="/root/entries/entry">
>>>>>>
>>>>>>       <field column="elementurl" xpath="/root/entries/entry/source/" />
>>>>>>
>>>>>>       <entity name="oldsearchcontent"
>>>>>>               pk="m_guid"
>>>>>>               url="${oldsearchcontentlist.elementurl}"
>>>>>>               processor="XPathEntityProcessor"
>>>>>>               forEach="/root/contenido"
>>>>>>               transformer="TemplateTransformer">
>>>>>>         <field column="m_guid" xpath="/root/contenido/titulo" />
>>>>>>       </entity>
>>>>>>     </entity>
>>>>>>   </document>
>>>>>> </dataConfig>
>>>>>>
>>>>>> As a note, I had to check out Solr's trunk, patch it with the following :
>>>>>> https://issues.apache.org/jira/browse/SOLR-469
>>>>>> ( https://issues.apache.org/jira/secure/attachment/12380679/SOLR-469.patch ),
>>>>>> and recompile.
>>>>>> Running the following command :
>>>>>>
>>>>>> http://localhost:8983/solr/dataimport?command=full-import&verbose=on&debug=on
>>>>>>
>>>>>> tells me that no Document was created at all, and does not throw any
>>>>>> error... Here is the full output :
>>>>>>
>>>>>> <response>
>>>>>>   <lst name="responseHeader">
>>>>>>     <int name="status">0</int>
>>>>>>     <int name="QTime">39</int>
>>>>>>   </lst>
>>>>>>   <lst name="initArgs">
>>>>>>     <lst name="defaults">
>>>>>>       <str name="config">data-config.xml</str>
>>>>>>       <lst name="datasource">
>>>>>>         <str name="type">HttpDataSource</str>
>>>>>>       </lst>
>>>>>>     </lst>
>>>>>>   </lst>
>>>>>>   <str name="command">full-import</str>
>>>>>>   <str name="mode">debug</str>
>>>>>>   <null name="documents"/>
>>>>>>   <lst name="verbose-output">
>>>>>>     <lst name="entity:oldsearchcontentlist">
>>>>>>       <lst name="document#1">
>>>>>>         <str name="query">
>>>>>>           http://localhost/psc/trunk/ezfiles/list_old_content.php?limit=10&urlsonly=1
>>>>>>         </str>
>>>>>>         <str name="time-taken">0:0:0.23</str>
>>>>>>       </lst>
>>>>>>     </lst>
>>>>>>   </lst>
>>>>>>   <str name="status">idle</str>
>>>>>>   <str name="importResponse">Configuration Re-loaded sucessfully</str>
>>>>>>   <lst name="statusMessages">
>>>>>>     <str name="Total Requests made to DataSource">1</str>
>>>>>>     <str name="Total Rows Fetched">0</str>
>>>>>>     <str name="Total Documents Skipped">0</str>
>>>>>>     <str name="Full Dump Started">2008-06-10 14:38:56</str>
>>>>>>     <str name="">
>>>>>>       Indexing completed. Added/Updated: 0 documents. Deleted 0 documents.
>>>>>>     </str>
>>>>>>     <str name="Committed">2008-06-10 14:38:56</str>
>>>>>>     <str name="Time taken ">0:0:0.32</str>
>>>>>>   </lst>
>>>>>>   <str name="WARNING">
>>>>>>     This response format is experimental. It is likely to change in the future.
>>>>>>   </str>
>>>>>> </response>
>>>>>>
>>>>>> I am sure I am doing something wrong, but cannot figure out what. I read
>>>>>> through all the online documentation several times, plus the full examples
>>>>>> ( Slashdot RSS feed ).
>>>>>> I would gladly take feedback from anyone who has tried to index HTTP/XML
>>>>>> sources and got it to work smoothly.
>>>>>>
>>>>>> Thanks a million in advance,
>>>>>>
>>>>>> Regards,
>>>>>> Nicolas
>>>>>> --
>>>>>> Nicolas Pastorino
>>>>>> eZ Systems ( Western Europe ) | http://ez.no
>>>>>>
>>>>>
>>>>> --
>>>>> --Noble Paul
>>>>
>>>> --
>>>> Nicolas Pastorino
>>>> Consultant - Trainer - System Developer
>>>> Phone : +33 (0)4.78.37.01.34
>>>> eZ Systems ( Western Europe ) | http://ez.no
>>>
>>> --
>>> Nicolas Pastorino
>>> Consultant - Trainer - System Developer
>>> Phone : +33 (0)4.78.37.01.34
>>> eZ Systems ( Western Europe ) | http://ez.no
>>
>> --
>> --Noble Paul
>
> --
> Nicolas Pastorino
> Consultant - Trainer - System Developer
> Phone : +33 (0)4.78.37.01.34
> eZ Systems ( Western Europe ) | http://ez.no

--
--Noble Paul
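
For readers landing on this thread later, here is a rough sketch of what the complete data-config.xml might look like once the rootEntity="false" suggestion above is applied to the 'oldsearchcontentlist' entity. It is only an illustration assembled from the snippets in this thread, not a tested configuration: the trailing slash has been dropped from the source xpath, the ampersand in the URL is XML-escaped, the TemplateTransformer is omitted because no template is defined here, and the attribute xpath for m_guid (/root/contenido/@id) is the mapping Nicolas was aiming for rather than something confirmed to work with the patched XPathRecordReader.

<dataConfig>
  <dataSource type="HttpDataSource"/>
  <document>
    <!-- The outer entity only enumerates the per-document URLs;
         rootEntity="false" makes the inner entity the one that
         produces Solr documents. -->
    <entity name="oldsearchcontentlist"
            url="http://localhost/psc/trunk/ezfiles/list_old_content.php?limit=10&amp;urlsonly=1"
            processor="XPathEntityProcessor"
            forEach="/root/entries/entry"
            rootEntity="false">
      <field column="elementurl" xpath="/root/entries/entry/source" />

      <!-- One Solr document per <contenido> element fetched from each source URL. -->
      <entity name="oldsearchcontent"
              pk="m_guid"
              url="${oldsearchcontentlist.elementurl}"
              processor="XPathEntityProcessor"
              forEach="/root/contenido">
        <field column="m_guid" xpath="/root/contenido/@id" />
        <field column="titulo" xpath="/root/contenido/titulo" />
      </entity>
    </entity>
  </document>
</dataConfig>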