We are cutting a patch which incorporates all the recent bug fixes, so that you guys do not have to apply patches over patches.
--Noble

On Wed, Jun 11, 2008 at 3:49 PM, Nicolas Pastorino <[EMAIL PROTECTED]> wrote:
> Thanks a million for your time and help.
> It indeed works smoothly now.
>
> I also, by the way, had to apply the "patch" attached to the following message :
> http://www.nabble.com/Re%3A-How-to-describe-2-entities-in-dataConfig-for-the-DataImporter--p17577610.html
> in order to have the TemplateTransformer not throw Null Pointer exceptions :)
>
> Cheers !
> --
> Nicolas Pastorino
>
> On Jun 10, 2008, at 18:05 , Noble Paul നോബിള് नोब्ळ् wrote:
>
>> It is a bug, nice catch.
>> There needs to be a null check there in the method.
>> Can you just try replacing the method with the following?
>>
>> private Node getMatchingChild(XMLStreamReader parser) {
>>   // new guard: a leaf node has no configured children to match
>>   if (childNodes == null) return null;
>>   String localName = parser.getLocalName();
>>   for (Node n : childNodes) {
>>     if (n.name.equals(localName)) {
>>       if (n.attribAndValues == null)
>>         return n;
>>       if (checkForAttributes(parser, n.attribAndValues))
>>         return n;
>>     }
>>   }
>>   return null;
>> }
>>
>> I tried with that code and it is working. We shall add it in the next patch.
>>
>> --Noble
>>
>> On Tue, Jun 10, 2008 at 9:11 PM, Nicolas Pastorino <[EMAIL PROTECTED]> wrote:
>>>
>>> I just forgot to mention the error related to the description below. I get
>>> the following when running a full-import ( sorry for the noise .. ) :
>>>
>>> SEVERE: Full Import failed
>>> java.lang.RuntimeException: java.lang.NullPointerException
>>>         at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:85)
>>>         at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:207)
>>>         at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:161)
>>>         at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:144)
>>>         at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:280)
>>>         at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:302)
>>>         at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:173)
>>>         at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:134)
>>>         at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:323)
>>>         at org.apache.solr.handler.dataimport.DataImporter.rumCmd(DataImporter.java:374)
>>>         at org.apache.solr.handler.dataimport.DataImportHandler.handleRequestBody(DataImportHandler.java:179)
>>>         at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:125)
>>>         at org.apache.solr.core.SolrCore.execute(SolrCore.java:965)
>>>         at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
>>>         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:272)
>>>         at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
>>>         at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
>>>         at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>>>         at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
>>>         at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
>>>         at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
>>>         at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
>>>         at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
>>>         at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
>>>         at org.mortbay.jetty.Server.handle(Server.java:285)
>>>         at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
>>>         at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:821)
>>>         at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:513)
>>>         at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208)
>>>         at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
>>>         at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
>>>         at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
>>> Caused by: java.lang.NullPointerException
>>>         at org.apache.solr.handler.dataimport.XPathRecordReader$Node.getMatchingChild(XPathRecordReader.java:198)
>>>         at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:171)
>>>         at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:174)
>>>         at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:174)
>>>         at org.apache.solr.handler.dataimport.XPathRecordReader$Node.access$000(XPathRecordReader.java:89)
>>>         at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:82)
>>>         ... 31 more
>>>
>>> Regards,
>>> Nicolas Pastorino
>>>
>>> On Jun 10, 2008, at 17:38 , Nicolas Pastorino wrote:
>>>
>>>> Thanks a lot, it works fine now, fetching subelements properly.
>>>> The only issue left is that the XPath syntax passed in the data-config.xml
>>>> does not seem to work properly. As an example, processing the following entity :
>>>>
>>>> <root>
>>>>   <contenido id="10097" idioma="cat">
>>>>     <antetitulo></antetitulo>
>>>>     <titulo>
>>>>       This is my title
>>>>     </titulo>
>>>>     <resumen>
>>>>       This is my summary
>>>>     </resumen>
>>>>     <texto>
>>>>       This is the body of my text
>>>>     </texto>
>>>>   </contenido>
>>>> </root>
>>>>
>>>> and trying to fill a Solr field with the 'id' attribute of the 'contenido'
>>>> tag using the following config :
>>>>
>>>> <field column="m_guid" xpath="/root/contenido/@id" />
>>>>
>>>> does not seem to work.
>>>>
>>>> Thanks a lot for your time already !
>>>>
>>>> Regards,
>>>> Nicolas Pastorino
>>>>
>>>> On Jun 10, 2008, at 14:55 , Noble Paul നോബിള് नोब्ळ् wrote:
>>>>
>>>>> The configuration is fine but for one detail.
>>>>> The documents are to be created for the entity 'oldsearchcontent', not
>>>>> for the root entity. So add an attribute rootEntity="false" to the
>>>>> entity 'oldsearchcontentlist' as follows :
>>>>>
>>>>> <entity name="oldsearchcontentlist"
>>>>>         url="http://localhost/psc/trunk/ezfiles/list_old_content.php?limit=10&urlsonly=1"
>>>>>         processor="XPathEntityProcessor"
>>>>>         forEach="/root/entries/entry"
>>>>>         rootEntity="false">
>>>>>
>>>>> This means that the entity directly under this one ('oldsearchcontent')
>>>>> will be treated as the root, and documents will be created for that.
>>>>>
>>>>> --Noble
>>>>>
>>>>> On Tue, Jun 10, 2008 at 6:15 PM, Nicolas Pastorino <[EMAIL PROTECTED]> wrote:
>>>>>>
>>>>>> Hello fellow Solr users !
>>>>>>
>>>>>> I am in the process of trying to index XML documents in Solr. I went for
>>>>>> the DataImportHandler approach, which seemed to perfectly suit this need.
>>>>>> Due to the large amount of XML documents to be indexed ( ~60MB ), I thought
>>>>>> it would hardly be possible to feed Solr the concatenation of all these
>>>>>> docs at once. Hence the small PHP script I wrote, serving over HTTP the
>>>>>> list of these documents in the following form ( available from a local URL
>>>>>> referenced in data-config.xml ) :
>>>>>>
>>>>>> <?xml version="1.0" encoding="UTF-8"?>
>>>>>> <root>
>>>>>>   <entries>
>>>>>>     <entry>
>>>>>>       <realm>old_search_content</realm>
>>>>>>       <source>http://localhost/psc/trunk/ezfiles/extension/psc/doc/xml/10098.xml</source>
>>>>>>     </entry>
>>>>>>     <entry>
>>>>>>       <realm>old_search_content</realm>
>>>>>>       <source>http://localhost/psc/trunk/ezfiles/extension/psc/doc/xml/10099.xml</source>
>>>>>>     </entry>
>>>>>>     <entry>
>>>>>>       <realm>old_search_content</realm>
>>>>>>       <source>http://localhost/psc/trunk/ezfiles/extension/psc/doc/xml/all_in_one.xml</source>
>>>>>>     </entry>
>>>>>>   </entries>
>>>>>> </root>
>>>>>>
>>>>>> The idea would be to have one single data-config.xml configuration file for
>>>>>> the DataImportHandler, which would read the listing presented above, then
>>>>>> request every single subitem and index it. Every subitem has the following
>>>>>> structure :
>>>>>>
>>>>>> <?xml version="1.0" encoding="ISO-8859-1" ?>
>>>>>> <root>
>>>>>>   <contenido id="10099" idioma="cat">
>>>>>>     <antetitulo><![CDATA[This is an introduction text]]></antetitulo>
>>>>>>     <titulo><![CDATA[This is a title]]></titulo>
>>>>>>     <resumen><![CDATA[ This is a summary ]]></resumen>
>>>>>>     <texto><![CDATA[This is the body of my article<br><br>]]></texto>
>>>>>>     <autor><![CDATA[John Doe]]></autor>
>>>>>>     <fecha><![CDATA[31/10/2001]]></fecha>
>>>>>>     <fuente><![CDATA[]]></fuente>
>>>>>>     <webexterna><![CDATA[]]></webexterna>
>>>>>>     <recursos></recursos>
>>>>>>     <ambitos></ambitos>
>>>>>>   </contenido>
>>>>>> </root>
>>>>>>
>>>>>> After struggling for a ( long ) while with different configuration
>>>>>> scenarios, here is the data-config.xml I ended up with :
>>>>>>
>>>>>> <dataConfig>
>>>>>>   <dataSource type="HttpDataSource"/>
>>>>>>   <document>
>>>>>>     <entity name="oldsearchcontentlist"
>>>>>>             pk="m_guid"
>>>>>>             url="http://localhost/psc/trunk/ezfiles/list_old_content.php?limit=10&urlsonly=1"
>>>>>>             processor="XPathEntityProcessor"
>>>>>>             forEach="/root/entries/entry">
>>>>>>
>>>>>>       <field column="elementurl" xpath="/root/entries/entry/source/" />
>>>>>>
>>>>>>       <entity name="oldsearchcontent"
>>>>>>               pk="m_guid"
>>>>>>               url="${oldsearchcontentlist.elementurl}"
>>>>>>               processor="XPathEntityProcessor"
>>>>>>               forEach="/root/contenido"
>>>>>>               transformer="TemplateTransformer">
>>>>>>         <field column="m_guid" xpath="/root/contenido/titulo" />
>>>>>>       </entity>
>>>>>>     </entity>
>>>>>>   </document>
>>>>>> </dataConfig>
>>>>>>
>>>>>> As a note, I had to check out Solr's trunk, patch it with the following :
>>>>>> https://issues.apache.org/jira/browse/SOLR-469
>>>>>> ( https://issues.apache.org/jira/secure/attachment/12380679/SOLR-469.patch ),
>>>>>> and recompile.
>>>>>> Running the following command :
>>>>>>
>>>>>> http://localhost:8983/solr/dataimport?command=full-import&verbose=on&debug=on
>>>>>>
>>>>>> tells me that no Document was created at all, and does not throw any
>>>>>> error... Here is the full output :
>>>>>>
>>>>>> <response>
>>>>>>   <lst name="responseHeader">
>>>>>>     <int name="status">0</int>
>>>>>>     <int name="QTime">39</int>
>>>>>>   </lst>
>>>>>>   <lst name="initArgs">
>>>>>>     <lst name="defaults">
>>>>>>       <str name="config">data-config.xml</str>
>>>>>>       <lst name="datasource">
>>>>>>         <str name="type">HttpDataSource</str>
>>>>>>       </lst>
>>>>>>     </lst>
>>>>>>   </lst>
>>>>>>   <str name="command">full-import</str>
>>>>>>   <str name="mode">debug</str>
>>>>>>   <null name="documents"/>
>>>>>>   <lst name="verbose-output">
>>>>>>     <lst name="entity:oldsearchcontentlist">
>>>>>>       <lst name="document#1">
>>>>>>         <str name="query">
>>>>>>           http://localhost/psc/trunk/ezfiles/list_old_content.php?limit=10&urlsonly=1
>>>>>>         </str>
>>>>>>         <str name="time-taken">0:0:0.23</str>
>>>>>>       </lst>
>>>>>>     </lst>
>>>>>>   </lst>
>>>>>>   <str name="status">idle</str>
>>>>>>   <str name="importResponse">Configuration Re-loaded sucessfully</str>
>>>>>>   <lst name="statusMessages">
>>>>>>     <str name="Total Requests made to DataSource">1</str>
>>>>>>     <str name="Total Rows Fetched">0</str>
>>>>>>     <str name="Total Documents Skipped">0</str>
>>>>>>     <str name="Full Dump Started">2008-06-10 14:38:56</str>
>>>>>>     <str name="">
>>>>>>       Indexing completed. Added/Updated: 0 documents. Deleted 0 documents.
>>>>>>     </str>
>>>>>>     <str name="Committed">2008-06-10 14:38:56</str>
>>>>>>     <str name="Time taken ">0:0:0.32</str>
>>>>>>   </lst>
>>>>>>   <str name="WARNING">
>>>>>>     This response format is experimental. It is likely to change in the future.
>>>>>>   </str>
>>>>>> </response>
>>>>>>
>>>>>> I am sure I am doing something wrong, but cannot figure out what. I read
>>>>>> through all the online documentation several times, plus the full examples
>>>>>> ( Slashdot RSS feed ).
>>>>>> I would gladly take feedback from anyone who has tried to index HTTP/XML
>>>>>> sources and got it to work smoothly.
>>>>>>
>>>>>> Thanks a million in advance,
>>>>>>
>>>>>> Regards,
>>>>>> Nicolas
>>>>>> --
>>>>>> Nicolas Pastorino
>>>>>> eZ Systems ( Western Europe ) | http://ez.no
>>>>>>
>>>>>
>>>>> --
>>>>> --Noble Paul
>>>>
>>>> --
>>>> Nicolas Pastorino
>>>> Consultant - Trainer - System Developer
>>>> Phone : +33 (0)4.78.37.01.34
>>>> eZ Systems ( Western Europe ) | http://ez.no
>>>
>>> --
>>> Nicolas Pastorino
>>> Consultant - Trainer - System Developer
>>> Phone : +33 (0)4.78.37.01.34
>>> eZ Systems ( Western Europe ) | http://ez.no
>>
>> --
>> --Noble Paul
>
> --
> Nicolas Pastorino
> Consultant - Trainer - System Developer
> Phone : +33 (0)4.78.37.01.34
> eZ Systems ( Western Europe ) | http://ez.no

--
--Noble Paul
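
For readers landing on this thread later, here is a rough sketch of what the complete data-config.xml might look like once the rootEntity="false" suggestion above is applied to the 'oldsearchcontentlist' entity. It is only an illustration assembled from the snippets in this thread, not a tested configuration: the trailing slash has been dropped from the source xpath, the ampersand in the URL is XML-escaped, the TemplateTransformer is omitted because no template is defined here, and the attribute xpath for m_guid (/root/contenido/@id) is the mapping Nicolas was aiming for rather than something confirmed to work with the patched XPathRecordReader.

<dataConfig>
  <dataSource type="HttpDataSource"/>
  <document>
    <!-- The outer entity only enumerates the per-document URLs;
         rootEntity="false" makes the inner entity the one that
         produces Solr documents. -->
    <entity name="oldsearchcontentlist"
            url="http://localhost/psc/trunk/ezfiles/list_old_content.php?limit=10&amp;urlsonly=1"
            processor="XPathEntityProcessor"
            forEach="/root/entries/entry"
            rootEntity="false">
      <field column="elementurl" xpath="/root/entries/entry/source" />

      <!-- One Solr document per <contenido> element fetched from each source URL. -->
      <entity name="oldsearchcontent"
              pk="m_guid"
              url="${oldsearchcontentlist.elementurl}"
              processor="XPathEntityProcessor"
              forEach="/root/contenido">
        <field column="m_guid" xpath="/root/contenido/@id" />
        <field column="titulo" xpath="/root/contenido/titulo" />
      </entity>
    </entity>
  </document>
</dataConfig>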