Thanks a million for your time and help. It indeed works smoothly now.
I also, by the way, had to apply the "patch" attached to the following message: http://www.nabble.com/Re%3A-How-to-describe-2-entities-in-dataConfig-for-the-DataImporter--p17577610.html in order to stop the TemplateTransformer from throwing NullPointerExceptions :)
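For reference, a complete data-config.xml combining the pieces discussed further down in this thread (the rootEntity="false" attribute on the listing entity, and the 'id' attribute of <contenido> mapped with an attribute XPath) would look roughly like the sketch below. It is assembled from the quoted snippets only and is not verified against the final setup; the "m_title" column name is just an example, not a known schema field:

<dataConfig>
  <dataSource type="HttpDataSource"/>
  <document>
    <entity name="oldsearchcontentlist"
            pk="m_guid"
            url="http://localhost/psc/trunk/ezfiles/list_old_content.php?limit=10&amp;urlsonly=1"
            processor="XPathEntityProcessor"
            forEach="/root/entries/entry"
            rootEntity="false">
      <field column="elementurl" xpath="/root/entries/entry/source"/>
      <entity name="oldsearchcontent"
              pk="m_guid"
              url="${oldsearchcontentlist.elementurl}"
              processor="XPathEntityProcessor"
              forEach="/root/contenido">
        <!-- the document id comes from the 'id' attribute of <contenido> -->
        <field column="m_guid" xpath="/root/contenido/@id"/>
        <!-- further columns map the same way; "m_title" is only an example name -->
        <field column="m_title" xpath="/root/contenido/titulo"/>
      </entity>
    </entity>
  </document>
</dataConfig>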
Cheers !
--
Nicolas Pastorino

On Jun 10, 2008, at 18:05, Noble Paul നോബിള് नोब्ळ् wrote:
It is a bug, nice catch. There needs to be a null check in that method. Can you just try replacing the method with the following?

private Node getMatchingChild(XMLStreamReader parser) {
  if (childNodes == null) return null;
  String localName = parser.getLocalName();
  for (Node n : childNodes) {
    if (n.name.equals(localName)) {
      if (n.attribAndValues == null) return n;
      if (checkForAttributes(parser, n.attribAndValues)) return n;
    }
  }
  return null;
}

I tried with that code and it is working. We shall add it in the next patch.
--Noble
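For anyone else hitting the NullPointerException quoted below: here is a minimal stand-alone illustration of the failure mode. It is plain Java, not the actual Solr classes (the class and helper names are made up for the example); it only shows how a node whose childNodes list was never initialised produces this kind of NullPointerException, and how the added guard avoids it.

import java.util.ArrayList;
import java.util.List;

public class NullChildNodesDemo {

    static class Node {
        final String name;
        List<Node> childNodes;                 // stays null until a child is added

        Node(String name) { this.name = name; }

        void addChild(Node child) {
            if (childNodes == null) childNodes = new ArrayList<Node>();
            childNodes.add(child);
        }

        // Unguarded lookup: iterating a null list throws NullPointerException.
        Node getMatchingChildUnguarded(String localName) {
            for (Node n : childNodes) {
                if (n.name.equals(localName)) return n;
            }
            return null;
        }

        // Guarded lookup, mirroring the suggested fix above.
        Node getMatchingChild(String localName) {
            if (childNodes == null) return null;   // the added null check
            for (Node n : childNodes) {
                if (n.name.equals(localName)) return n;
            }
            return null;
        }
    }

    public static void main(String[] args) {
        Node leaf = new Node("titulo");            // a leaf node with no children registered
        System.out.println(leaf.getMatchingChild("resumen"));     // prints: null
        leaf.getMatchingChildUnguarded("resumen");                // throws NullPointerException
    }
}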
On Tue, Jun 10, 2008 at 9:11 PM, Nicolas Pastorino <[EMAIL PROTECTED]> wrote:

I just forgot to mention the error related to the description below. I get the following when running a full-import (sorry for the noise..):

SEVERE: Full Import failed
java.lang.RuntimeException: java.lang.NullPointerException
    at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:85)
    at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:207)
    at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:161)
    at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:144)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:280)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:302)
    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:173)
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:134)
    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:323)
    at org.apache.solr.handler.dataimport.DataImporter.rumCmd(DataImporter.java:374)
    at org.apache.solr.handler.dataimport.DataImportHandler.handleRequestBody(DataImportHandler.java:179)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:125)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:965)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:272)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
    at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
    at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
    at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
    at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
    at org.mortbay.jetty.Server.handle(Server.java:285)
    at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
    at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:821)
    at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:513)
    at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208)
    at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
    at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
    at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
Caused by: java.lang.NullPointerException
    at org.apache.solr.handler.dataimport.XPathRecordReader$Node.getMatchingChild(XPathRecordReader.java:198)
    at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:171)
    at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:174)
    at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:174)
    at org.apache.solr.handler.dataimport.XPathRecordReader$Node.access$000(XPathRecordReader.java:89)
    at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:82)
    ... 31 more

Regards,
Nicolas Pastorino

On Jun 10, 2008, at 17:38, Nicolas Pastorino wrote:

Thanks a lot, it works fine now, fetching subelements properly. The only issue left is that the XPath syntax passed in the data-config.xml does not seem to work properly. As an example, processing the following entity:

<root>
  <contenido id="10097" idioma="cat">
    <antetitulo></antetitulo>
    <titulo> This is my title </titulo>
    <resumen> This is my summary </resumen>
    <texto> This is the body of my text </texto>
  </contenido>
</root>

and trying to fill a Solr field with the 'id' attribute of the 'contenido' tag with the following config:

<field column="m_guid" xpath="/root/contenido/@id" />

does not seem to work properly.

Thanks a lot for your time already !
Regards,
Nicolas Pastorino

On Jun 10, 2008, at 14:55, Noble Paul നോബിള് नोब्ळ् wrote:

The configuration is fine but for one detail. The documents are to be created for the entity 'oldsearchcontent', not for the root entity. So add an attribute rootEntity="false" to the entity 'oldsearchcontentlist' as follows:

<entity name="oldsearchcontentlist"
        url="http://localhost/psc/trunk/ezfiles/list_old_content.php?limit=10&urlsonly=1"
        processor="XPathEntityProcessor"
        forEach="/root/entries/entry"
        rootEntity="false">

This means that the entity directly under this ('oldsearchcontent') will be treated as the root, and documents will be created for that.
--Noble

On Tue, Jun 10, 2008 at 6:15 PM, Nicolas Pastorino <[EMAIL PROTECTED]> wrote:

Hello fellow Solr users !

I am in the process of trying to index XML documents in Solr. I went for the DataImportHandler approach, which seemed to perfectly suit this need. Due to the large number of XML documents to be indexed (~60MB), I thought it would hardly be possible to feed Solr with the concatenation of all these docs at once. Hence this small PHP script I wrote, serving over HTTP the list of these documents, in the following form (available from a local URL referenced in data-config.xml):

<?xml version="1.0" encoding="UTF-8"?>
<root>
  <entries>
    <entry>
      <realm>old_search_content</realm>
      <source>http://localhost/psc/trunk/ezfiles/extension/psc/doc/xml/10098.xml</source>
    </entry>
    <entry>
      <realm>old_search_content</realm>
      <source>http://localhost/psc/trunk/ezfiles/extension/psc/doc/xml/10099.xml</source>
    </entry>
    <entry>
      <realm>old_search_content</realm>
      <source>http://localhost/psc/trunk/ezfiles/extension/psc/doc/xml/all_in_one.xml</source>
    </entry>
  </entries>
</root>

The idea would be to have one single data-config.xml configuration file for the DataImportHandler, which would read the listing presented above, and request every single subitem and index it.
Every subitem has the following structure:

<?xml version="1.0" encoding="ISO-8859-1" ?>
<root>
  <contenido id="10099" idioma="cat">
    <antetitulo><![CDATA[This is an introduction text]]></antetitulo>
    <titulo><![CDATA[This is a title]]></titulo>
    <resumen><![CDATA[ This a a summary ]]></resumen>
    <texto><![CDATA[This is the body of my article<br><br>]]></texto>
    <autor><![CDATA[John Doe]]></autor>
    <fecha><![CDATA[31/10/2001]]></fecha>
    <fuente><![CDATA[]]></fuente>
    <webexterna><![CDATA[]]></webexterna>
    <recursos></recursos>
    <ambitos></ambitos>
  </contenido>
</root>

After struggling for a (long) while with different configuration scenarios, here is the data-config.xml I ended up with:

<dataConfig>
  <dataSource type="HttpDataSource"/>
  <document>
    <entity name="oldsearchcontentlist" pk="m_guid"
            url="http://localhost/psc/trunk/ezfiles/list_old_content.php?limit=10&urlsonly=1"
            processor="XPathEntityProcessor"
            forEach="/root/entries/entry">
      <field column="elementurl" xpath="/root/entries/entry/source/" />
      <entity name="oldsearchcontent" pk="m_guid"
              url="${oldsearchcontentlist.elementurl}"
              processor="XPathEntityProcessor"
              forEach="/root/contenido"
              transformer="TemplateTransformer">
        <field column="m_guid" xpath="/root/contenido/titulo" />
      </entity>
    </entity>
  </document>
</dataConfig>

As a note, I had to check out Solr's trunk and patch it with the following: https://issues.apache.org/jira/browse/SOLR-469 (https://issues.apache.org/jira/secure/attachment/12380679/SOLR-469.patch), and recompile.

Running the following command:

http://localhost:8983/solr/dataimport?command=full-import&verbose=on&debug=on

tells me that no Document was created at all, and does not throw any error... Here is the full output:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">39</int>
  </lst>
  <lst name="initArgs">
    <lst name="defaults">
      <str name="config">data-config.xml</str>
      <lst name="datasource">
        <str name="type">HttpDataSource</str>
      </lst>
    </lst>
  </lst>
  <str name="command">full-import</str>
  <str name="mode">debug</str>
  <null name="documents"/>
  <lst name="verbose-output">
    <lst name="entity:oldsearchcontentlist">
      <lst name="document#1">
        <str name="query">http://localhost/psc/trunk/ezfiles/list_old_content.php?limit=10&urlsonly=1</str>
        <str name="time-taken">0:0:0.23</str>
      </lst>
    </lst>
  </lst>
  <str name="status">idle</str>
  <str name="importResponse">Configuration Re-loaded sucessfully</str>
  <lst name="statusMessages">
    <str name="Total Requests made to DataSource">1</str>
    <str name="Total Rows Fetched">0</str>
    <str name="Total Documents Skipped">0</str>
    <str name="Full Dump Started">2008-06-10 14:38:56</str>
    <str name="">Indexing completed. Added/Updated: 0 documents. Deleted 0 documents.</str>
    <str name="Committed">2008-06-10 14:38:56</str>
    <str name="Time taken ">0:0:0.32</str>
  </lst>
  <str name="WARNING">This response format is experimental. It is likely to change in the future.</str>
</response>

I am sure I am mis-doing something, but cannot figure out what. I read through all the online documentation several times, plus the full examples (Slashdot RSS feed). I would gladly have feedback from anyone who tried to index HTTP/XML sources and got it to work smoothly.
Thanks a million in advance,
Regards,
Nicolas
--
Nicolas Pastorino
eZ Systems ( Western Europe ) | http://ez.no

--Noble Paul
--
Nicolas Pastorino
Consultant - Trainer - System Developer
Phone : +33 (0)4.78.37.01.34
eZ Systems ( Western Europe ) | http://ez.no