It is a bug, nice catch! There needs to be a null check in that method. Can you try replacing the method with the following?
private Node getMatchingChild(XMLStreamReader parser) {
  // The current element may have no mapped child elements at all (e.g. when
  // only one of its attributes is mapped), in which case childNodes is null.
  // Bail out instead of iterating over it.
  if (childNodes == null) return null;
  String localName = parser.getLocalName();
  for (Node n : childNodes) {
    if (n.name.equals(localName)) {
      // No attribute constraints: the name match alone is enough.
      if (n.attribAndValues == null) return n;
      if (checkForAttributes(parser, n.attribAndValues)) return n;
    }
  }
  return null;
}

I tried with that code and it is working. We shall add it in the next patch.
--Noble
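If you want to verify the fix locally, here is a minimal sketch of a reproduction. It assumes the XPathRecordReader API as it stands in the DataImportHandler sources (a public constructor taking the forEach xpath, plus addField(name, xpath, multiValued) and getAllRecords(Reader)); the class name XPathNpeRepro and the sample values are only illustrative:

import java.io.StringReader;
import java.util.List;
import java.util.Map;
import org.apache.solr.handler.dataimport.XPathRecordReader;

public class XPathNpeRepro {
  public static void main(String[] args) {
    // Only the 'id' attribute of <contenido> is mapped, so the internal node
    // for 'contenido' gets attribAndValues set but its childNodes stays null.
    // Hitting the child element <titulo> during parsing then triggered the
    // NullPointerException in getMatchingChild shown in the trace below.
    String xml = "<root><contenido id=\"10097\" idioma=\"cat\">"
               + "<titulo>This is my title</titulo></contenido></root>";
    // Constructor argument and addField signature assumed from the DIH sources.
    XPathRecordReader rr = new XPathRecordReader("/root/contenido");
    rr.addField("m_guid", "/root/contenido/@id", false);
    List<Map<String, Object>> records = rr.getAllRecords(new StringReader(xml));
    // With the null check in place this should print one record containing
    // m_guid=10097 instead of throwing.
    System.out.println(records);
  }
}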
On Tue, Jun 10, 2008 at 9:11 PM, Nicolas Pastorino <[EMAIL PROTECTED]> wrote:
> I just forgot to mention the error related to the description below. I get
> the following when running a full-import ( sorry for the noise .. ):
>
> SEVERE: Full Import failed
> java.lang.RuntimeException: java.lang.NullPointerException
>     at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:85)
>     at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:207)
>     at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:161)
>     at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:144)
>     at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:280)
>     at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:302)
>     at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:173)
>     at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:134)
>     at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:323)
>     at org.apache.solr.handler.dataimport.DataImporter.rumCmd(DataImporter.java:374)
>     at org.apache.solr.handler.dataimport.DataImportHandler.handleRequestBody(DataImportHandler.java:179)
>     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:125)
>     at org.apache.solr.core.SolrCore.execute(SolrCore.java:965)
>     at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:272)
>     at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
>     at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
>     at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>     at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
>     at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
>     at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
>     at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
>     at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
>     at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
>     at org.mortbay.jetty.Server.handle(Server.java:285)
>     at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
>     at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:821)
>     at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:513)
>     at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208)
>     at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
>     at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
>     at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
> Caused by: java.lang.NullPointerException
>     at org.apache.solr.handler.dataimport.XPathRecordReader$Node.getMatchingChild(XPathRecordReader.java:198)
>     at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:171)
>     at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:174)
>     at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:174)
>     at org.apache.solr.handler.dataimport.XPathRecordReader$Node.access$000(XPathRecordReader.java:89)
>     at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:82)
>     ... 31 more
>
> Regards,
> Nicolas Pastorino
>
> On Jun 10, 2008, at 17:38 , Nicolas Pastorino wrote:
>
>> Thanks a lot, it works fine now, fetching subelements properly.
>> The only issue left is that the XPath syntax passed in the data-config.xml
>> does not seem to work properly. As an example, processing the following
>> entity:
>>
>> <root>
>>   <contenido id="10097" idioma="cat">
>>     <antetitulo></antetitulo>
>>     <titulo>
>>       This is my title
>>     </titulo>
>>     <resumen>
>>       This is my summary
>>     </resumen>
>>     <texto>
>>       This is the body of my text
>>     </texto>
>>   </contenido>
>> </root>
>>
>> and trying to fill a Solr field with the 'id' attribute of the 'contenido'
>> tag with the following config:
>>
>> <field column="m_guid" xpath="/root/contenido/@id" />
>>
>> does not seem to work properly.
>>
>> Thanks a lot for your time already!
>>
>> Regards,
>> Nicolas Pastorino
>>
>> On Jun 10, 2008, at 14:55 , Noble Paul നോബിള് नोब्ळ् wrote:
>>
>>> The configuration is fine but for one detail.
>>> The documents are to be created for the entity 'oldsearchcontent', not
>>> for the root entity. So add an attribute rootEntity="false" to the
>>> entity 'oldsearchcontentlist' as follows:
>>>
>>> <entity name="oldsearchcontentlist"
>>>         url="http://localhost/psc/trunk/ezfiles/list_old_content.php?limit=10&urlsonly=1"
>>>         processor="XPathEntityProcessor"
>>>         forEach="/root/entries/entry"
>>>         rootEntity="false">
>>>
>>> This means that the entity directly under this one ('oldsearchcontent')
>>> will be treated as the root, and documents will be created for it.
>>> --Noble
>>>
>>> On Tue, Jun 10, 2008 at 6:15 PM, Nicolas Pastorino <[EMAIL PROTECTED]> wrote:
>>>>
>>>> Hello fellow Solr users!
>>>>
>>>> I am in the process of trying to index XML documents in Solr. I went
>>>> for the DataImportHandler approach, which seemed to perfectly suit this
>>>> need. Due to the large amount of XML documents to be indexed ( ~60MB ),
>>>> I thought it would hardly be possible to feed Solr with the
>>>> concatenation of all these docs at once.
>>>> Hence the small PHP script I wrote, serving over HTTP the list of these
>>>> documents, in the following form ( available from a local URL referenced
>>>> in data-config.xml ):
>>>>
>>>> <?xml version="1.0" encoding="UTF-8"?>
>>>> <root>
>>>>   <entries>
>>>>     <entry>
>>>>       <realm>old_search_content</realm>
>>>>       <source>http://localhost/psc/trunk/ezfiles/extension/psc/doc/xml/10098.xml</source>
>>>>     </entry>
>>>>     <entry>
>>>>       <realm>old_search_content</realm>
>>>>       <source>http://localhost/psc/trunk/ezfiles/extension/psc/doc/xml/10099.xml</source>
>>>>     </entry>
>>>>     <entry>
>>>>       <realm>old_search_content</realm>
>>>>       <source>http://localhost/psc/trunk/ezfiles/extension/psc/doc/xml/all_in_one.xml</source>
>>>>     </entry>
>>>>   </entries>
>>>> </root>
>>>>
>>>> The idea would be to have one single data-config.xml configuration file
>>>> for the DataImportHandler, which would read the listing presented above,
>>>> then request every single subitem and index it. Every subitem has the
>>>> following structure:
>>>>
>>>> <?xml version="1.0" encoding="ISO-8859-1" ?>
>>>> <root>
>>>>   <contenido id="10099" idioma="cat">
>>>>     <antetitulo><![CDATA[This is an introduction text]]></antetitulo>
>>>>     <titulo><![CDATA[This is a title]]></titulo>
>>>>     <resumen><![CDATA[ This is a summary ]]></resumen>
>>>>     <texto><![CDATA[This is the body of my article<br><br>]]></texto>
>>>>     <autor><![CDATA[John Doe]]></autor>
>>>>     <fecha><![CDATA[31/10/2001]]></fecha>
>>>>     <fuente><![CDATA[]]></fuente>
>>>>     <webexterna><![CDATA[]]></webexterna>
>>>>     <recursos></recursos>
>>>>     <ambitos></ambitos>
>>>>   </contenido>
>>>> </root>
>>>>
>>>> After struggling for a ( long ) while with different configuration
>>>> scenarios, here is the data-config.xml I ended up with:
>>>>
>>>> <dataConfig>
>>>>   <dataSource type="HttpDataSource"/>
>>>>   <document>
>>>>     <entity name="oldsearchcontentlist"
>>>>             pk="m_guid"
>>>>             url="http://localhost/psc/trunk/ezfiles/list_old_content.php?limit=10&urlsonly=1"
>>>>             processor="XPathEntityProcessor"
>>>>             forEach="/root/entries/entry">
>>>>       <field column="elementurl" xpath="/root/entries/entry/source/" />
>>>>       <entity name="oldsearchcontent"
>>>>               pk="m_guid"
>>>>               url="${oldsearchcontentlist.elementurl}"
>>>>               processor="XPathEntityProcessor"
>>>>               forEach="/root/contenido"
>>>>               transformer="TemplateTransformer">
>>>>         <field column="m_guid" xpath="/root/contenido/titulo" />
>>>>       </entity>
>>>>     </entity>
>>>>   </document>
>>>> </dataConfig>
>>>>
>>>> As a note, I had to check out Solr's trunk and patch it with the
>>>> following: https://issues.apache.org/jira/browse/SOLR-469
>>>> ( https://issues.apache.org/jira/secure/attachment/12380679/SOLR-469.patch ),
>>>> then recompile.
>>>> Running the following command:
>>>>
>>>> http://localhost:8983/solr/dataimport?command=full-import&verbose=on&debug=on
>>>>
>>>> tells me that no document was created at all, and does not throw any
>>>> error... Here is the full output:
>>>>
>>>> <response>
>>>>   <lst name="responseHeader">
>>>>     <int name="status">0</int>
>>>>     <int name="QTime">39</int>
>>>>   </lst>
>>>>   <lst name="initArgs">
>>>>     <lst name="defaults">
>>>>       <str name="config">data-config.xml</str>
>>>>       <lst name="datasource">
>>>>         <str name="type">HttpDataSource</str>
>>>>       </lst>
>>>>     </lst>
>>>>   </lst>
>>>>   <str name="command">full-import</str>
>>>>   <str name="mode">debug</str>
>>>>   <null name="documents"/>
>>>>   <lst name="verbose-output">
>>>>     <lst name="entity:oldsearchcontentlist">
>>>>       <lst name="document#1">
>>>>         <str name="query">
>>>>           http://localhost/psc/trunk/ezfiles/list_old_content.php?limit=10&urlsonly=1
>>>>         </str>
>>>>         <str name="time-taken">0:0:0.23</str>
>>>>       </lst>
>>>>     </lst>
>>>>   </lst>
>>>>   <str name="status">idle</str>
>>>>   <str name="importResponse">Configuration Re-loaded sucessfully</str>
>>>>   <lst name="statusMessages">
>>>>     <str name="Total Requests made to DataSource">1</str>
>>>>     <str name="Total Rows Fetched">0</str>
>>>>     <str name="Total Documents Skipped">0</str>
>>>>     <str name="Full Dump Started">2008-06-10 14:38:56</str>
>>>>     <str name="">
>>>>       Indexing completed. Added/Updated: 0 documents.
>>>>       Deleted 0 documents.
>>>>     </str>
>>>>     <str name="Committed">2008-06-10 14:38:56</str>
>>>>     <str name="Time taken ">0:0:0.32</str>
>>>>   </lst>
>>>>   <str name="WARNING">
>>>>     This response format is experimental. It is likely to change
>>>>     in the future.
>>>>   </str>
>>>> </response>
>>>>
>>>> I am sure I am doing something wrong, but I cannot figure out what. I
>>>> read through all the online documentation several times, plus the full
>>>> examples ( the Slashdot RSS feed ).
>>>> I would gladly hear from anyone who has tried to index HTTP/XML sources
>>>> and got it to work smoothly.
>>>>
>>>> Thanks a million in advance,
>>>>
>>>> Regards,
>>>> Nicolas
>>>> --
>>>> Nicolas Pastorino
>>>> eZ Systems ( Western Europe ) | http://ez.no
>>>
>>> --
>>> --Noble Paul
>>
>> --
>> Nicolas Pastorino
>> Consultant - Trainer - System Developer
>> Phone : +33 (0)4.78.37.01.34
>> eZ Systems ( Western Europe ) | http://ez.no
>
> --
> Nicolas Pastorino
> Consultant - Trainer - System Developer
> Phone : +33 (0)4.78.37.01.34
> eZ Systems ( Western Europe ) | http://ez.no

--
--Noble Paul