Thanks a million for your time and help.
It indeed works smoothly now.

By the way, I also had to apply the "patch" attached to the following message: http://www.nabble.com/Re%3A-How-to-describe-2-entities-in-dataConfig-for-the-DataImporter--p17577610.html to stop the TemplateTransformer from throwing NullPointerExceptions :)

Cheers !
--
Nicolas Pastorino

On Jun 10, 2008, at 18:05 , Noble Paul നോബിള്‍ नोब्ळ् wrote:

It is a bug, nice catch.
There needs to be a null check in the method.
Can you just try replacing the method with the following?

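private Node getMatchingChild(XMLStreamReader parser) {
  // childNodes is null when this node has no children; without this
  // check the loop below throws a NullPointerException
  if (childNodes == null) return null;
  String localName = parser.getLocalName();
  for (Node n : childNodes) {
    if (n.name.equals(localName)) {
      // No attribute constraints on this node: the name match is enough
      if (n.attribAndValues == null)
        return n;
      if (checkForAttributes(parser, n.attribAndValues))
        return n;
    }
  }
  return null;
}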

I tried that code and it works. We shall add it to the next patch.
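
To see the effect of the null check in isolation, here is a minimal sketch. The Node class below is a simplified, hypothetical stand-in (it matches on the element name only and takes a plain String instead of an XMLStreamReader); it is not the actual XPathRecordReader.Node from the patch.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for a tree node; NOT the actual Solr XPathRecordReader.Node.
class NullGuardDemo {
    static class Node {
        final String name;
        List<Node> childNodes; // stays null until the first child is added

        Node(String name) {
            this.name = name;
        }

        void addChild(Node child) {
            if (childNodes == null) childNodes = new ArrayList<>();
            childNodes.add(child);
        }

        // Returns the first child with the given name, or null if there is none.
        Node getMatchingChild(String localName) {
            if (childNodes == null) return null; // no children: nothing to iterate over
            for (Node n : childNodes) {
                if (n.name.equals(localName)) return n;
            }
            return null;
        }
    }

    public static void main(String[] args) {
        Node root = new Node("root");
        Node leaf = new Node("titulo");
        root.addChild(leaf);
        System.out.println(root.getMatchingChild("titulo").name); // prints "titulo"
        System.out.println(leaf.getMatchingChild("anything"));    // prints "null"
    }
}
```

Without the guard on the first line of getMatchingChild, the second call would iterate over a null list and fail with a NullPointerException, which is the failure reported in the stack trace quoted further down.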


--Noble
On Tue, Jun 10, 2008 at 9:11 PM, Nicolas Pastorino <[EMAIL PROTECTED]> wrote:
I just forgot to mention the error related to the description below. I get
the following when running a full-import ( sorry for the noise .. ) :

SEVERE: Full Import failed
java.lang.RuntimeException: java.lang.NullPointerException
       at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:85)
       at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:207)
       at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:161)
       at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:144)
       at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:280)
       at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:302)
       at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:173)
       at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:134)
       at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:323)
       at org.apache.solr.handler.dataimport.DataImporter.rumCmd(DataImporter.java:374)
       at org.apache.solr.handler.dataimport.DataImportHandler.handleRequestBody(DataImportHandler.java:179)
       at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:125)
       at org.apache.solr.core.SolrCore.execute(SolrCore.java:965)
       at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
       at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:272)
       at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
       at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
       at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
       at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
       at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
       at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
       at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
       at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
       at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
       at org.mortbay.jetty.Server.handle(Server.java:285)
       at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
       at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:821)
       at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:513)
       at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208)
       at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
       at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
       at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
Caused by: java.lang.NullPointerException
       at org.apache.solr.handler.dataimport.XPathRecordReader$Node.getMatchingChild(XPathRecordReader.java:198)
       at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:171)
       at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:174)
       at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:174)
       at org.apache.solr.handler.dataimport.XPathRecordReader$Node.access$000(XPathRecordReader.java:89)
       at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:82)
       ... 31 more

Regards,
Nicolas Pastorino

On Jun 10, 2008, at 17:38 , Nicolas Pastorino wrote:

Thanks a lot, it works fine now, fetching subelements properly.
The only issue left is that the XPath syntax passed in data-config.xml does not seem to work properly. As an example, processing the following entity:

<root>
       <contenido id="10097" idioma="cat">
       <antetitulo></antetitulo>
       <titulo>
               This is my title
       </titulo>
       <resumen>
               This is my summary
       </resumen>
       <texto>
               This is the body of my text
       </texto>
       </contenido>
</root>

and trying to fill a solr field with the 'id' attribute of the 'contenido'
tag with the following config :
<field column="m_guid" xpath="/root/contenido/@id" />

does not seem to work properly.
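
As a sanity check on the expression itself, /root/contenido/@id is valid standard XPath and selects the attribute when evaluated with the JDK's javax.xml.xpath API. The minimal sketch below only shows that; it does not exercise the DataImportHandler's own streaming XPathRecordReader, which is a separate implementation and may not support everything full XPath does. The class name and the shortened sample document are assumptions made for illustration.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Evaluates an XPath expression with the JDK's built-in engine (not Solr's).
class AttributeXPathDemo {
    // Shortened version of the sample entity from the message above.
    static final String XML =
          "<root>"
        + "<contenido id=\"10097\" idioma=\"cat\">"
        + "<titulo>This is my title</titulo>"
        + "</contenido>"
        + "</root>";

    // Returns the string value of the first node matched by the expression.
    static String evaluate(String expression, String xml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate: " + expression, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(evaluate("/root/contenido/@id", XML)); // prints "10097"
    }
}
```

If the same expression yields nothing inside the DataImportHandler, the problem is more likely in how the importer handles attribute paths than in the expression.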

Thanks a lot for your time already !

Regards,
Nicolas Pastorino



On Jun 10, 2008, at 14:55 , Noble Paul നോബിള്‍ नोब्ळ् wrote:

The configuration is fine but for one detail.
The documents are to be created for the entity 'oldsearchcontent', not for the root entity. So add an attribute rootEntity="false" to the entity 'oldsearchcontentlist' as follows.

  <entity name="oldsearchcontentlist"
          url="http://localhost/psc/trunk/ezfiles/list_old_content.php?limit=10&amp;urlsonly=1"
          processor="XPathEntityProcessor"
          forEach="/root/entries/entry"
          rootEntity="false">

This means that the entity directly under this one ('oldsearchcontent')
will be treated as the root, and documents will be created for it.
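
The rule can be pictured with a toy model. The classes below are hypothetical and are not DataImportHandler code; the sketch assumes rootEntity defaults to true and, as a simplification, follows only the first child entity.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of nested entities; these classes are hypothetical, not DataImportHandler code.
class RootEntityDemo {
    static class Entity {
        final String name;
        final boolean rootEntity; // mirrors the rootEntity attribute
        final List<Entity> children = new ArrayList<>();

        Entity(String name, boolean rootEntity) {
            this.name = name;
            this.rootEntity = rootEntity;
        }

        Entity add(Entity child) {
            children.add(child);
            return this;
        }
    }

    // Descends from the top-level entity while rootEntity is false and returns the
    // name of the entity whose rows would become documents.
    // Simplification: only the first child is followed.
    static String documentEntity(Entity top) {
        Entity current = top;
        while (!current.rootEntity && !current.children.isEmpty()) {
            current = current.children.get(0);
        }
        return current.name;
    }

    public static void main(String[] args) {
        Entity withDefault = new Entity("oldsearchcontentlist", true)
                .add(new Entity("oldsearchcontent", true));
        Entity withFlag = new Entity("oldsearchcontentlist", false)
                .add(new Entity("oldsearchcontent", true));
        System.out.println(documentEntity(withDefault)); // prints "oldsearchcontentlist"
        System.out.println(documentEntity(withFlag));    // prints "oldsearchcontent"
    }
}
```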
--Noble

On Tue, Jun 10, 2008 at 6:15 PM, Nicolas Pastorino <[EMAIL PROTECTED]> wrote:

Hello fellow Solr users !


I am in the process of trying to index XML documents in Solr. I went for
the DataImportHandler approach, which seemed to suit this need perfectly.
Due to the large amount of XML documents to be indexed (~60MB), I thought
it would hardly be possible to feed Solr with the concatenation of all
these docs at once. Hence this small PHP script I wrote, serving over HTTP
the list of these documents in the following form (available from a local
URL replicated in data-config.xml):


<?xml version="1.0" encoding="UTF-8"?>
<root>
<entries>
      <entry>
              <realm>old_search_content</realm>
              <source>http://localhost/psc/trunk/ezfiles/extension/psc/doc/xml/10098.xml</source>
      </entry>
      <entry>
              <realm>old_search_content</realm>
              <source>http://localhost/psc/trunk/ezfiles/extension/psc/doc/xml/10099.xml</source>
      </entry>
      <entry>
              <realm>old_search_content</realm>
              <source>http://localhost/psc/trunk/ezfiles/extension/psc/doc/xml/all_in_one.xml</source>
      </entry>
</entries>
</root>


The idea would be to have one single data-config.xml configuration file
for the DataImportHandler, which would read the listing presented above,
request every single subitem, and index it. Every subitem has the
following structure:
<?xml version="1.0" encoding="ISO-8859-1" ?>
<root>
      <contenido id="10099" idioma="cat">
              <antetitulo><![CDATA[This is an introduction text]]></antetitulo>
              <titulo><![CDATA[This is a title]]></titulo>
              <resumen><![CDATA[ This a a summary ]]></resumen>
              <texto><![CDATA[This is the body of my article<br><br>]]>
              </texto>
              <autor><![CDATA[John Doe]]></autor>
              <fecha><![CDATA[31/10/2001]]></fecha>
              <fuente><![CDATA[]]></fuente>
              <webexterna><![CDATA[]]></webexterna>
              <recursos></recursos>
              <ambitos></ambitos>
      </contenido>
</root>
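
The first half of that two-step idea (read the listing, collect the URL of every subitem) can be sketched outside Solr with the JDK's StAX parser. This is only an illustration under stated assumptions: the class and method names are made up, the listing is held in an in-memory string instead of being fetched over HTTP, and it is not how the DataImportHandler itself is implemented.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Collects the text of every <source> element from a listing document.
class ListingReaderDemo {
    // Two-entry version of the listing shown earlier in this message.
    static final String LISTING =
          "<root><entries>"
        + "<entry><realm>old_search_content</realm>"
        + "<source>http://localhost/psc/trunk/ezfiles/extension/psc/doc/xml/10098.xml</source></entry>"
        + "<entry><realm>old_search_content</realm>"
        + "<source>http://localhost/psc/trunk/ezfiles/extension/psc/doc/xml/10099.xml</source></entry>"
        + "</entries></root>";

    static List<String> sourceUrls(String listingXml) {
        List<String> urls = new ArrayList<>();
        try {
            XMLStreamReader parser = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(listingXml));
            while (parser.hasNext()) {
                // Only <source> start tags are of interest; everything else is skipped.
                if (parser.next() == XMLStreamConstants.START_ELEMENT
                        && "source".equals(parser.getLocalName())) {
                    urls.add(parser.getElementText().trim());
                }
            }
            parser.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Malformed listing XML", e);
        }
        return urls;
    }

    public static void main(String[] args) {
        // Each URL printed here is what the inner entity would then request and index.
        for (String url : sourceUrls(LISTING)) {
            System.out.println(url);
        }
    }
}
```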



After struggling for a (long) while with different configuration
scenarios, here is the data-config.xml I ended up with:


<dataConfig>
      <dataSource type="HttpDataSource"/>
      <document>
              <entity name="oldsearchcontentlist"
                              pk="m_guid"
                              url="http://localhost/psc/trunk/ezfiles/list_old_content.php?limit=10&amp;urlsonly=1"
                              processor="XPathEntityProcessor"
                              forEach="/root/entries/entry">

                      <field column="elementurl"
                              xpath="/root/entries/entry/source/" />

                      <entity name="oldsearchcontent"
                              pk="m_guid"
                              url="${oldsearchcontentlist.elementurl}"
                              processor="XPathEntityProcessor"
                              forEach="/root/contenido"
                              transformer="TemplateTransformer">
                              <field column="m_guid"
                                      xpath="/root/contenido/titulo" />
                      </entity>
              </entity>
      </document>
</dataConfig>


As a note, I had to check out Solr's trunk, patch it with the
following: https://issues.apache.org/jira/browse/SOLR-469
(https://issues.apache.org/jira/secure/attachment/12380679/SOLR-469.patch),
and recompile.
Requesting the following URL:

http://localhost:8983/solr/dataimport?command=full-import&verbose=on&debug=on

tells me that no Document was created at all, and does not throw any
error. Here is the full output:


<response>
      <lst name="responseHeader">
              <int name="status">0</int>
              <int name="QTime">39</int>
      </lst>
      <lst name="initArgs">
              <lst name="defaults">
                      <str name="config">data-config.xml</str>
                      <lst name="datasource">
                              <str name="type">HttpDataSource</str>
                      </lst>
              </lst>
      </lst>
      <str name="command">full-import</str>
      <str name="mode">debug</str>
      <null name="documents"/>
      <lst name="verbose-output">
              <lst name="entity:oldsearchcontentlist">
                      <lst name="document#1">
                              <str name="query">
http://localhost/psc/trunk/ezfiles/list_old_content.php?limit=10&urlsonly=1
                              </str>
                              <str name="time-taken">0:0:0.23</str>
                      </lst>
              </lst>
      </lst>
      <str name="status">idle</str>
      <str name="importResponse">Configuration Re-loaded sucessfully</str>
      <lst name="statusMessages">
              <str name="Total Requests made to DataSource">1</str>
              <str name="Total Rows Fetched">0</str>
              <str name="Total Documents Skipped">0</str>
              <str name="Full Dump Started">2008-06-10 14:38:56</str>
              <str name="">
Indexing completed. Added/Updated: 0 documents.
Deleted 0 documents.
              </str>
              <str name="Committed">2008-06-10 14:38:56</str>
              <str name="Time taken ">0:0:0.32</str>
      </lst>
      <str name="WARNING">
This response format is experimental. It is likely to change in the future.
      </str>
</response>


I am sure I am doing something wrong, but cannot figure out what. I read
through all the online documentation several times, plus the full examples
(Slashdot RSS feed).
I would gladly have feedback from anyone who tried to index HTTP/XML
sources and got it to work smoothly.

Thanks a million in advance,

Regards,
Nicolas
--
Nicolas Pastorino
eZ Systems ( Western Europe )  |  http://ez.no








--
--Noble Paul

--
Nicolas Pastorino
Consultant - Trainer - System Developer
Phone :  +33 (0)4.78.37.01.34
eZ Systems ( Western Europe )  |  http://ez.no




