Hi Fergus,

It seems that a field DIH expects is missing from the XML.

<field column="fileAbsPath" template="${jcurrent.fileAbsolutePath}" />
<field column="fileWebPath" regex="/Volumes/spare/ts/(.*)" replaceWith="$1"
sourceColName="*fileAbsePath*"/>

I guess "fileAbsePath" is a typo? Can you check if that is the cause?
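
If the typo does turn out to be the cause, the corrected pair of field definitions might look like the sketch below. It assumes the intended source column is the "fileAbsPath" column defined on the line just above it; nothing else is changed.

```xml
<!-- Assumption: sourceColName should name the "fileAbsPath" column defined in the first field -->
<field column="fileAbsPath" template="${jcurrent.fileAbsolutePath}" />
<field column="fileWebPath" regex="/Volumes/spare/ts/(.*)" replaceWith="$1"
       sourceColName="fileAbsPath"/>
```

With that change the regex has a value to operate on, so a path such as /Volumes/spare/ts/ftic/groups/j0036.xml should end up as ftic/groups/j0036.xml in fileWebPath.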


On Wed, Jan 21, 2009 at 5:40 PM, Fergus McMenemie <fer...@twig.me.uk> wrote:

> Shalin
>
> Downloaded nightly for 21jan and tried DIH again. It's better but
> still broken. Dozens of embedded tags are stripped from documents
> but it now fails every few documents for no reason I can see. Manually
> removing embedded tags causes a given problem document to be indexed,
> only to have it fail on one of the next few documents. I think the
> problem is still in stripHTML.
>
> Here is the traceback.
>
> Jan 21, 2009 12:06:53 PM org.apache.catalina.startup.Catalina start
> INFO: Server startup in 3377 ms
> Jan 21, 2009 12:07:39 PM org.apache.solr.handler.dataimport.SolrWriter
> readIndexerProperties
> INFO: Read dataimport.properties
> Jan 21, 2009 12:07:39 PM org.apache.solr.core.SolrCore execute
> INFO: [fdocs] webapp=/solr path=/dataimport params={command=full-import}
> status=0 QTime=13
> Jan 21, 2009 12:07:39 PM org.apache.solr.handler.dataimport.DataImporter
> doFullImport
> INFO: Starting Full Import
> Jan 21, 2009 12:07:39 PM org.apache.solr.update.DirectUpdateHandler2
> deleteAll
> INFO: [fdocs] REMOVING ALL DOCUMENTS FROM INDEX
> Jan 21, 2009 12:07:39 PM org.apache.solr.core.SolrDeletionPolicy onInit
> INFO: SolrDeletionPolicy.onInit: commits:num=2
>
>  
> commit{dir=/Volumes/spare/ts/solrnightlyf/data/index,segFN=segments_1,version=1232539612130,generation=1,filenames=[segments_1]
>
>  
> commit{dir=/Volumes/spare/ts/solrnightlyf/data/index,segFN=segments_2,version=1232539612131,generation=2,filenames=[segments_2]
> Jan 21, 2009 12:07:39 PM org.apache.solr.core.SolrDeletionPolicy
> updateCommits
> INFO: last commit = 1232539612131
> Jan 21, 2009 12:07:40 PM org.apache.solr.handler.dataimport.DocBuilder
> buildDocument
> SEVERE: Exception while processing: jc document : null
> org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing
> failed for xml, url:/Volumes/spare/ts/ftic/groups/j0036.xmlrows processed :0
> Processing Document # 9
>        at
> org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
>        at
> org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:252)
>        at
> org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:177)
>         at
> org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
>        at
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313)
>        at
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339)
>        at
> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202)
>        at
> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147)
>        at
> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:321)
>        at
> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381)
>        at
> org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:362)
> Caused by: java.lang.RuntimeException: java.util.NoSuchElementException
>        at
> org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:85)
>        at
> org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:242)
>        ... 9 more
> Caused by: java.util.NoSuchElementException
>        at
> com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1083)
>        at
> org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:141)
>        at
> org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:174)
>        at
> org.apache.solr.handler.dataimport.XPathRecordReader$Node.access$000(XPathRecordReader.java:89)
>        at
> org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:82)
>        ... 10 more
> Jan 21, 2009 12:07:40 PM org.apache.solr.handler.dataimport.DataImporter
> doFullImport
> SEVERE: Full Import failed
> org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing
> failed for xml, url:/Volumes/spare/ts/ftic/groups/j0036.xmlrows processed :0
> Processing Document # 9
>        at
> org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
>        at
> org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:252)
>        at
> org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:177)
>         at
> org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
>        at
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313)
>        at
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339)
>        at
> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202)
>        at
> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147)
>        at
> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:321)
>        at
> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381)
>        at
> org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:362)
> Caused by: java.lang.RuntimeException: java.util.NoSuchElementException
>        at
> org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:85)
>        at
> org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:242)
>        ... 9 more
> Caused by: java.util.NoSuchElementException
>        at
> com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1083)
>        at
> org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:141)
>        at
> org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:174)
>        at
> org.apache.solr.handler.dataimport.XPathRecordReader$Node.access$000(XPathRecordReader.java:89)
>        at
> org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:82)
>        ... 10 more
> Jan 21, 2009 12:07:40 PM org.apache.solr.update.DirectUpdateHandler2
> rollback
> INFO: start rollback
>
>
>
> >Ah, it needs a null check for multi-valued fields. I've committed a fix to
> >trunk. The next nightly build should have it. You can check out and build
> >from the trunk if you need this immediately.
> >
> >On Mon, Jan 19, 2009 at 7:02 PM, Fergus McMenemie <fer...@twig.me.uk>
> wrote:
> >
> >> Hmmm,
> >>
> >> Just to clarify I retested the thing using the nightly as of today
> >> 18-jan-2009. The problem is still there and this traceback is from
> >> that nightly.
> >>
> >> >>This looks fine. Can you post the stack trace?
> >> >>
> >> >Yep, here is the juicy bit. Let me know if you need more.
> >> >
> >> >Jan 19, 2009 11:08:03 AM org.apache.catalina.startup.Catalina start
> >> >INFO: Server startup in 2390 ms
> >> >Jan 19, 2009 11:14:06 AM org.apache.solr.core.SolrCore execute
> >> >INFO: [janesdocs] webapp=/solr path=/dataimport
> >> params={command=full-import} status=0 QTime=12
> >> >Jan 19, 2009 11:14:06 AM org.apache.solr.handler.dataimport.SolrWriter
> >> readIndexerProperties
> >> >INFO: Read dataimport.properties
> >> >Jan 19, 2009 11:14:06 AM
> org.apache.solr.handler.dataimport.DataImporter
> >> doFullImport
> >> >INFO: Starting Full Import
> >> >Jan 19, 2009 11:14:06 AM org.apache.solr.update.DirectUpdateHandler2
> >> deleteAll
> >> >INFO: [janesdocs] REMOVING ALL DOCUMENTS FROM INDEX
> >> >Jan 19, 2009 11:14:06 AM org.apache.solr.core.SolrDeletionPolicy onInit
> >> >INFO: SolrDeletionPolicy.onInit: commits:num=2
> >> >
> >>
> commit{dir=/Volumes/spare/ts/solrnightlyjanes/data/index,segFN=segments_1,version=1232363283058,generation=1,filenames=[segments_1]
> >> >
> >>
> commit{dir=/Volumes/spare/ts/solrnightlyjanes/data/index,segFN=segments_2,version=1232363283059,generation=2,filenames=[segments_2]
> >> >Jan 19, 2009 11:14:06 AM org.apache.solr.core.SolrDeletionPolicy
> >> updateCommits
> >> >INFO: last commit = 1232363283059
> >> >Jan 19, 2009 11:14:06 AM
> >> org.apache.solr.handler.dataimport.EntityProcessorBase applyTransformer
> >> >WARNING: transformer threw error
> >> >java.lang.NullPointerException
> >> >       at java.io.StringReader.<init>(StringReader.java:33)
> >> >       at
> >>
> org.apache.solr.handler.dataimport.HTMLStripTransformer.stripHTML(HTMLStripTransformer.java:71)
> >> >       at
> >>
> org.apache.solr.handler.dataimport.HTMLStripTransformer.transformRow(HTMLStripTransformer.java:54)
> >> >       at
> >>
> org.apache.solr.handler.dataimport.EntityProcessorBase.applyTransformer(EntityProcessorBase.java:187)
> >> >       at
> >>
> org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:197)
> >> >       at
> >>
> org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
> >> >       at
> >>
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313)
> >> >       at
> >>
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339)
> >> >       at
> >>
> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202)
> >> >       at
> >>
> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147)
> >> >       at
> >>
> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:321)
> >> >       at
> >>
> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381)
> >> >       at
> >>
> org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:362)
> >> >Jan 19, 2009 11:14:06 AM org.apache.solr.handler.dataimport.DocBuilder
> >> buildDocument
> >> >SEVERE: Exception while processing: janescurrent document : null
> >> >org.apache.solr.handler.dataimport.DataImportHandlerException:
> >> java.lang.NullPointerException
> >> >       at
> >>
> org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:64)
> >> >       at
> >>
> org.apache.solr.handler.dataimport.EntityProcessorBase.applyTransformer(EntityProcessorBase.java:203)
> >> >       at
> >>
> org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:197)
> >> >       at
> >>
> org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
> >> >       at
> >>
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313)
> >> >       at
> >>
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339)
> >> >       at
> >>
> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202)
> >> >       at
> >>
> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147)
> >> >       at
> >>
> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:321)
> >> >       at
> >>
> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381)
> >> >       at
> >>
> org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:362)
> >> >Caused by: java.lang.NullPointerException
> >> >       at java.io.StringReader.<init>(StringReader.java:33)
> >> >       at
> >>
> org.apache.solr.handler.dataimport.HTMLStripTransformer.stripHTML(HTMLStripTransformer.java:71)
> >> >       at
> >>
> org.apache.solr.handler.dataimport.HTMLStripTransformer.transformRow(HTMLStripTransformer.java:54)
> >> >       at
> >>
> org.apache.solr.handler.dataimport.EntityProcessorBase.applyTransformer(EntityProcessorBase.java:187)
> >> >       ... 9 more
> >> >Jan 19, 2009 11:14:06 AM
> org.apache.solr.handler.dataimport.DataImporter
> >> doFullImport
> >> >SEVERE: Full Import failed
> >> >org.apache.solr.handler.dataimport.DataImportHandlerException:
> >> java.lang.NullPointerException
> >> >       at
> >>
> org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:64)
> >> >       at
> >>
> org.apache.solr.handler.dataimport.EntityProcessorBase.applyTransformer(EntityProcessorBase.java:203)
> >> >       at
> >>
> org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:197)
> >> >       at
> >>
> org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
> >> >       at
> >>
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313)
> >> >       at
> >>
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339)
> >> >       at
> >>
> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202)
> >> >       at
> >>
> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147)
> >> >       at
> >>
> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:321)
> >> >       at
> >>
> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381)
> >> >       at
> >>
> org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:362)
> >> >Caused by: java.lang.NullPointerException
> >> >       at java.io.StringReader.<init>(StringReader.java:33)
> >> >       at
> >>
> org.apache.solr.handler.dataimport.HTMLStripTransformer.stripHTML(HTMLStripTransformer.java:71)
> >> >       at
> >>
> org.apache.solr.handler.dataimport.HTMLStripTransformer.transformRow(HTMLStripTransformer.java:54)
> >> >       at
> >>
> org.apache.solr.handler.dataimport.EntityProcessorBase.applyTransformer(EntityProcessorBase.java:187)
> >> >       ... 9 more
> >> >Jan 19, 2009 11:14:06 AM org.apache.solr.update.DirectUpdateHandler2
> >> rollback
> >> >INFO: start rollback
> >> >Jan 19, 2009 11:14:06 AM org.apache.solr.update.DirectUpdateHandler2
> >> rollback
> >> >INFO: end_rollback
> >> >
> >> >
> >> >>On Mon, Jan 19, 2009 at 4:14 PM, Fergus McMenemie <fer...@twig.me.uk>
> >> wrote:
> >> >>
> >> >>> Hello all,
> >> >>>
> >> >>> I have the following DIH data-config.xml file. Adding
> >> >>> HTMLStripTransformer and the associated stripHTML on the
> >> >>> para tag seems to have broken things. I am using a nightly
> >> >>> build from 12-jan-2009.
> >> >>>
> >> >>> The /record/sect1/para contains HTML sub tags which need
> >> >>> to be discarded. Is my use of stripHTML correct?
> >> >>>
> >> >>> <dataConfig>
> >> >>>  <dataSource name="myfilereader" type="FileDataSource"/>
> >> >>>  <document>
> >> >>>     <entity name="jcurrent"
> >> >>>        processor="FileListEntityProcessor"
> >> >>>        fileName=".*xml"
> >> >>>        newerThan="'NOW-1000DAYS'"
> >> >>>        recursive="true"
> >> >>>        rootEntity="false"
> >> >>>        dataSource="null"
> >> >>>        baseDir="/Volumes/spare/ts/jxml/data/news/groups">
> >> >>>
> >> >>>        <entity name="x"
> >> >>>           dataSource="myfilereader"
> >> >>>           processor="XPathEntityProcessor"
> >> >>>           url="${jcurrent.fileAbsolutePath}"
> >> >>>           stream="false"
> >> >>>           forEach="/record"
> >> >>>
> >> >>>
> >>
> transformer="DateFormatTransformer,TemplateTransformer,RegexTransformer,HTMLStripTransformer">
> >> >>>
> >> >>>           <field column="fileAbsPath"
> >> >>> template="${jcurrent.fileAbsolutePath}" />
> >> >>>           <field column="fileWebPath" regex="/Volumes/spare/ts/(.*)"
> >> >>> replaceWith="$1" sourceColName="fileAbsePath"/>
> >> >>>           <field column="title"    xpath="/record/title" />
> >> >>>           <field column="para"     xpath="/record/sect1/para"
> >> >>> stripHTML="true" />
> >> >>>           <field column="subject"
> >> >>>  xpath="/record/metadata/subje...@qualifier='fullTitle']"   />
> >> >>>           <field column="pubname"
> >> >>>  xpath="/record/metadata/subje...@qualifier='publication']" />
> >> >>>           <field column="pubdate"
> >> >>>  xpath="/record/metadata/da...@qualifier='pubDate']"
> >> >>> dateTimeFormat="yyyyMMdd"   />
> >> >>>           </entity>
> >> >>>        </entity>
> >> >>>     </document>
> >> >>>  </dataConfig>
> >> >>>
> >> >>> --
> >> >>>
> >--
> >Regards,
> >Shalin Shekhar Mangar.
>
> --
>
> ===============================================================
> Fergus McMenemie               Email:fer...@twig.me.uk
> Techmore Ltd                   Phone:(UK) 07721 376021
>
> Unix/Mac/Intranets             Analyst Programmer
> ===============================================================
>



-- 
Regards,
Shalin Shekhar Mangar.
