Hi Fergus,

It seems a field that it is expecting is missing from the XML.
<field column="fileAbsPath" template="${jcurrent.fileAbsolutePath}" />
<field column="fileWebPath" regex="/Volumes/spare/ts/(.*)" replaceWith="$1" sourceColName="*fileAbsePath*"/>

I guess "fileAbsePath" is a typo? Can you check if that is the cause?

On Wed, Jan 21, 2009 at 5:40 PM, Fergus McMenemie <fer...@twig.me.uk> wrote:

> Shalin
>
> Downloaded nightly for 21jan and tried DIH again. It's better but
> still broken. Dozens of embedded tags are stripped from documents
> but it now fails every few documents for no reason I can see. Manually
> removing embedded tags causes a given problem document to be indexed,
> only to have it fail on one of the next few documents. I think the
> problem is still in stripHTML.
>
> Here is the traceback.
>
> Jan 21, 2009 12:06:53 PM org.apache.catalina.startup.Catalina start
> INFO: Server startup in 3377 ms
> Jan 21, 2009 12:07:39 PM org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties
> INFO: Read dataimport.properties
> Jan 21, 2009 12:07:39 PM org.apache.solr.core.SolrCore execute
> INFO: [fdocs] webapp=/solr path=/dataimport params={command=full-import} status=0 QTime=13
> Jan 21, 2009 12:07:39 PM org.apache.solr.handler.dataimport.DataImporter doFullImport
> INFO: Starting Full Import
> Jan 21, 2009 12:07:39 PM org.apache.solr.update.DirectUpdateHandler2 deleteAll
> INFO: [fdocs] REMOVING ALL DOCUMENTS FROM INDEX
> Jan 21, 2009 12:07:39 PM org.apache.solr.core.SolrDeletionPolicy onInit
> INFO: SolrDeletionPolicy.onInit: commits:num=2
>     commit{dir=/Volumes/spare/ts/solrnightlyf/data/index,segFN=segments_1,version=1232539612130,generation=1,filenames=[segments_1]
>     commit{dir=/Volumes/spare/ts/solrnightlyf/data/index,segFN=segments_2,version=1232539612131,generation=2,filenames=[segments_2]
> Jan 21, 2009 12:07:39 PM org.apache.solr.core.SolrDeletionPolicy updateCommits
> INFO: last commit = 1232539612131
> Jan 21, 2009 12:07:40 PM org.apache.solr.handler.dataimport.DocBuilder buildDocument
> SEVERE: Exception while processing: jc document : null
> org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing failed for xml, url:/Volumes/spare/ts/ftic/groups/j0036.xmlrows processed :0 Processing Document # 9
>     at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
>     at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:252)
>     at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:177)
>     at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
>     at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313)
>     at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339)
>     at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202)
>     at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147)
>     at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:321)
>     at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381)
>     at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:362)
> Caused by: java.lang.RuntimeException: java.util.NoSuchElementException
>     at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:85)
>     at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:242)
>     ... 9 more
> Caused by: java.util.NoSuchElementException
>     at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1083)
>     at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:141)
>     at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:174)
>     at org.apache.solr.handler.dataimport.XPathRecordReader$Node.access$000(XPathRecordReader.java:89)
>     at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:82)
>     ... 10 more
> Jan 21, 2009 12:07:40 PM org.apache.solr.handler.dataimport.DataImporter doFullImport
> SEVERE: Full Import failed
> org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing failed for xml, url:/Volumes/spare/ts/ftic/groups/j0036.xmlrows processed :0 Processing Document # 9
>     at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
>     at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:252)
>     at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:177)
>     at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
>     at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313)
>     at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339)
>     at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202)
>     at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147)
>     at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:321)
>     at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381)
>     at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:362)
> Caused by: java.lang.RuntimeException: java.util.NoSuchElementException
>     at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:85)
>     at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:242)
>     ... 9 more
> Caused by: java.util.NoSuchElementException
>     at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1083)
>     at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:141)
>     at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:174)
>     at org.apache.solr.handler.dataimport.XPathRecordReader$Node.access$000(XPathRecordReader.java:89)
>     at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:82)
>     ... 10 more
> Jan 21, 2009 12:07:40 PM org.apache.solr.update.DirectUpdateHandler2 rollback
> INFO: start rollback
>
> >Ah, it needs a null check for multi-valued fields. I've committed a fix to
> >trunk. The next nightly build should have it. You can check out and build
> >from the trunk if you need this immediately.
> >
> >On Mon, Jan 19, 2009 at 7:02 PM, Fergus McMenemie <fer...@twig.me.uk> wrote:
> >
> >> Hmmm,
> >>
> >> Just to clarify I retested the thing using the nightly as of today
> >> 18-jan-2009. The problem is still there and this traceback is from
> >> that nightly.
> >>
> >> >>This looks fine. Can you post the stack trace?
> >> >>
> >> >Yep, here is the juicy bit. Let me know if you need more.
> >> >
> >> >Jan 19, 2009 11:08:03 AM org.apache.catalina.startup.Catalina start
> >> >INFO: Server startup in 2390 ms
> >> >Jan 19, 2009 11:14:06 AM org.apache.solr.core.SolrCore execute
> >> >INFO: [janesdocs] webapp=/solr path=/dataimport params={command=full-import} status=0 QTime=12
> >> >Jan 19, 2009 11:14:06 AM org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties
> >> >INFO: Read dataimport.properties
> >> >Jan 19, 2009 11:14:06 AM org.apache.solr.handler.dataimport.DataImporter doFullImport
> >> >INFO: Starting Full Import
> >> >Jan 19, 2009 11:14:06 AM org.apache.solr.update.DirectUpdateHandler2 deleteAll
> >> >INFO: [janesdocs] REMOVING ALL DOCUMENTS FROM INDEX
> >> >Jan 19, 2009 11:14:06 AM org.apache.solr.core.SolrDeletionPolicy onInit
> >> >INFO: SolrDeletionPolicy.onInit: commits:num=2
> >> >    commit{dir=/Volumes/spare/ts/solrnightlyjanes/data/index,segFN=segments_1,version=1232363283058,generation=1,filenames=[segments_1]
> >> >    commit{dir=/Volumes/spare/ts/solrnightlyjanes/data/index,segFN=segments_2,version=1232363283059,generation=2,filenames=[segments_2]
> >> >Jan 19, 2009 11:14:06 AM org.apache.solr.core.SolrDeletionPolicy updateCommits
> >> >INFO: last commit = 1232363283059
> >> >Jan 19, 2009 11:14:06 AM org.apache.solr.handler.dataimport.EntityProcessorBase applyTransformer
> >> >WARNING: transformer threw error
> >> >java.lang.NullPointerException
> >> >    at java.io.StringReader.<init>(StringReader.java:33)
> >> >    at org.apache.solr.handler.dataimport.HTMLStripTransformer.stripHTML(HTMLStripTransformer.java:71)
> >> >    at org.apache.solr.handler.dataimport.HTMLStripTransformer.transformRow(HTMLStripTransformer.java:54)
> >> >    at org.apache.solr.handler.dataimport.EntityProcessorBase.applyTransformer(EntityProcessorBase.java:187)
> >> >    at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:197)
> >> >    at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
> >> >    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313)
> >> >    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339)
> >> >    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202)
> >> >    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147)
> >> >    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:321)
> >> >    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381)
> >> >    at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:362)
> >> >Jan 19, 2009 11:14:06 AM org.apache.solr.handler.dataimport.DocBuilder buildDocument
> >> >SEVERE: Exception while processing: janescurrent document : null
> >> >org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NullPointerException
> >> >    at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:64)
> >> >    at org.apache.solr.handler.dataimport.EntityProcessorBase.applyTransformer(EntityProcessorBase.java:203)
> >> >    at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:197)
> >> >    at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
> >> >    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313)
> >> >    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339)
> >> >    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202)
> >> >    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147)
> >> >    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:321)
> >> >    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381)
> >> >    at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:362)
> >> >Caused by: java.lang.NullPointerException
> >> >    at java.io.StringReader.<init>(StringReader.java:33)
> >> >    at org.apache.solr.handler.dataimport.HTMLStripTransformer.stripHTML(HTMLStripTransformer.java:71)
> >> >    at org.apache.solr.handler.dataimport.HTMLStripTransformer.transformRow(HTMLStripTransformer.java:54)
> >> >    at org.apache.solr.handler.dataimport.EntityProcessorBase.applyTransformer(EntityProcessorBase.java:187)
> >> >    ... 9 more
> >> >Jan 19, 2009 11:14:06 AM org.apache.solr.handler.dataimport.DataImporter doFullImport
> >> >SEVERE: Full Import failed
> >> >org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NullPointerException
> >> >    at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:64)
> >> >    at org.apache.solr.handler.dataimport.EntityProcessorBase.applyTransformer(EntityProcessorBase.java:203)
> >> >    at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:197)
> >> >    at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
> >> >    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313)
> >> >    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339)
> >> >    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202)
> >> >    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147)
> >> >    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:321)
> >> >    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381)
> >> >    at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:362)
> >> >Caused by: java.lang.NullPointerException
> >> >    at java.io.StringReader.<init>(StringReader.java:33)
> >> >    at org.apache.solr.handler.dataimport.HTMLStripTransformer.stripHTML(HTMLStripTransformer.java:71)
> >> >    at org.apache.solr.handler.dataimport.HTMLStripTransformer.transformRow(HTMLStripTransformer.java:54)
> >> >    at org.apache.solr.handler.dataimport.EntityProcessorBase.applyTransformer(EntityProcessorBase.java:187)
> >> >    ... 9 more
> >> >Jan 19, 2009 11:14:06 AM org.apache.solr.update.DirectUpdateHandler2 rollback
> >> >INFO: start rollback
> >> >Jan 19, 2009 11:14:06 AM org.apache.solr.update.DirectUpdateHandler2 rollback
> >> >INFO: end_rollback
> >> >
> >> >
> >> >>On Mon, Jan 19, 2009 at 4:14 PM, Fergus McMenemie <fer...@twig.me.uk> wrote:
> >> >>
> >> >>> Hello all,
> >> >>>
> >> >>> I have the following DIH data-config.xml file. Adding
> >> >>> HTMLStripTransformer and the associated stripHTML on the
> >> >>> para tag seems to have broken things. I am using a nightly
> >> >>> build from 12-jan-2009.
> >> >>>
> >> >>> The /record/sect1/para contains HTML sub tags which need
> >> >>> to be discarded. Is my use of stripHTML correct?
> >> >>>
> >> >>> <dataConfig>
> >> >>>   <dataSource name="myfilereader" type="FileDataSource"/>
> >> >>>   <document>
> >> >>>     <entity name="jcurrent"
> >> >>>             processor="FileListEntityProcessor"
> >> >>>             fileName=".*xml"
> >> >>>             newerThan="'NOW-1000DAYS'"
> >> >>>             recursive="true"
> >> >>>             rootEntity="false"
> >> >>>             dataSource="null"
> >> >>>             baseDir="/Volumes/spare/ts/jxml/data/news/groups">
> >> >>>
> >> >>>       <entity name="x"
> >> >>>               dataSource="myfilereader"
> >> >>>               processor="XPathEntityProcessor"
> >> >>>               url="${jcurrent.fileAbsolutePath}"
> >> >>>               stream="false"
> >> >>>               forEach="/record"
> >> >>>               transformer="DateFormatTransformer,TemplateTransformer,RegexTransformer,HTMLStripTransformer">
> >> >>>
> >> >>>         <field column="fileAbsPath"
> >> >>>                template="${jcurrent.fileAbsolutePath}" />
> >> >>>         <field column="fileWebPath" regex="/Volumes/spare/ts/(.*)"
> >> >>>                replaceWith="$1" sourceColName="fileAbsePath"/>
> >> >>>         <field column="title" xpath="/record/title" />
> >> >>>         <field column="para" xpath="/record/sect1/para"
> >> >>>                stripHTML="true" />
> >> >>>         <field column="subject"
> >> >>>                xpath="/record/metadata/subje...@qualifier='fullTitle']" />
> >> >>>         <field column="pubname"
> >> >>>                xpath="/record/metadata/subje...@qualifier='publication']" />
> >> >>>         <field column="pubdate"
> >> >>>                xpath="/record/metadata/da...@qualifier='pubDate']"
> >> >>>                dateTimeFormat="yyyyMMdd" />
> >> >>>       </entity>
> >> >>>     </entity>
> >> >>>   </document>
> >> >>> </dataConfig>
> >> >>>
> >> >>> --
> >> >>>
> >--
> >Regards,
> >Shalin Shekhar Mangar.
>
> --
>
> ===============================================================
> Fergus McMenemie               Email:fer...@twig.me.uk
> Techmore Ltd                   Phone:(UK) 07721 376021
>
> Unix/Mac/Intranets             Analyst Programmer
> ===============================================================
>

--
Regards,
Shalin Shekhar Mangar.
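
P.S. If the "fileAbsePath" typo is indeed the cause, the two fields from the quoted data-config.xml would presumably look like the sketch below. The only change is the sourceColName value, which now names the fileAbsPath column defined on the line before it; everything else is copied from the config as posted, and this has not been tested against the nightly build.

```xml
<!-- Column populated by TemplateTransformer from the outer "jcurrent" entity -->
<field column="fileAbsPath"
       template="${jcurrent.fileAbsolutePath}" />

<!-- Assumed fix: sourceColName refers to "fileAbsPath" rather than "fileAbsePath",
     so RegexTransformer reads a column that actually exists in the row -->
<field column="fileWebPath" regex="/Volumes/spare/ts/(.*)"
       replaceWith="$1" sourceColName="fileAbsPath" />
```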