Was this one ever addressed? I'm seeing it in some small percentage of the documents that I index in 1.4-dev 708596M. I don't see a corresponding JIRA issue.
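For reference, the message at the top of the trace comes from java.io.BufferedReader itself: reset() throws IOException("Mark invalid") once the reader has had to refill its buffer after more characters than the readAheadLimit given to mark() were consumed. The sketch below reproduces that outside Solr. It is not HTMLStripReader's code; the class name, the sample input, and the deliberately tiny 4-char buffer are assumptions made only to trigger the condition quickly.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Hypothetical demo class, not part of Solr or Lucene.
public class MarkInvalidDemo {

    // Marks the stream with the given limit, reads charsToRead characters,
    // then tries to reset. Returns "ok" or the IOException message.
    public static String tryReset(int readAheadLimit, int charsToRead) {
        String html = "<p>some &amp; text</p> more text";
        // A 4-char buffer forces refills, which is where the mark gets invalidated.
        try (BufferedReader reader = new BufferedReader(new StringReader(html), 4)) {
            reader.mark(readAheadLimit);
            for (int i = 0; i < charsToRead; i++) {
                reader.read();
            }
            reader.reset();
            return "ok";
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Limit covers everything read since the mark: reset succeeds.
        System.out.println(tryReset(16, 8));
        // Read well past the limit: reset fails with "Mark invalid".
        System.out.println(tryReset(2, 8));
    }
}
```

If the same thing is happening inside HTMLStripReader, that would fit the failure showing up only for some documents: only inputs that make the reader look ahead further than its mark allows would hit it.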
James Brady-3 wrote:
>
> Hi,
> I'm seeing a problem mentioned in Solr-42, Highlighting problems with
> HTMLStripWhitespaceTokenizerFactory:
> https://issues.apache.org/jira/browse/SOLR-42
>
> I'm indexing HTML documents, and am getting reams of "Mark invalid"
> IOExceptions:
> SEVERE: java.io.IOException: Mark invalid
>     at java.io.BufferedReader.reset(Unknown Source)
>     at org.apache.solr.analysis.HTMLStripReader.restoreState(HTMLStripReader.java:171)
>     at org.apache.solr.analysis.HTMLStripReader.read(HTMLStripReader.java:728)
>     at org.apache.solr.analysis.HTMLStripReader.read(HTMLStripReader.java:742)
>     at java.io.Reader.read(Unknown Source)
>     at org.apache.lucene.analysis.CharTokenizer.next(CharTokenizer.java:56)
>     at org.apache.lucene.analysis.StopFilter.next(StopFilter.java:118)
>     at org.apache.solr.analysis.WordDelimiterFilter.next(WordDelimiterFilter.java:249)
>     at org.apache.lucene.analysis.LowerCaseFilter.next(LowerCaseFilter.java:33)
>     at org.apache.solr.analysis.EnglishPorterFilter.next(EnglishPorterFilterFactory.java:92)
>     at org.apache.lucene.analysis.TokenStream.next(TokenStream.java:45)
>     at org.apache.solr.analysis.BufferedTokenStream.read(BufferedTokenStream.java:94)
>     at org.apache.solr.analysis.RemoveDuplicatesTokenFilter.process(RemoveDuplicatesTokenFilter.java:33)
>     at org.apache.solr.analysis.BufferedTokenStream.next(BufferedTokenStream.java:82)
>     at org.apache.lucene.analysis.TokenStream.next(TokenStream.java:79)
>     at org.apache.lucene.index.DocumentsWriter$ThreadState$FieldData.invertField(DocumentsWriter.java:1518)
>     at org.apache.lucene.index.DocumentsWriter$ThreadState$FieldData.processField(DocumentsWriter.java:1407)
>     at org.apache.lucene.index.DocumentsWriter$ThreadState.processDocument(DocumentsWriter.java:1116)
>     at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:2440)
>     at org.apache.lucene.index.DocumentsWriter.addDocument(DocumentsWriter.java:2422)
>     at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1445)
>
> This is using a ~1 week old version of Solr 1.3 from SVN.
>
> One workaround mentioned in that Jira issue was to move HTML stripping
> outside of Solr; can anyone suggest a better approach than that?
>
> Thanks
> James
>

--
View this message in context: http://www.nabble.com/IOException%3A-Mark-invalid-while-analyzing-HTML-tp17052153p20859862.html
Sent from the Solr - User mailing list archive at Nabble.com.