baseTokenizer is reset in the #reset method.

Sarita
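A likely culprit with wrapped tokenizers in Lucene 4.x: Solr installs a new Reader on the outer Tokenizer between documents, while the inner StandardTokenizer still holds the Reader it was constructed with (closed after the previous document), so resetting the inner tokenizer alone does not re-point it at the new input. A minimal, Lucene-free sketch of the pattern — all class and method names here are hypothetical stand-ins, not Lucene APIs:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Stand-in for the wrapped delegate (e.g. a StandardTokenizer).
class InnerScanner {
    private Reader in;

    InnerScanner(Reader in) { this.in = in; }

    // The crucial hook: lets the wrapper re-point the delegate at new input.
    void setReader(Reader in) { this.in = in; }

    // Reads one whitespace-free token, or null at end of input.
    String next() throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = in.read()) != -1 && !Character.isWhitespace(c)) {
            sb.append((char) c);
        }
        return sb.length() == 0 ? null : sb.toString();
    }
}

// Stand-in for the outer custom tokenizer.
public class WrapperDemo {
    private final InnerScanner inner;
    private Reader input; // replaced by the framework between documents

    public WrapperDemo(Reader r) {
        input = r;
        inner = new InnerScanner(r);
    }

    // The framework installs a fresh reader for the next document.
    public void setReader(Reader r) { input = r; }

    // Forward the (possibly new) reader to the delegate before reuse.
    // Omitting this line leaves the delegate on the stale reader.
    public void reset() { inner.setReader(input); }

    public String next() throws IOException { return inner.next(); }

    public static void main(String[] args) throws IOException {
        WrapperDemo t = new WrapperDemo(new StringReader("first"));
        t.reset();
        System.out.println(t.next());
        t.setReader(new StringReader("second"));
        t.reset(); // without the forwarding in reset(), this re-reads stale input
        System.out.println(t.next());
    }
}
```

The same idea in the Lucene 4.x API would be forwarding the outer tokenizer's `input` to the wrapped tokenizer inside `reset()`, before calling `baseTokenizer.reset()`.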
________________________________
From: Jack Krupansky <j...@basetechnology.com>
To: solr-user@lucene.apache.org
Sent: Sunday, May 5, 2013 1:37 PM
Subject: Re: custom tokenizer error

I didn't notice any call to the "reset" method for your base tokenizer.

Is there any reason that you didn't just use char filters to replace colons and periods with spaces?

-- Jack Krupansky

-----Original Message-----
From: Sarita Nair
Sent: Friday, May 03, 2013 2:43 PM
To: solr-user@lucene.apache.org
Subject: custom tokenizer error

I am using a custom Tokenizer, as part of an analysis chain, for a Solr (4.2.1) field. On trying to index, Solr throws a NullPointerException. The unit tests for the custom tokenizer work fine. Any ideas as to what it is that I am missing/doing incorrectly will be appreciated.

Here is the relevant schema.xml excerpt:

<fieldType name="negated" class="solr.TextField" omitNorms="true">
  <analyzer type="index">
    <tokenizer class="some.other.solr.analysis.EmbeddedPunctuationTokenizer$Factory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
  </analyzer>
</fieldType>

Here are the relevant pieces of the Tokenizer:

/**
 * Intercepts each token produced by {@link StandardTokenizer#incrementToken()}
 * and checks for the presence of a colon or period. If found, splits the token
 * on the punctuation mark and adjusts the term and offset attributes of the
 * underlying {@link TokenStream} to create additional tokens.
 */
public class EmbeddedPunctuationTokenizer extends Tokenizer {

    private static final Pattern PUNCTUATION_SYMBOLS = Pattern.compile("[:.]");

    private StandardTokenizer baseTokenizer;
    private CharTermAttribute termAttr;
    private OffsetAttribute offsetAttr;
    private /*@Nullable*/ String tokenAfterPunctuation = null;
    private int currentOffset = 0;

    public EmbeddedPunctuationTokenizer(final Reader reader) {
        super(reader);
        baseTokenizer = new StandardTokenizer(Version.MINIMUM_LUCENE_VERSION, reader);
        // Two TokenStreams are in play here: the one underlying the current
        // instance and the one underlying the StandardTokenizer. The attribute
        // instances must be associated with both.
        termAttr = baseTokenizer.addAttribute(CharTermAttribute.class);
        offsetAttr = baseTokenizer.addAttribute(OffsetAttribute.class);
        this.addAttributeImpl((CharTermAttributeImpl) termAttr);
        this.addAttributeImpl((OffsetAttributeImpl) offsetAttr);
    }

    @Override
    public void end() throws IOException {
        baseTokenizer.end();
        super.end();
    }

    @Override
    public void close() throws IOException {
        baseTokenizer.close();
        super.close();
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        baseTokenizer.reset();
        currentOffset = 0;
        tokenAfterPunctuation = null;
    }

    @Override
    public final boolean incrementToken() throws IOException {
        clearAttributes();
        if (tokenAfterPunctuation != null) {
            // Do not advance the underlying TokenStream if the previous call
            // found an embedded punctuation mark and set aside the substring
            // that follows it. Set the attributes instead from the substring,
            // bearing in mind that the substring could contain more embedded
            // punctuation marks.
            adjustAttributes(tokenAfterPunctuation);
        } else if (baseTokenizer.incrementToken()) {
            // No remaining substring from a token with embedded punctuation: save
            // the starting offset reported by the base tokenizer as the current
            // offset, then proceed with the analysis of the token it returned.
            currentOffset = offsetAttr.startOffset();
            adjustAttributes(termAttr.toString());
        } else {
            // No more tokens in the underlying token stream: return false.
            return false;
        }
        return true;
    }

    private void adjustAttributes(final String token) {
        Matcher m = PUNCTUATION_SYMBOLS.matcher(token);
        if (m.find()) {
            int index = m.start();
            offsetAttr.setOffset(currentOffset, currentOffset + index);
            termAttr.copyBuffer(token.toCharArray(), 0, index);
            tokenAfterPunctuation = token.substring(index + 1);
            // Given that the incoming token had an embedded punctuation mark,
            // the starting offset for the substring following the punctuation
            // mark will be 1 beyond the end of the current token, which is the
            // substring preceding the embedded punctuation mark.
            currentOffset = offsetAttr.endOffset() + 1;
        } else if (tokenAfterPunctuation != null) {
            // Last remaining substring following a previously detected embedded
            // punctuation mark: adjust attributes based on its values.
            int length = tokenAfterPunctuation.length();
            termAttr.copyBuffer(tokenAfterPunctuation.toCharArray(), 0, length);
            offsetAttr.setOffset(currentOffset, currentOffset + length);
            tokenAfterPunctuation = null;
        }
        // Implied else: neither is true, so the attributes from the base
        // tokenizer need no adjustments.
    }
}

Solr throws the following error, in the 'else if' block of #incrementToken:

2013-04-29 14:19:48,920 [http-thread-pool-8080(3)] ERROR org.apache.solr.core.SolrCore - java.lang.NullPointerException
    at org.apache.lucene.analysis.standard.StandardTokenizerImpl.zzRefill(StandardTokenizerImpl.java:923)
    at org.apache.lucene.analysis.standard.StandardTokenizerImpl.getNextToken(StandardTokenizerImpl.java:1133)
    at org.apache.lucene.analysis.standard.StandardTokenizer.incrementToken(StandardTokenizer.java:180)
    at some.other.solr.analysis.EmbeddedPunctuationTokenizer.incrementToken(EmbeddedPunctuationTokenizer.java:83)
    at org.apache.lucene.analysis.core.LowerCaseFilter.incrementToken(LowerCaseFilter.java:54)
    at org.apache.lucene.analysis.en.EnglishPossessiveFilter.incrementToken(EnglishPossessiveFilter.java:57)
    at org.apache.lucene.analysis.en.EnglishMinimalStemFilter.incrementToken(EnglishMinimalStemFilter.java:48)
    at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:102)
    at org.apache.lucene.index.DocFieldProcessor.processDocument(DocFieldProcessor.java:254)
    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:256)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:376)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1473)
    at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:201)
    at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
    at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:477)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:346)
    at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
    at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
    at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1817)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:639)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:345)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:256)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:217)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:279)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
    at org.apache.catalina.core.StandardPipeline.doInvoke(StandardPipeline.java:655)
    at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:595)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:161)
    at org.apache.catalina.connector.CoyoteAdapter.doService(CoyoteAdapter.java:331)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:231)
    at com.sun.enterprise.v3.services.impl.ContainerMapper$AdapterCallable.call(ContainerMapper.java:317)
    at com.sun.enterprise.v3.services.impl.ContainerMapper.service(ContainerMapper.java:195)
    at com.sun.grizzly.http.ProcessorTask.invokeAdapter(ProcessorTask.java:860)
    at com.sun.grizzly.http.ProcessorTask.doProcess(ProcessorTask.java:757)
    at com.sun.grizzly.http.ProcessorTask.process(ProcessorTask.java:1056)
    at com.sun.grizzly.http.DefaultProtocolFilter.execute(DefaultProtocolFilter.java:229)
    at com.sun.grizzly.DefaultProtocolChain.executeProtocolFilter(DefaultProtocolChain.java:137)
    at com.sun.grizzly.DefaultProtocolChain.execute(DefaultProtocolChain.java:104)
    at com.sun.grizzly.DefaultProtocolChain.execute(DefaultProtocolChain.java:90)
    at com.sun.grizzly.http.HttpProtocolChain.execute(HttpProtocolChain.java:79)
    at com.sun.grizzly.ProtocolChainContextTask.doCall(ProtocolChainContextTask.java:54)
    at com.sun.grizzly.SelectionKeyContextTask.call(SelectionKeyContextTask.java:59)
    at com.sun.grizzly.ContextTask.run(ContextTask.java:71)
    at com.sun.grizzly.util.AbstractThreadPool$Worker.doWork(AbstractThreadPool.java:532)
    at com.sun.grizzly.util.AbstractThreadPool$Worker.run(AbstractThreadPool.java:513)
    at java.lang.Thread.run(Thread.java:722)
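For reference, Jack's char-filter suggestion could be expressed without any custom code, by mapping colons and periods to spaces before the standard tokenizer runs. A sketch of the field type (assuming the plain StandardTokenizerFactory is an acceptable base; the type name is taken from the excerpt above):

```xml
<fieldType name="negated" class="solr.TextField" omitNorms="true">
  <analyzer type="index">
    <!-- Replace embedded colons and periods with spaces before tokenizing. -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="[:.]" replacement=" "/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
  </analyzer>
</fieldType>
```

Char filters run ahead of the tokenizer and maintain an offset correction map, so highlighting still points back into the original text — which is the bookkeeping the custom tokenizer is doing by hand.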