baseTokenizer is reset in the #reset method.

Sarita
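A likely culprit with wrapped tokenizers in Lucene 4.x: Solr installs a new Reader on the outer Tokenizer between documents, while the inner StandardTokenizer still holds the Reader it was constructed with (closed after the previous document), so resetting the inner tokenizer alone does not re-point it at the new input. A minimal, Lucene-free sketch of the pattern — all class and method names here are hypothetical stand-ins, not Lucene APIs:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Stand-in for the wrapped delegate (e.g. a StandardTokenizer).
class InnerScanner {
    private Reader in;

    InnerScanner(Reader in) { this.in = in; }

    // The crucial hook: lets the wrapper re-point the delegate at new input.
    void setReader(Reader in) { this.in = in; }

    // Reads one whitespace-free token, or null at end of input.
    String next() throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = in.read()) != -1 && !Character.isWhitespace(c)) {
            sb.append((char) c);
        }
        return sb.length() == 0 ? null : sb.toString();
    }
}

// Stand-in for the outer custom tokenizer.
public class WrapperDemo {
    private final InnerScanner inner;
    private Reader input; // replaced by the framework between documents

    public WrapperDemo(Reader r) {
        input = r;
        inner = new InnerScanner(r);
    }

    // The framework installs a fresh reader for the next document.
    public void setReader(Reader r) { input = r; }

    // Forward the (possibly new) reader to the delegate before reuse.
    // Omitting this line leaves the delegate on the stale reader.
    public void reset() { inner.setReader(input); }

    public String next() throws IOException { return inner.next(); }

    public static void main(String[] args) throws IOException {
        WrapperDemo t = new WrapperDemo(new StringReader("first"));
        t.reset();
        System.out.println(t.next());
        t.setReader(new StringReader("second"));
        t.reset(); // without the forwarding in reset(), this re-reads stale input
        System.out.println(t.next());
    }
}
```

The same idea in the Lucene 4.x API would be forwarding the outer tokenizer's `input` to the wrapped tokenizer inside `reset()`, before calling `baseTokenizer.reset()`.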
________________________________
From: Jack Krupansky <j...@basetechnology.com>
To: solr-user@lucene.apache.org
Sent: Sunday, May 5, 2013 1:37 PM
Subject: Re: custom tokenizer error

I didn't notice any call to the "reset" method for your base tokenizer.

Is there any reason that you didn't just use char filters to replace colons and periods with spaces?

-- Jack Krupansky

-----Original Message-----
From: Sarita Nair
Sent: Friday, May 03, 2013 2:43 PM
To: solr-user@lucene.apache.org
Subject: custom tokenizer error

I am using a custom Tokenizer, as part of an analysis chain, for a Solr (4.2.1) field. On trying to index, Solr throws a NullPointerException. The unit tests for the custom tokenizer work fine. Any ideas as to what it is that I am missing/doing incorrectly will be appreciated.

Here is the relevant schema.xml excerpt:

<fieldType name="negated" class="solr.TextField" omitNorms="true">
  <analyzer type="index">
    <tokenizer class="some.other.solr.analysis.EmbeddedPunctuationTokenizer$Factory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
  </analyzer>
</fieldType>

Here are the relevant pieces of the Tokenizer:

/**
 * Intercepts each token produced by {@link StandardTokenizer#incrementToken()}
 * and checks for the presence of a colon or period. If found, splits the token
 * on the punctuation mark and adjusts the term and offset attributes of the
 * underlying {@link TokenStream} to create additional tokens.
 */
public class EmbeddedPunctuationTokenizer extends Tokenizer {

    private static final Pattern PUNCTUATION_SYMBOLS = Pattern.compile("[:.]");

    private StandardTokenizer baseTokenizer;
    private CharTermAttribute termAttr;
    private OffsetAttribute offsetAttr;
    private /*@Nullable*/ String tokenAfterPunctuation = null;
    private int currentOffset = 0;

    public EmbeddedPunctuationTokenizer(final Reader reader) {
        super(reader);
        baseTokenizer = new StandardTokenizer(Version.MINIMUM_LUCENE_VERSION, reader);
        // Two TokenStreams are in play here: the one underlying the current
        // instance and the one underlying the StandardTokenizer. The attribute
        // instances must be associated with both.
        termAttr = baseTokenizer.addAttribute(CharTermAttribute.class);
        offsetAttr = baseTokenizer.addAttribute(OffsetAttribute.class);
        this.addAttributeImpl((CharTermAttributeImpl) termAttr);
        this.addAttributeImpl((OffsetAttributeImpl) offsetAttr);
    }

    @Override
    public void end() throws IOException {
        baseTokenizer.end();
        super.end();
    }

    @Override
    public void close() throws IOException {
        baseTokenizer.close();
        super.close();
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        baseTokenizer.reset();
        currentOffset = 0;
        tokenAfterPunctuation = null;
    }

    @Override
    public final boolean incrementToken() throws IOException {
        clearAttributes();
        if (tokenAfterPunctuation != null) {
            // Do not advance the underlying TokenStream if the previous call
            // found an embedded punctuation mark and set aside the substring
            // that follows it. Set the attributes instead from the substring,
            // bearing in mind that the substring could contain more embedded
            // punctuation marks.
            adjustAttributes(tokenAfterPunctuation);
        } else if (baseTokenizer.incrementToken()) {
            // No remaining substring from a token with embedded punctuation: save
            // the starting offset reported by the base tokenizer as the current
            // offset, then proceed with the analysis of the token it returned.
            currentOffset = offsetAttr.startOffset();
            adjustAttributes(termAttr.toString());
        } else {
            // No more tokens in the underlying token stream: return false.
            return false;
        }
        return true;
    }

    private void adjustAttributes(final String token) {
        Matcher m = PUNCTUATION_SYMBOLS.matcher(token);
        if (m.find()) {
            int index = m.start();
            offsetAttr.setOffset(currentOffset, currentOffset + index);
            termAttr.copyBuffer(token.toCharArray(), 0, index);
            tokenAfterPunctuation = token.substring(index + 1);
            // Given that the incoming token had an embedded punctuation mark,
            // the starting offset for the substring following the punctuation
            // mark will be 1 beyond the end of the current token, which is the
            // substring preceding the embedded punctuation mark.
            currentOffset = offsetAttr.endOffset() + 1;
        } else if (tokenAfterPunctuation != null) {
            // Last remaining substring following a previously detected embedded
            // punctuation mark: adjust attributes based on its values.
            int length = tokenAfterPunctuation.length();
            termAttr.copyBuffer(tokenAfterPunctuation.toCharArray(), 0, length);
            offsetAttr.setOffset(currentOffset, currentOffset + length);
            tokenAfterPunctuation = null;
        }
        // Implied else: neither is true, so the attributes from the base
        // tokenizer need no adjustments.
    }
}

Solr throws the following error, in the 'else if' block of #incrementToken:

2013-04-29 14:19:48,920 [http-thread-pool-8080(3)] ERROR org.apache.solr.core.SolrCore - java.lang.NullPointerException
    at org.apache.lucene.analysis.standard.StandardTokenizerImpl.zzRefill(StandardTokenizerImpl.java:923)
    at org.apache.lucene.analysis.standard.StandardTokenizerImpl.getNextToken(StandardTokenizerImpl.java:1133)
    at org.apache.lucene.analysis.standard.StandardTokenizer.incrementToken(StandardTokenizer.java:180)
    at some.other.solr.analysis.EmbeddedPunctuationTokenizer.incrementToken(EmbeddedPunctuationTokenizer.java:83)
    at org.apache.lucene.analysis.core.LowerCaseFilter.incrementToken(LowerCaseFilter.java:54)
    at org.apache.lucene.analysis.en.EnglishPossessiveFilter.incrementToken(EnglishPossessiveFilter.java:57)
    at org.apache.lucene.analysis.en.EnglishMinimalStemFilter.incrementToken(EnglishMinimalStemFilter.java:48)
    at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:102)
    at org.apache.lucene.index.DocFieldProcessor.processDocument(DocFieldProcessor.java:254)
    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:256)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:376)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1473)
    at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:201)
    at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
    at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:477)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:346)
    at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
    at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
    at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1817)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:639)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:345)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:256)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:217)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:279)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
    at org.apache.catalina.core.StandardPipeline.doInvoke(StandardPipeline.java:655)
    at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:595)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:161)
    at org.apache.catalina.connector.CoyoteAdapter.doService(CoyoteAdapter.java:331)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:231)
    at com.sun.enterprise.v3.services.impl.ContainerMapper$AdapterCallable.call(ContainerMapper.java:317)
    at com.sun.enterprise.v3.services.impl.ContainerMapper.service(ContainerMapper.java:195)
    at com.sun.grizzly.http.ProcessorTask.invokeAdapter(ProcessorTask.java:860)
    at com.sun.grizzly.http.ProcessorTask.doProcess(ProcessorTask.java:757)
    at com.sun.grizzly.http.ProcessorTask.process(ProcessorTask.java:1056)
    at com.sun.grizzly.http.DefaultProtocolFilter.execute(DefaultProtocolFilter.java:229)
    at com.sun.grizzly.DefaultProtocolChain.executeProtocolFilter(DefaultProtocolChain.java:137)
    at com.sun.grizzly.DefaultProtocolChain.execute(DefaultProtocolChain.java:104)
    at com.sun.grizzly.DefaultProtocolChain.execute(DefaultProtocolChain.java:90)
    at com.sun.grizzly.http.HttpProtocolChain.execute(HttpProtocolChain.java:79)
    at com.sun.grizzly.ProtocolChainContextTask.doCall(ProtocolChainContextTask.java:54)
    at com.sun.grizzly.SelectionKeyContextTask.call(SelectionKeyContextTask.java:59)
    at com.sun.grizzly.ContextTask.run(ContextTask.java:71)
    at com.sun.grizzly.util.AbstractThreadPool$Worker.doWork(AbstractThreadPool.java:532)
    at com.sun.grizzly.util.AbstractThreadPool$Worker.run(AbstractThreadPool.java:513)
    at java.lang.Thread.run(Thread.java:722)
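For reference, Jack's char-filter suggestion could be expressed without any custom code, by mapping colons and periods to spaces before the standard tokenizer runs. A sketch of the field type (assuming the plain StandardTokenizerFactory is an acceptable base; the type name is taken from the excerpt above):

```xml
<fieldType name="negated" class="solr.TextField" omitNorms="true">
  <analyzer type="index">
    <!-- Replace embedded colons and periods with spaces before tokenizing. -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="[:.]" replacement=" "/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
  </analyzer>
</fieldType>
```

Char filters run ahead of the tokenizer and maintain an offset correction map, so highlighting still points back into the original text — which is the bookkeeping the custom tokenizer is doing by hand.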