Yes, that is an intentional limit on the size of a single indexed token, and a string field indexes its entire value as one token.
Why not use de-duplication instead? See:
https://cwiki.apache.org/confluence/display/solr/De-Duplication

You don't have to replace the existing documents; Solr will compute a hash
that can be used to identify identical documents, and you can use _that_.

Best,
Erick

On Wed, Sep 2, 2015 at 2:53 AM, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
> Hi,
>
> I would like to check: must a string field's value be at most 32766
> bytes in length?
>
> I'm trying to copyField the content of my rich-text documents to a field
> with fieldType=string, to get distinct results on content. Several
> documents have exactly the same content, and we only want to list one of
> them during searching.
>
> However, I get the following errors on some of the documents when I try
> to index them with the copyField in place. Some of my documents are quite
> large, so it is possible that they exceed 32766 bytes. Is there any other
> way to overcome this problem?
>
> org.apache.solr.common.SolrException: Exception writing document id
> collection1_polymer100 to the index; possible analysis error.
> 	at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:167)
> 	at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
> 	at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
> 	at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:955)
> 	at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1110)
> 	at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:706)
> 	at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:104)
> 	at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
> 	at org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor.processAdd(LanguageIdentifierUpdateProcessor.java:207)
> 	at org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(ExtractingDocumentLoader.java:122)
> 	at org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(ExtractingDocumentLoader.java:127)
> 	at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:235)
> 	at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> 	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)
> 	at org.apache.solr.core.SolrCore.execute(SolrCore.java:2064)
> 	at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:654)
> 	at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:450)
> 	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:227)
> 	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:196)
> 	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
> 	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
> 	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
> 	at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
> 	at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
> 	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
> 	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
> 	at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
> 	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
> 	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
> 	at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
> 	at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
> 	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
> 	at org.eclipse.jetty.server.Server.handle(Server.java:497)
> 	at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
> 	at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
> 	at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
> 	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
> 	at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
> 	at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.IllegalArgumentException: Document contains at least
> one immense term in field="signature" (whose UTF8 encoding is longer than
> the max length 32766), all of which were skipped. Please correct the
> analyzer to not produce such terms.
> The prefix of the first immense term is:
> '[32, 60, 112, 62, 60, 98, 114, 62, 32, 32, 32, 60, 98, 114, 62, 56,
> 48, 56, 32, 72, 97, 110, 100, 98, 111, 111, 107, 32, 111, 102]...',
> original message: bytes can be at most 32766 in length; got 49960
> 	at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:670)
> 	at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:344)
> 	at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:300)
> 	at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:232)
> 	at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:458)
> 	at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1363)
> 	at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:239)
> 	at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:163)
> 	... 38 more
> Caused by: org.apache.lucene.util.BytesRefHash$MaxBytesLengthExceededException: bytes
> can be at most 32766 in length; got 49960
> 	at org.apache.lucene.util.BytesRefHash.add(BytesRefHash.java:284)
> 	at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:154)
> 	at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:660)
> 	... 45 more
>
> Regards,
> Edwin
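For the archives, a de-duplication setup along the lines of the De-Duplication
page might look roughly like the sketch below. It replaces the raw copyField
with a SignatureUpdateProcessorFactory in solrconfig.xml: the processor hashes
the chosen fields into a short fixed-length signature, so the 32766-byte
single-term limit is never hit no matter how large the content is. The field
name "content" in the `fields` parameter is an assumption based on Edwin's
description, and chain/field names are illustrative:

```
<!-- solrconfig.xml: sketch of a dedupe update chain (names are examples) -->
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <!-- field that receives the computed hash -->
    <str name="signatureField">signature</str>
    <!-- false = keep duplicates in the index, just mark them with the hash -->
    <bool name="overwriteDupes">false</bool>
    <!-- assumed source field(s) to hash; comma-separated -->
    <str name="fields">content</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

<!-- schema.xml: the signature field only ever holds a short hash -->
<field name="signature" type="string" stored="true" indexed="true" multiValued="false" />
```

With overwriteDupes=false, documents sharing identical content also share the
same signature value, so at query time one can collapse them with result
grouping, e.g. &group=true&group.field=signature&group.limit=1, and list only
one document per distinct content.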