Hi Erick,

Yes, I'm trying out the De-Duplication too, but I'm running into a problem with it: indexing stops working once I put the following De-Duplication code into solrconfig.xml. The problem seems to be with the <str name="update.chain">dedupe</str> line.
<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">dedupe</str>
  </lst>
</requestHandler>

<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">content</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
</updateRequestProcessorChain>
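Comparing this with the example on the De-Duplication wiki page, I notice
the chain there ends with LogUpdateProcessorFactory and
RunUpdateProcessorFactory, both of which my chain is missing. As I
understand it, without RunUpdateProcessorFactory the documents never
actually reach the index, which would explain why indexing stops. Here is
a sketch of what I plan to try, untested, and assuming an indexed string
field named "signature" exists in schema.xml:

<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">content</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <!-- the wiki example ends the chain with these two processors;
       RunUpdateProcessorFactory is what actually writes the document -->
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

<!-- assumed schema.xml entry for the field the signature is written to -->
<field name="signature" type="string" indexed="true" stored="true" multiValued="false" />

If that works, I'm hoping to list only one document per distinct content
by grouping on the hash, e.g. with &group=true&group.field=signature on
the query.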
Regards,
Edwin

On 2 September 2015 at 23:10, Erick Erickson <erickerick...@gmail.com> wrote:
> Yes, that is an intentional limit on the size of a single token, which
> is what a string field is indexed as.
>
> Why not use deduplication? See:
> https://cwiki.apache.org/confluence/display/solr/De-Duplication
>
> You don't have to replace the existing documents; Solr will compute a
> hash that can be used to identify identical documents, and you can use
> _that_.
>
> Best,
> Erick
>
> On Wed, Sep 2, 2015 at 2:53 AM, Zheng Lin Edwin Yeo
> <edwinye...@gmail.com> wrote:
> > Hi,
> >
> > I would like to check: must a string field be at most 32766 bytes in
> > length?
> >
> > I'm trying to do a copyField of my rich-text documents' content to a
> > field with fieldType=string, to get distinct results for content, as
> > there are several documents with exactly the same content and we only
> > want to list one of them during searching.
> >
> > However, I get the following errors for some of the documents when I
> > try to index them with the copyField. Some of my documents are quite
> > large, and it is possible that they exceed 32766 bytes. Is there any
> > other way to overcome this problem?
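> >
> > One workaround I'm considering, though I haven't tested it: copyField
> > accepts a maxChars attribute, so the copy could be capped below the
> > term limit ("content_str" and the 30000 cap are just placeholders):
> >
> >   <field name="content_str" type="string" indexed="true" stored="false" />
> >   <copyField source="content" dest="content_str" maxChars="30000" />
> >
> > That should keep most documents under the 32766-byte limit (multi-byte
> > characters could still push some over), but two documents that differ
> > only after the cutoff would then wrongly be treated as identical.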
> >
> > org.apache.solr.common.SolrException: Exception writing document id
> > collection1_polymer100 to the index; possible analysis error.
> >     at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:167)
> >     at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
> >     at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
> >     at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:955)
> >     at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1110)
> >     at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:706)
> >     at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:104)
> >     at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
> >     at org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor.processAdd(LanguageIdentifierUpdateProcessor.java:207)
> >     at org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(ExtractingDocumentLoader.java:122)
> >     at org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(ExtractingDocumentLoader.java:127)
> >     at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:235)
> >     at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> >     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)
> >     at org.apache.solr.core.SolrCore.execute(SolrCore.java:2064)
> >     at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:654)
> >     at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:450)
> >     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:227)
> >     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:196)
> >     at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
> >     at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
> >     at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
> >     at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
> >     at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
> >     at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
> >     at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
> >     at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
> >     at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
> >     at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
> >     at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
> >     at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
> >     at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
> >     at org.eclipse.jetty.server.Server.handle(Server.java:497)
> >     at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
> >     at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
> >     at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
> >     at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
> >     at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
> >     at java.lang.Thread.run(Thread.java:745)
> > Caused by: java.lang.IllegalArgumentException: Document contains at least
> > one immense term in field="signature" (whose UTF8 encoding is longer than
> > the max length 32766), all of which were skipped. Please correct the
> > analyzer to not produce such terms. The prefix of the first immense term
> > is: '[32, 60, 112, 62, 60, 98, 114, 62, 32, 32, 32, 60, 98, 114, 62, 56,
> > 48, 56, 32, 72, 97, 110, 100, 98, 111, 111, 107, 32, 111, 102]...',
> > original message: bytes can be at most 32766 in length; got 49960
> >     at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:670)
> >     at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:344)
> >     at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:300)
> >     at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:232)
> >     at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:458)
> >     at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1363)
> >     at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:239)
> >     at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:163)
> >     ... 38 more
> > Caused by: org.apache.lucene.util.BytesRefHash$MaxBytesLengthExceededException:
> > bytes can be at most 32766 in length; got 49960
> >     at org.apache.lucene.util.BytesRefHash.add(BytesRefHash.java:284)
> >     at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:154)
> >     at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:660)
> >     ... 45 more
> >
> > Regards,
> > Edwin