Thanks for your advice, Alexandre.

On 3 September 2015 at 20:29, Alexandre Rafalovitch <arafa...@gmail.com> wrote:
> Probably because your signatureField and your fields are the same! You
> need to point signatureField at a new (non-ID) field.
>
> You will still get duplicates, as you requested that in your other
> emails, but now you will be able to group on that new signature field.
>
> If you have any further problems, please start a new thread with a new
> subject, as the current question is no longer related.
>
> Regards,
>    Alex.
> ----
> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> http://www.solr-start.com/
>
>
> On 2 September 2015 at 22:21, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
> > Hi Alexandre,
> >
> > Thanks for pointing out the error. I'm able to get the documents
> > indexed after adding the two processors.
> >
> > However, I'm still seeing all the similar documents being returned in
> > the search results without being de-duplicated. My content is currently
> > indexed as fieldType=text_general.
> >
> > <updateRequestProcessorChain name="dedupe">
> >   <processor class="solr.processor.SignatureUpdateProcessorFactory">
> >     <bool name="enabled">true</bool>
> >     <str name="signatureField">content</str>
> >     <bool name="overwriteDupes">false</bool>
> >     <str name="fields">content</str>
> >     <str name="signatureClass">solr.processor.Lookup3Signature</str>
> >   </processor>
> >   <processor class="solr.LogUpdateProcessorFactory" />
> >   <processor class="solr.RunUpdateProcessorFactory" />
> > </updateRequestProcessorChain>
> >
> > Regards,
> > Edwin
> >
> >
> > On 3 September 2015 at 09:46, Alexandre Rafalovitch <arafa...@gmail.com> wrote:
> >> And that's because you have an incomplete chain. If you look at the
> >> full example in solrconfig.xml, it shows:
> >>
> >>   <updateRequestProcessorChain name="dedupe">
> >>     <processor class="solr.processor.SignatureUpdateProcessorFactory">
> >>       <bool name="enabled">true</bool>
> >>       <str name="signatureField">id</str>
> >>       <bool name="overwriteDupes">false</bool>
> >>       <str name="fields">name,features,cat</str>
> >>       <str name="signatureClass">solr.processor.Lookup3Signature</str>
> >>     </processor>
> >>     <processor class="solr.LogUpdateProcessorFactory" />
> >>     <processor class="solr.RunUpdateProcessorFactory" />
> >>   </updateRequestProcessorChain>
> >>
> >> Notice the last two processors. If you don't have those, nothing gets
> >> indexed. Your chain is missing them, for whatever reason. Try adding
> >> them back in, reloading the core and reindexing.
> >>
> >> Regards,
> >>    Alex.
> >> ----
> >> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> >> http://www.solr-start.com/
> >>
> >>
> >> On 2 September 2015 at 11:29, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
> >> > Hi Erick,
> >> >
> >> > Yes, I'm trying out the De-Duplication too. But I'm facing a problem
> >> > with that: the indexing stops working once I put the following
> >> > De-Duplication code in solrconfig.xml. The problem seems to be with
> >> > this <str name="update.chain">dedupe</str> line.
> >> >
> >> > <requestHandler name="/update" class="solr.UpdateRequestHandler">
> >> >   <lst name="defaults">
> >> >     <str name="update.chain">dedupe</str>
> >> >   </lst>
> >> > </requestHandler>
> >> >
> >> > <updateRequestProcessorChain name="dedupe">
> >> >   <processor class="solr.processor.SignatureUpdateProcessorFactory">
> >> >     <bool name="enabled">true</bool>
> >> >     <str name="signatureField">signature</str>
> >> >     <bool name="overwriteDupes">false</bool>
> >> >     <str name="fields">content</str>
> >> >     <str name="signatureClass">solr.processor.Lookup3Signature</str>
> >> >   </processor>
> >> > </updateRequestProcessorChain>
> >> >
> >> > Regards,
> >> > Edwin
> >> >
> >> > On 2 September 2015 at 23:10, Erick Erickson <erickerick...@gmail.com> wrote:
> >> >> Yes, that is an intentional limit for the size of a single token,
> >> >> which is what a string field's value is indexed as.
> >> >>
> >> >> Why not use deduplication? See:
> >> >> https://cwiki.apache.org/confluence/display/solr/De-Duplication
> >> >>
> >> >> You don't have to replace the existing documents, and Solr will
> >> >> compute a hash that can be used to identify identical documents,
> >> >> and you can use *that*.
> >> >>
> >> >> Best,
> >> >> Erick
> >> >>
> >> >> On Wed, Sep 2, 2015 at 2:53 AM, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
> >> >> > Hi,
> >> >> >
> >> >> > I would like to check: must a string field's value be at most
> >> >> > 32766 bytes in length?
> >> >> >
> >> >> > I'm trying to do a copyField of my rich-text documents' content to
> >> >> > a field with fieldType=string, to try getting distinct results for
> >> >> > content, as there are several documents with the exact same
> >> >> > content, and we only want to list one of them during searching.
> >> >> >
> >> >> > However, I get the following errors in some of the documents when
> >> >> > I tried to index them with the copyField. Some of my documents are
> >> >> > quite large in size, and there is a possibility that they exceed
> >> >> > 32766 bytes. Is there any other way to overcome this problem?
> >> >> >
> >> >> > org.apache.solr.common.SolrException: Exception writing document id
> >> >> > collection1_polymer100 to the index; possible analysis error.
> >> >> >   at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:167)
> >> >> >   at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
> >> >> >   at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
> >> >> >   at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:955)
> >> >> >   at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1110)
> >> >> >   at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:706)
> >> >> >   at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:104)
> >> >> >   at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
> >> >> >   at org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor.processAdd(LanguageIdentifierUpdateProcessor.java:207)
> >> >> >   at org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(ExtractingDocumentLoader.java:122)
> >> >> >   at org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(ExtractingDocumentLoader.java:127)
> >> >> >   at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:235)
> >> >> >   at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> >> >> >   at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)
> >> >> >   at org.apache.solr.core.SolrCore.execute(SolrCore.java:2064)
> >> >> >   at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:654)
> >> >> >   at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:450)
> >> >> >   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:227)
> >> >> >   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:196)
> >> >> >   at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
> >> >> >   at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
> >> >> >   at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
> >> >> >   at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
> >> >> >   at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
> >> >> >   at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
> >> >> >   at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
> >> >> >   at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
> >> >> >   at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
> >> >> >   at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
> >> >> >   at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
> >> >> >   at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
> >> >> >   at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
> >> >> >   at org.eclipse.jetty.server.Server.handle(Server.java:497)
> >> >> >   at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
> >> >> >   at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
> >> >> >   at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
> >> >> >   at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
> >> >> >   at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
> >> >> >   at java.lang.Thread.run(Thread.java:745)
> >> >> > Caused by: java.lang.IllegalArgumentException: Document contains at least
> >> >> > one immense term in field="signature" (whose UTF8 encoding is longer than
> >> >> > the max length 32766), all of which were skipped. Please correct the
> >> >> > analyzer to not produce such terms. The prefix of the first immense term
> >> >> > is: '[32, 60, 112, 62, 60, 98, 114, 62, 32, 32, 32, 60, 98, 114, 62, 56,
> >> >> > 48, 56, 32, 72, 97, 110, 100, 98, 111, 111, 107, 32, 111, 102]...',
> >> >> > original message: bytes can be at most 32766 in length; got 49960
> >> >> >   at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:670)
> >> >> >   at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:344)
> >> >> >   at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:300)
> >> >> >   at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:232)
> >> >> >   at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:458)
> >> >> >   at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1363)
> >> >> >   at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:239)
> >> >> >   at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:163)
> >> >> >   ... 38 more
> >> >> > Caused by: org.apache.lucene.util.BytesRefHash$MaxBytesLengthExceededException:
> >> >> > bytes can be at most 32766 in length; got 49960
> >> >> >   at org.apache.lucene.util.BytesRefHash.add(BytesRefHash.java:284)
> >> >> >   at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:154)
> >> >> >   at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:660)
> >> >> >   ... 45 more
> >> >> >
> >> >> > Regards,
> >> >> > Edwin
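
[Editor's note] The fix the thread converges on — hashing the content into a separate, non-ID signature field and grouping on it at query time — can be sketched outside Solr. This is an illustration only, not Solr's Lookup3Signature implementation: MD5 stands in for the hash (Solr also ships an MD5 signature class), the documents are invented, and the grouping is done client-side the way `group=true&group.field=signature` would do it on the server.

```python
import hashlib
from collections import defaultdict

MAX_TOKEN_BYTES = 32766  # Lucene's per-token limit that the trace above hits

def signature(doc, fields):
    # Concatenate the configured source fields and hash them, mirroring
    # <str name="fields">content</str> feeding the signatureField.
    # MD5 is used here purely for illustration, not Lookup3.
    joined = "|".join(str(doc.get(f, "")) for f in fields)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

docs = [
    {"id": "doc1", "content": "808 Handbook of Polymer Science"},
    {"id": "doc2", "content": "808 Handbook of Polymer Science"},  # duplicate content, distinct id
    {"id": "doc3", "content": "A completely different document"},
]

# With overwriteDupes=false every document is kept; each one just gains
# a short signature that exact duplicates share.
for doc in docs:
    doc["signature"] = signature(doc, ["content"])

# Grouping on the signature then yields one representative per distinct
# content, which is what de-duplicated search results need.
groups = defaultdict(list)
for doc in docs:
    groups[doc["signature"]].append(doc)
representatives = [group[0] for group in groups.values()]

print(len(representatives))  # 2 distinct contents
# Unlike the raw content, the hash always fits under the token limit.
print(all(len(d["signature"].encode("utf-8")) <= MAX_TOKEN_BYTES for d in docs))  # True
```

Because the signature is a fixed-length hash, it also sidesteps the immense-term error in the stack trace above, which was caused by copying the full document content into a single string-typed token.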