Thanks for your advice, Alexandre.

On 3 September 2015 at 20:29, Alexandre Rafalovitch <arafa...@gmail.com> wrote:
> Probably because your signatureField and your fields are the same! You
> need to point signatureField at a new (non-ID) field.
>
> You will still get duplicates, as you requested that in your other
> emails, but now you will be able to group on that new signature field.
>
> If you have any further problems, please start a new thread with a new
> subject, as the current question is no longer related.
>
> Regards,
>    Alex.
> ----
> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> http://www.solr-start.com/
>
>
> On 2 September 2015 at 22:21, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
> > Hi Alexandre,
> >
> > Thanks for pointing out the error. I'm able to get the documents
> > indexed after adding the two processors.
> >
> > However, I'm still seeing all the similar documents being returned in
> > the search results without being de-duplicated. My content is currently
> > indexed as fieldType=text_general.
> >
> > <updateRequestProcessorChain name="dedupe">
> >   <processor class="solr.processor.SignatureUpdateProcessorFactory">
> >     <bool name="enabled">true</bool>
> >     <str name="signatureField">content</str>
> >     <bool name="overwriteDupes">false</bool>
> >     <str name="fields">content</str>
> >     <str name="signatureClass">solr.processor.Lookup3Signature</str>
> >   </processor>
> >   <processor class="solr.LogUpdateProcessorFactory" />
> >   <processor class="solr.RunUpdateProcessorFactory" />
> > </updateRequestProcessorChain>
> >
> > Regards,
> > Edwin
> >
> >
> > On 3 September 2015 at 09:46, Alexandre Rafalovitch <arafa...@gmail.com> wrote:
> >> And that's because you have an incomplete chain. If you look at the
> >> full example in solrconfig.xml, it shows:
> >>
> >>   <updateRequestProcessorChain name="dedupe">
> >>     <processor class="solr.processor.SignatureUpdateProcessorFactory">
> >>       <bool name="enabled">true</bool>
> >>       <str name="signatureField">id</str>
> >>       <bool name="overwriteDupes">false</bool>
> >>       <str name="fields">name,features,cat</str>
> >>       <str name="signatureClass">solr.processor.Lookup3Signature</str>
> >>     </processor>
> >>     <processor class="solr.LogUpdateProcessorFactory" />
> >>     <processor class="solr.RunUpdateProcessorFactory" />
> >>   </updateRequestProcessorChain>
> >>
> >> Notice the last two processors. If you don't have those, nothing gets
> >> indexed. Your chain is missing them, for whatever reason. Try adding
> >> them back in, reloading the core and reindexing.
> >>
> >> Regards,
> >>    Alex.
> >> ----
> >> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> >> http://www.solr-start.com/
> >>
> >>
> >> On 2 September 2015 at 11:29, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
> >> > Hi Erick,
> >> >
> >> > Yes, I'm trying out the De-Duplication too. But I'm facing a problem
> >> > with that: the indexing stops working once I put the following
> >> > De-Duplication code in solrconfig.xml. The problem seems to be with
> >> > this <str name="update.chain">dedupe</str> line.
> >> >
> >> > <requestHandler name="/update" class="solr.UpdateRequestHandler">
> >> >   <lst name="defaults">
> >> >     <str name="update.chain">dedupe</str>
> >> >   </lst>
> >> > </requestHandler>
> >> >
> >> > <updateRequestProcessorChain name="dedupe">
> >> >   <processor class="solr.processor.SignatureUpdateProcessorFactory">
> >> >     <bool name="enabled">true</bool>
> >> >     <str name="signatureField">signature</str>
> >> >     <bool name="overwriteDupes">false</bool>
> >> >     <str name="fields">content</str>
> >> >     <str name="signatureClass">solr.processor.Lookup3Signature</str>
> >> >   </processor>
> >> > </updateRequestProcessorChain>
> >> >
> >> > Regards,
> >> > Edwin
> >> >
> >> > On 2 September 2015 at 23:10, Erick Erickson <erickerick...@gmail.com> wrote:
> >> >> Yes, that is an intentional limit for the size of a single token,
> >> >> which is what a string field's value is indexed as.
> >> >>
> >> >> Why not use deduplication? See:
> >> >> https://cwiki.apache.org/confluence/display/solr/De-Duplication
> >> >>
> >> >> You don't have to replace the existing documents, and Solr will
> >> >> compute a hash that can be used to identify identical documents,
> >> >> and you can use *that*.
> >> >>
> >> >> Best,
> >> >> Erick
> >> >>
> >> >> On Wed, Sep 2, 2015 at 2:53 AM, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
> >> >> > Hi,
> >> >> >
> >> >> > I would like to check: must a string field's value be at most
> >> >> > 32766 bytes in length?
> >> >> >
> >> >> > I'm trying to do a copyField of my rich-text documents' content to
> >> >> > a field with fieldType=string, to try getting distinct results for
> >> >> > content, as there are several documents with the exact same
> >> >> > content, and we only want to list one of them during searching.
> >> >> >
> >> >> > However, I get the following errors in some of the documents when
> >> >> > I tried to index them with the copyField. Some of my documents are
> >> >> > quite large in size, and there is a possibility that they exceed
> >> >> > 32766 bytes. Is there any other way to overcome this problem?
> >> >> >
> >> >> > org.apache.solr.common.SolrException: Exception writing document id
> >> >> > collection1_polymer100 to the index; possible analysis error.
> >> >> >   at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:167)
> >> >> >   at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
> >> >> >   at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
> >> >> >   at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:955)
> >> >> >   at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1110)
> >> >> >   at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:706)
> >> >> >   at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:104)
> >> >> >   at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
> >> >> >   at org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor.processAdd(LanguageIdentifierUpdateProcessor.java:207)
> >> >> >   at org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(ExtractingDocumentLoader.java:122)
> >> >> >   at org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(ExtractingDocumentLoader.java:127)
> >> >> >   at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:235)
> >> >> >   at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> >> >> >   at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)
> >> >> >   at org.apache.solr.core.SolrCore.execute(SolrCore.java:2064)
> >> >> >   at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:654)
> >> >> >   at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:450)
> >> >> >   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:227)
> >> >> >   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:196)
> >> >> >   at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
> >> >> >   at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
> >> >> >   at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
> >> >> >   at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
> >> >> >   at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
> >> >> >   at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
> >> >> >   at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
> >> >> >   at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
> >> >> >   at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
> >> >> >   at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
> >> >> >   at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
> >> >> >   at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
> >> >> >   at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
> >> >> >   at org.eclipse.jetty.server.Server.handle(Server.java:497)
> >> >> >   at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
> >> >> >   at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
> >> >> >   at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
> >> >> >   at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
> >> >> >   at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
> >> >> >   at java.lang.Thread.run(Thread.java:745)
> >> >> > Caused by: java.lang.IllegalArgumentException: Document contains at least
> >> >> > one immense term in field="signature" (whose UTF8 encoding is longer than
> >> >> > the max length 32766), all of which were skipped. Please correct the
> >> >> > analyzer to not produce such terms. The prefix of the first immense term
> >> >> > is: '[32, 60, 112, 62, 60, 98, 114, 62, 32, 32, 32, 60, 98, 114, 62, 56,
> >> >> > 48, 56, 32, 72, 97, 110, 100, 98, 111, 111, 107, 32, 111, 102]...',
> >> >> > original message: bytes can be at most 32766 in length; got 49960
> >> >> >   at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:670)
> >> >> >   at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:344)
> >> >> >   at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:300)
> >> >> >   at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:232)
> >> >> >   at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:458)
> >> >> >   at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1363)
> >> >> >   at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:239)
> >> >> >   at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:163)
> >> >> >   ... 38 more
> >> >> > Caused by: org.apache.lucene.util.BytesRefHash$MaxBytesLengthExceededException:
> >> >> > bytes can be at most 32766 in length; got 49960
> >> >> >   at org.apache.lucene.util.BytesRefHash.add(BytesRefHash.java:284)
> >> >> >   at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:154)
> >> >> >   at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:660)
> >> >> >   ... 45 more
> >> >> >
> >> >> > Regards,
> >> >> > Edwin
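
[Editor's note] The fix the thread converges on — hashing the content into a separate, non-ID signature field and grouping on it at query time — can be sketched outside Solr. This is an illustration only, not Solr's Lookup3Signature implementation: MD5 stands in for the hash (Solr also ships an MD5 signature class), the documents are invented, and the grouping is done client-side the way `group=true&group.field=signature` would do it on the server.

```python
import hashlib
from collections import defaultdict

MAX_TOKEN_BYTES = 32766  # Lucene's per-token limit that the trace above hits

def signature(doc, fields):
    # Concatenate the configured source fields and hash them, mirroring
    # <str name="fields">content</str> feeding the signatureField.
    # MD5 is used here purely for illustration, not Lookup3.
    joined = "|".join(str(doc.get(f, "")) for f in fields)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

docs = [
    {"id": "doc1", "content": "808 Handbook of Polymer Science"},
    {"id": "doc2", "content": "808 Handbook of Polymer Science"},  # duplicate content, distinct id
    {"id": "doc3", "content": "A completely different document"},
]

# With overwriteDupes=false every document is kept; each one just gains
# a short signature that exact duplicates share.
for doc in docs:
    doc["signature"] = signature(doc, ["content"])

# Grouping on the signature then yields one representative per distinct
# content, which is what de-duplicated search results need.
groups = defaultdict(list)
for doc in docs:
    groups[doc["signature"]].append(doc)
representatives = [group[0] for group in groups.values()]

print(len(representatives))  # 2 distinct contents
# Unlike the raw content, the hash always fits under the token limit.
print(all(len(d["signature"].encode("utf-8")) <= MAX_TOKEN_BYTES for d in docs))  # True
```

Because the signature is a fixed-length hash, it also sidesteps the immense-term error in the stack trace above, which was caused by copying the full document content into a single string-typed token.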