And that's because you have an incomplete chain. If you look at the full
example in solrconfig.xml, it shows:

<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">id</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">name,features,cat</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

Notice the last two processors. If you don't have those, nothing gets
indexed. Your chain is missing them, for whatever reason. Try adding them
back in, reloading the core, and reindexing.
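For reference, the chain you posted (quoted below) would then look something
like this. This is an untested sketch, and it assumes your schema already
defines a single-valued "signature" string field:

<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <!-- assumes a single-valued string field named "signature" exists in the schema -->
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">content</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <!-- the two processors that were missing from your chain -->
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

RunUpdateProcessorFactory is the processor that actually hands the document
to the index, which is why nothing gets indexed when it is left out.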
Once the chain is working and the signature field is being populated, you can
also use that hash at query time to return just one document per distinct
content, as Erick suggests below; there is a rough query sketch at the very
bottom of this message, after the quoted stack trace.

Regards,
   Alex.
----
Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 2 September 2015 at 11:29, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
> Hi Erick,
>
> Yes, I'm trying out the De-Duplication too. But I'm facing a problem with
> that: the indexing stops working once I put the following De-Duplication
> code in solrconfig.xml. The problem seems to be with the
> <str name="update.chain">dedupe</str> line.
>
> <requestHandler name="/update" class="solr.UpdateRequestHandler">
>   <lst name="defaults">
>     <str name="update.chain">dedupe</str>
>   </lst>
> </requestHandler>
>
> <updateRequestProcessorChain name="dedupe">
>   <processor class="solr.processor.SignatureUpdateProcessorFactory">
>     <bool name="enabled">true</bool>
>     <str name="signatureField">signature</str>
>     <bool name="overwriteDupes">false</bool>
>     <str name="fields">content</str>
>     <str name="signatureClass">solr.processor.Lookup3Signature</str>
>   </processor>
> </updateRequestProcessorChain>
>
> Regards,
> Edwin
>
> On 2 September 2015 at 23:10, Erick Erickson <erickerick...@gmail.com> wrote:
>> Yes, that is an intentional limit for the size of a single token,
>> which strings are.
>>
>> Why not use deduplication? See:
>> https://cwiki.apache.org/confluence/display/solr/De-Duplication
>>
>> You don't have to replace the existing documents, and Solr will
>> compute a hash that can be used to identify identical documents,
>> and you can use _that_.
>>
>> Best,
>> Erick
>>
>> On Wed, Sep 2, 2015 at 2:53 AM, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
>> > Hi,
>> >
>> > I would like to check: must a string field's bytes be at most 32766 in
>> > length?
>> >
>> > I'm trying to do a copyField of my rich-text documents' content to a
>> > field with fieldType=string, to try getting distinct results for
>> > content, as there are several documents with the exact same content and
>> > we only want to list one of them during searching.
>> >
>> > However, I get the following errors in some of the documents when I try
>> > to index them with the copyField. Some of my documents are quite large,
>> > and there is a possibility that they exceed 32766 bytes. Is there any
>> > other way to overcome this problem?
>> >
>> > org.apache.solr.common.SolrException: Exception writing document id
>> > collection1_polymer100 to the index; possible analysis error.
>> > at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:167)
>> > at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
>> > at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
>> > at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:955)
>> > at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1110)
>> > at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:706)
>> > at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:104)
>> > at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
>> > at org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor.processAdd(LanguageIdentifierUpdateProcessor.java:207)
>> > at org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(ExtractingDocumentLoader.java:122)
>> > at org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(ExtractingDocumentLoader.java:127)
>> > at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:235)
>> > at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>> > at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)
>> > at org.apache.solr.core.SolrCore.execute(SolrCore.java:2064)
>> > at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:654)
>> > at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:450)
>> > at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:227)
>> > at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:196)
>> > at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
>> > at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
>> > at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
>> > at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
>> > at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
>> > at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
>> > at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
>> > at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
>> > at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
>> > at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>> > at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
>> > at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
>> > at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
>> > at org.eclipse.jetty.server.Server.handle(Server.java:497)
>> > at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
>> > at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
>> > at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
>> > at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
>> > at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
>> > at java.lang.Thread.run(Thread.java:745)
>> > Caused by: java.lang.IllegalArgumentException: Document contains at least
>> > one immense term in field="signature" (whose UTF8 encoding is longer than
>> > the max length 32766), all of which were skipped. Please correct the
>> > analyzer to not produce such terms. The prefix of the first immense term
>> > is: '[32, 60, 112, 62, 60, 98, 114, 62, 32, 32, 32, 60, 98, 114, 62, 56,
>> > 48, 56, 32, 72, 97, 110, 100, 98, 111, 111, 107, 32, 111, 102]...',
>> > original message: bytes can be at most 32766 in length; got 49960
>> > at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:670)
>> > at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:344)
>> > at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:300)
>> > at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:232)
>> > at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:458)
>> > at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1363)
>> > at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:239)
>> > at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:163)
>> > ... 38 more
>> > Caused by: org.apache.lucene.util.BytesRefHash$MaxBytesLengthExceededException:
>> > bytes can be at most 32766 in length; got 49960
>> > at org.apache.lucene.util.BytesRefHash.add(BytesRefHash.java:284)
>> > at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:154)
>> > at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:660)
>> > ... 45 more
>> >
>> > Regards,
>> > Edwin
>>
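P.S. Here is the query-time sketch I mentioned above. It is untested, and it
assumes the signature field ends up as a single-valued, indexed string field;
the host and core name are placeholders, so adjust them to your setup:

http://localhost:8983/solr/collection1/select?q=*:*&group=true&group.field=signature&group.limit=1

That uses result grouping to return at most one document per distinct
signature value. The CollapsingQParserPlugin does the same job and tends to
behave better when there are many distinct groups; with it you would instead
add fq={!collapse field=signature} to your normal query.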