Context: Solr/Lucene 5.1
Adding documents to Solr core/index through SolrJ

I extract the PDFs using Tika. The extracted PDF content is one of the fields of the 
SolrInputDocuments that are sent to Solr via SolrJ.
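The indexing part of the client looks roughly like this (a simplified sketch; the core 
URL is a placeholder, the id format and the field content__s_i_suggest are the ones that 
show up in the exceptions below):

import java.io.File;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.Tika;

public class PdfIndexer {

    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();
        try (SolrClient solr = new HttpSolrClient("http://localhost:8983/solr/mycore")) {
            // extract the full text of the PDF as one big string
            String content = tika.parseToString(new File(args[0]));

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "fustusermanuals#4614");      // id format as in the logs
            doc.addField("content__s_i_suggest", content);   // field named in the exceptions
            solr.add(doc);
            solr.commit();
        }
    }
}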
Since not all documents seem to be "coming through", I looked into the Solr logs and 
found the following exceptions:
org.apache.solr.common.SolrException: Exception writing document id fustusermanuals#4614 to the index; possible analysis error.
        at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:170)
        at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
        at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:931)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1085)
...
        at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
        at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.IllegalArgumentException: Document contains at least one immense term in field="content__s_i_suggest" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: '[10, 32, 10, 32, 10, 10, 70, 82, 32, 77, 111, 100, 101, 32, 100, 39, 101, 109, 112, 108, 111, 105, 32, 10, 10, 32, 10, 10, 32, 10]...', original message: bytes can be at most 32766 in length; got 186493
        at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:667)
        at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:344)
        at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:300)
        at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:231)
        at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:449)
        at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1349)
        at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:242)
        at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:166)
        ... 40 more
Caused by: org.apache.lucene.util.BytesRefHash$MaxBytesLengthExceededException: bytes can be at most 32766 in length; got 186493
        at org.apache.lucene.util.BytesRefHash.add(BytesRefHash.java:284)
        at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:154)
        at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:657)
        ... 47 more

How can I tell Solr/SolrJ to allow more payload?

I also see exceptions like this one:
org.apache.solr.common.SolrException: Exception writing document id fustusermanuals#3323 to the index; possible analysis error.
        at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:170)
        at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
        at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:931)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1085)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:697)
...
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
        at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.IllegalArgumentException: Document contains at least one immense term in field="content__s_i_suggest" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: '[10, 69, 78, 32, 76, 67, 68, 32, 116, 101, 108, 101, 118, 105, 115, 105, 111, 110, 10, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95]...', original message: bytes can be at most 32766 in length; got 164683
        at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:667)
        at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:344)
        at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:300)
        at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:231)
        at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:449)
        at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1349)
        at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:242)
        at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:166)
        ... 40 more
Caused by: org.apache.lucene.util.BytesRefHash$MaxBytesLengthExceededException: bytes can be at most 32766 in length; got 164683
        at org.apache.lucene.util.BytesRefHash.add(BytesRefHash.java:284)
        at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:154)
        at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:657)
        ... 47 more

These seem to result from the same "limitation".

Unfortunately I have to extract the PDFs in my client.
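If there is no server-side setting for this, the only fallback I can think of is trimming 
the extracted text on the client before it goes into that field. A rough sketch (the 
32766 limit is taken from the exception message; the helper name is just for illustration):

import java.nio.charset.StandardCharsets;

public class SuggestFieldTrimmer {

    // Lucene's hard limit for a single indexed term, as reported in the exception
    static final int MAX_TERM_BYTES = 32766;

    // Trims a string so that its UTF-8 encoding stays within maxBytes.
    // Conservative cut: a Java char never needs more than 3 UTF-8 bytes
    // (a surrogate pair uses 4 bytes for 2 chars), so maxBytes / 3 chars is always safe.
    static String trimToUtf8Bytes(String s, int maxBytes) {
        if (s.getBytes(StandardCharsets.UTF_8).length <= maxBytes) {
            return s;
        }
        int cut = Math.min(s.length(), maxBytes / 3);
        if (cut > 0 && Character.isHighSurrogate(s.charAt(cut - 1))) {
            cut--;   // don't split a surrogate pair
        }
        return s.substring(0, cut);
    }
}

i.e. calling trimToUtf8Bytes(content, MAX_TERM_BYTES) right before 
doc.addField("content__s_i_suggest", ...). But that throws away most of the text, so a 
proper setting or analyzer change would be much better.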

Thx
Clemens
