Context: Solr/Lucene 5.1
Adding documents to Solr core/index through SolrJ
I extract PDFs using Tika. The extracted content is one of the fields of the SolrInputDocuments that I transmit to Solr through SolrJ.
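For context, my indexing code looks roughly like this (a simplified sketch: the core URL, the file name, and the use of AutoDetectParser are illustrative; the field names are the real ones from my schema):

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    public class PdfIndexer {
        public static void main(String[] args) throws Exception {
            // Extract the plain text of the PDF with Tika
            // (-1 disables BodyContentHandler's default 100k character write limit).
            BodyContentHandler handler = new BodyContentHandler(-1);
            try (InputStream in = Files.newInputStream(Paths.get("manual.pdf"))) {
                new AutoDetectParser().parse(in, handler, new Metadata(), new ParseContext());
            }

            // Put the extracted text into one field of the SolrInputDocument
            // and send it to Solr via SolrJ.
            try (SolrClient solr = new HttpSolrClient("http://localhost:8983/solr/mycore")) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "fustusermanuals#4614");
                doc.addField("content__s_i_suggest", handler.toString());
                solr.add(doc);
                solr.commit();
            }
        }
    }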
Since not all documents seem to be coming through, I looked into the Solr logs and found the following exceptions:
org.apache.solr.common.SolrException: Exception writing document id fustusermanuals#4614 to the index; possible analysis error.
    at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:170)
    at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
    at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:931)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1085)
    ...
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
    at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.IllegalArgumentException: Document contains at least one immense term in field="content__s_i_suggest" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[10, 32, 10, 32, 10, 10, 70, 82, 32, 77, 111, 100, 101, 32, 100, 39, 101, 109, 112, 108, 111, 105, 32, 10, 10, 32, 10, 10, 32, 10]...', original message: bytes can be at most 32766 in length; got 186493
    at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:667)
    at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:344)
    at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:300)
    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:231)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:449)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1349)
    at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:242)
    at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:166)
    ... 40 more
Caused by: org.apache.lucene.util.BytesRefHash$MaxBytesLengthExceededException: bytes can be at most 32766 in length; got 186493
    at org.apache.lucene.util.BytesRefHash.add(BytesRefHash.java:284)
    at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:154)
    at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:657)
    ... 47 more
How can I tell Solr/SolrJ to allow a larger payload, i.e. to accept terms longer than 32766 bytes?
I also see the same exception for other documents, differing only in the document id and the offending term, e.g.:
org.apache.solr.common.SolrException: Exception writing document id fustusermanuals#3323 to the index; possible analysis error.
    ... (stack trace identical to the one above) ...
Caused by: java.lang.IllegalArgumentException: Document contains at least one immense term in field="content__s_i_suggest" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[10, 69, 78, 32, 76, 67, 68, 32, 116, 101, 108, 101, 118, 105, 115, 105, 111, 110, 10, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95]...', original message: bytes can be at most 32766 in length; got 164683
Caused by: org.apache.lucene.util.BytesRefHash$MaxBytesLengthExceededException: bytes can be at most 32766 in length; got 164683
All of these seem to result from the same limitation: apparently the entire extracted PDF content ends up as a single term in content__s_i_suggest, and Lucene refuses any term whose UTF-8 encoding is longer than 32766 bytes. Unfortunately, I have to extract the PDFs in my client.
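If the limit itself cannot be raised, would truncating the content on the client before sending it be an acceptable workaround? A minimal sketch of the helper I have in mind (Utf8Truncate/truncateUtf8 are hypothetical names of mine; 32766 is the limit quoted in the exception):

    import java.nio.charset.StandardCharsets;

    public final class Utf8Truncate {
        /** Caps the UTF-8 byte length of a value without splitting a multi-byte character. */
        public static String truncateUtf8(String value, int maxBytes) {
            byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
            if (bytes.length <= maxBytes) {
                return value;
            }
            // bytes[end] is the first byte to cut off; step back while it is a
            // UTF-8 continuation byte (10xxxxxx) so the cut lands on a character boundary.
            int end = maxBytes;
            while (end > 0 && (bytes[end] & 0xC0) == 0x80) {
                end--;
            }
            return new String(bytes, 0, end, StandardCharsets.UTF_8);
        }
    }

I would call it right before building the document, e.g. doc.addField("content__s_i_suggest", Utf8Truncate.truncateUtf8(handler.toString(), 32766)) — though of course the truncated tail would then not be searchable.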
Thanks,
Clemens