RE: [Non-DoD Source] Re: Solr 6.1.0 issue (UNCLASSIFIED)

Musshorn, Kris T CTR USARMY RDECOM ARL (US) Fri, 05 Aug 2016 12:56:42 -0700

CLASSIFICATION: UNCLASSIFIED

I forced Nutch to truncate content to a maximum of 32765 before indexing 
into Solr, and that solved my problem.
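
For anyone who wants to do the same truncation in their own code, here is a minimal Java sketch. It is not the Nutch setting used above (the message does not say which one that was); the class and method names are made up for illustration. It cuts a string so that its UTF-8 encoding stays within a byte budget, since the error counts UTF-8 bytes rather than characters. Note that this only helps when the whole field value ends up as a single term.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Nutch or Solr.
class Utf8Truncate {
    /** Returns the longest prefix of s whose UTF-8 encoding is at most maxBytes bytes. */
    static String truncateUtf8(String s, int maxBytes) {
        int bytes = 0;
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            // UTF-8 size of this code point: 1 to 4 bytes.
            int size = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
            if (bytes + size > maxBytes) {
                break; // adding this code point would exceed the budget
            }
            bytes += size;
            i += Character.charCount(cp); // never split a surrogate pair
        }
        return s.substring(0, i);
    }

    public static void main(String[] args) {
        String big = new String(new char[40000]).replace('\0', 'a');
        String cut = truncateUtf8(big, 32765);
        System.out.println(cut.getBytes(StandardCharsets.UTF_8).length); // prints 32765
    }
}
```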

Thanks,
Kris

~~~~~~~~~~~~~~~~~~~~~~~~~~
Kris T. Musshorn
FileMaker Developer - Contractor – Catapult Technology Inc.      
US Army Research Lab 
Aberdeen Proving Ground 
Application Management & Development Branch 
410-278-7251
kris.t.musshorn....@mail.mil
~~~~~~~~~~~~~~~~~~~~~~~~~~

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Friday, August 05, 2016 3:29 PM
To: solr-user <solr-user@lucene.apache.org>
Subject: [Non-DoD Source] Re: Solr 6.1.0 issue (UNCLASSIFIED)

What that error is telling you is that you have an unanalyzed term that is, 
well, huge (i.e. > 32K). Is your "content" field by chance a "string" type? 
It's very rare that a term > 32K is actually useful.
You can't search on it except with, say, wildcards, and there's no stemming 
etc. So the first question is whether the "content" field is appropriately 
defined in your schema for your use case.

If your content field is some kind of text-based field (i.e. 
solr.TextField), then the second issue may be that you just have wonky data 
coming in, say a base-64 encoded image or something scraped from somewhere. In 
that case you need to NOT index it. You can try LengthFilterFactory, see:
https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.LengthFilterFactory.
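
To make the idea of a length filter concrete, here is a rough plain-Java sketch of the behaviour: tokens whose length falls outside a min/max range are dropped and the rest are kept. This is not the Lucene/Solr implementation; the class name, the whitespace tokenization, and the use of String.length() are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of what a token length filter does.
class LengthFilterSketch {
    /** Splits text on whitespace and keeps tokens with min <= length <= max. */
    static List<String> keepTokensInRange(String text, int min, int max) {
        List<String> kept = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // split() yields one empty token for blank input
            }
            int len = token.length();
            if (len >= min && len <= max) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // A 50-character blob stands in for something like base-64 data.
        String blob = new String(new char[50]).replace('\0', 'x');
        System.out.println(keepTokensInRange("short " + blob + " words", 1, 10));
    }
}
```

With a filter like this in the analysis chain, an oversized token is discarded while the normal words around it are still indexed.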

This is a fundamental limitation enforced at the Lucene layer, so if that 
doesn't work, the only real solution is "don't do that". You'll have to 
intercept the doc and omit that data, perhaps write a custom update processor 
to throw out huge fields or the like.
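
As a rough sketch of the "throw out huge fields" logic, assuming a document can be treated as a simple map of field name to string value: the code below is plain Java, not a real Solr update processor, and the class and method names are hypothetical. In Solr the same check would presumably sit inside a custom update processor before the document reaches the index. The 32766 limit is taken from the error message quoted below.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the check a custom update processor might do.
class DropHugeFields {
    // Limit reported in the "immense term" error message.
    static final int MAX_TERM_BYTES = 32766;

    /** Returns a copy of doc without fields whose UTF-8 value exceeds maxBytes. */
    static Map<String, String> withoutHugeFields(Map<String, String> doc, int maxBytes) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : doc.entrySet()) {
            int len = e.getValue().getBytes(StandardCharsets.UTF_8).length;
            if (len <= maxBytes) {
                out.put(e.getKey(), e.getValue()); // small enough, keep it
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("id", "doc-1");
        doc.put("content", new String(new char[40000]).replace('\0', 'a'));
        System.out.println(withoutHugeFields(doc, MAX_TERM_BYTES).keySet());
    }
}
```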

Best,
Erick

On Fri, Aug 5, 2016 at 10:59 AM, Musshorn, Kris T CTR USARMY RDECOM ARL (US) 
<kris.t.musshorn....@mail.mil> wrote:
> CLASSIFICATION: UNCLASSIFIED
>
> I am trying to index from nutch 1.12 to SOLR 6.1.0.
> Got this error.
> java.lang.Exception: 
> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: 
> Error from server at http://localhost:8983/solr/ARLInside: 
> Exception writing document id 
> https://emcstage.arl.army.mil/inside/fellows/corner/research.vol.3.2/index.cfm 
> to the index; possible analysis error: Document 
> contains at least one immense term in field="content" (whose UTF8 
> encoding is longer than the max length 32766
>
> How to correct?
>
> Thanks,
> Kris
>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~
> Kris T. Musshorn
> FileMaker Developer - Contractor - Catapult Technology Inc.
> US Army Research Lab
> Aberdeen Proving Ground
> Application Management & Development Branch
> 410-278-7251
> kris.t.musshorn....@mail.mil
> ~~~~~~~~~~~~~~~~~~~~~~~~~~
>
>
>
> CLASSIFICATION: UNCLASSIFIED

