As Erick mentioned, you may want to check your analysis chain and make sure you are not using *KeywordTokenizer* for the content field, and that the content field's type is not String in your schema.xml. I have seen similar errors before that were caused by KeywordTokenizer.
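A rough sketch of that check, in case it helps. The schema below is a made-up, trimmed-down example (the field type names `keyword_text` and `text_general` and the helper `single_token_risk` are assumptions for illustration, not taken from the actual schema.xml); it only flags a field whose type is `solr.StrField` or whose analyzer uses `solr.KeywordTokenizerFactory`, since both index the whole value as a single term.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down schema for illustration only.
# A real schema.xml has many more field types and fields.
SAMPLE_SCHEMA = """
<schema name="example" version="1.6">
  <fieldType name="string" class="solr.StrField"/>
  <fieldType name="keyword_text" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
    </analyzer>
  </fieldType>
  <fieldType name="text_general" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
  <field name="id" type="string" indexed="true" stored="true"/>
  <field name="content" type="string" indexed="true" stored="true"/>
</schema>
"""


def single_token_risk(schema_xml, field_name):
    """Return a reason string if the field indexes its whole value as one
    term (StrField or KeywordTokenizerFactory), otherwise None."""
    root = ET.fromstring(schema_xml)

    # Find the <field> definition by name.
    field = next((f for f in root.iter("field") if f.get("name") == field_name), None)
    if field is None:
        return None

    # Resolve the <fieldType> that the field points at.
    type_name = field.get("type")
    ftype = next((t for t in root.iter("fieldType") if t.get("name") == type_name), None)
    if ftype is None:
        return None

    if ftype.get("class") == "solr.StrField":
        return f"{field_name}: type '{type_name}' is solr.StrField (unanalyzed)"

    for tokenizer in ftype.iter("tokenizer"):
        if tokenizer.get("class") == "solr.KeywordTokenizerFactory":
            return f"{field_name}: type '{type_name}' uses solr.KeywordTokenizerFactory"

    return None


print(single_token_risk(SAMPLE_SCHEMA, "content"))
```

To run it against a real file, read the schema.xml contents into a string and pass that in place of `SAMPLE_SCHEMA`. Dynamic fields and copyField targets are not handled here.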
Thanks,
Susheel

On Fri, Aug 5, 2016 at 11:46 PM, Erick Erickson <erickerick...@gmail.com> wrote:

> You also need to find out _why_ you're trying to index such huge
> tokens, they indicate that something you're ingesting isn't
> reasonable....
>
> Just truncating the input will index things, true. But a 32K token is
> unexpected, and indicates what's in your index may not be what you
> expect and may not be useful.
>
> But you know what you're indexing best, this is just a general statement.
>
> Erick
>
> On Fri, Aug 5, 2016 at 12:55 PM, Musshorn, Kris T CTR USARMY RDECOM
> ARL (US) <kris.t.musshorn....@mail.mil> wrote:
> > CLASSIFICATION: UNCLASSIFIED
> >
> > What I did was force nutch to truncate content to 32765 max before
> > indexing into solr and it solved my problem.
> >
> > Thanks,
> > Kris
> >
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~
> > Kris T. Musshorn
> > FileMaker Developer - Contractor – Catapult Technology Inc.
> > US Army Research Lab
> > Aberdeen Proving Ground
> > Application Management & Development Branch
> > 410-278-7251
> > kris.t.musshorn....@mail.mil
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~
> >
> > -----Original Message-----
> > From: Erick Erickson [mailto:erickerick...@gmail.com]
> > Sent: Friday, August 05, 2016 3:29 PM
> > To: solr-user <solr-user@lucene.apache.org>
> > Subject: [Non-DoD Source] Re: Solr 6.1.0 issue (UNCLASSIFIED)
> >
> > What that error is telling you is that you have an unanalyzed term
> > that is, well, huge (i.e. > 32K). Is your "content" field by chance a
> > "string" type? It's very rare that a term > 32K is actually useful.
> > You can't search on it except with, say, wildcards; there's no
> > stemming etc. So the first question is whether the "content" field is
> > appropriately defined in your schema for your use case.
> >
> > If your content field is some kind of text-based field (i.e.
> > solr.TextField), then the second issue may be that you just have wonky
> > data coming in, say a base-64 encoded image or something scraped from
> > somewhere. In that case you need to NOT index it. Or try
> > LengthFilterFactory, see:
> > https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.LengthFilterFactory
> >
> > This is a fundamental limitation enforced at the Lucene layer, so if
> > that doesn't work, the only real solution is "don't do that". You'll
> > have to intercept the doc and omit that data, perhaps write a custom
> > update processor to throw out huge fields or the like.
> >
> > Best,
> > Erick
> >
> > On Fri, Aug 5, 2016 at 10:59 AM, Musshorn, Kris T CTR USARMY RDECOM ARL
> > (US) <kris.t.musshorn....@mail.mil> wrote:
> >> I am trying to index from nutch 1.12 to SOLR 6.1.0.
> >> Got this error.
> >> java.lang.Exception:
> >> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
> >> Error from server at http://localhost:8983/solr/ARLInside:
> >> Exception writing document id
> >> https://emcstage.arl.army.mil/inside/fellows/corner/research.vol.3.2/index.cfm
> >> to the index; possible analysis error: Document
> >> contains at least one immense term in field="content" (whose UTF8
> >> encoding is longer than the max length 32766
> >>
> >> How to correct?
> >>
> >> Thanks,
> >> Kris
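
For reference, the limit in the error above is on the UTF-8 encoded length of a single term (32766 bytes), not on the character count. The sketch below shows one way to measure and truncate against that byte limit before a document is sent for indexing. It is only an illustration: the helper names are made up, it is not how Nutch performs its truncation, and splitting on whitespace only approximates what a real analysis chain would produce.

```python
MAX_TERM_BYTES = 32766  # limit quoted in the "immense term" error message


def utf8_len(text):
    """Length of the text in bytes once encoded as UTF-8."""
    return len(text.encode("utf-8"))


def truncate_utf8(text, max_bytes=MAX_TERM_BYTES):
    """Cut text so its UTF-8 encoding fits in max_bytes, without leaving
    a partial multi-byte character at the end."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # errors="ignore" drops the incomplete trailing byte sequence, if any.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


def oversized_tokens(text, max_bytes=MAX_TERM_BYTES):
    """Whitespace-separated tokens that would exceed the byte limit.
    (Assumption: whitespace splitting stands in for the real tokenizer.)"""
    return [tok for tok in text.split() if utf8_len(tok) > max_bytes]


# A 20000-character string of two-byte characters is 40000 bytes in UTF-8,
# so it exceeds the limit even though it is under 32766 characters long.
sample = "é" * 20000
print(utf8_len(sample), utf8_len(truncate_utf8(sample)))
```

Running `oversized_tokens` over a document's content first can also help answer the question of why such a large token is there in the first place, for example a base-64 blob with no whitespace in it.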