As Erick mentioned, you may want to check your analysis chain and make sure
you are not using *KeywordTokenizer* for the content field, and that the
content field is not a string type in your schema.xml. Either one indexes the
whole field value as a single term. I have seen similar errors before due
to KeywordTokenizer being used.
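
For what it's worth, here is a rough Python sketch (nothing that Solr or
Nutch ships) for checking extracted page text before it is indexed. The
32766 figure comes from the error message; the helper names are made up for
illustration, and the whitespace split is only a crude stand-in for a real
tokenizer.

```python
# Per-term byte limit quoted in the Solr error message.
MAX_TERM_BYTES = 32766


def utf8_len(text: str) -> int:
    """Length of text in bytes once encoded as UTF-8."""
    return len(text.encode("utf-8"))


def is_immense_as_single_term(text: str, limit: int = MAX_TERM_BYTES) -> bool:
    """True if the whole value would exceed the limit when indexed as ONE term,
    which is what a string field or KeywordTokenizer does."""
    return utf8_len(text) > limit


def oversized_tokens(text: str, limit: int = MAX_TERM_BYTES) -> list[str]:
    """Whitespace-separated tokens whose UTF-8 encoding exceeds the limit.
    Assumption: whitespace splitting only approximates a real analysis chain."""
    return [tok for tok in text.split() if utf8_len(tok) > limit]
```

If is_immense_as_single_term() is true but oversized_tokens() comes back
empty, the text itself is probably fine and the field definition is the
likely culprit. If oversized_tokens() returns something, the data has a
huge unbroken run in it (an embedded base-64 blob, for instance).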

Thanks,
Susheel

On Fri, Aug 5, 2016 at 11:46 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> You also need to find out _why_ you're trying to index such huge
> tokens; they indicate that something you're ingesting isn't
> reasonable.
>
> Just truncating the input will index things, true. But a 32K token is
> unexpected, and indicates what's in your index may not be what you
> expect and may not be useful.
>
> But you know what you're indexing best; this is just a general statement.
>
> Erick
>
> On Fri, Aug 5, 2016 at 12:55 PM, Musshorn, Kris T CTR USARMY RDECOM
> ARL (US) <kris.t.musshorn....@mail.mil> wrote:
> > CLASSIFICATION: UNCLASSIFIED
> >
> > What I did was force Nutch to truncate content to 32765 max before
> > indexing into Solr, and it solved my problem.
> >
> >
> > Thanks,
> > Kris
> >
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~
> > Kris T. Musshorn
> > FileMaker Developer - Contractor – Catapult Technology Inc.
> > US Army Research Lab
> > Aberdeen Proving Ground
> > Application Management & Development Branch
> > 410-278-7251
> > kris.t.musshorn....@mail.mil
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~
> >
> >
> > -----Original Message-----
> > From: Erick Erickson [mailto:erickerick...@gmail.com]
> > Sent: Friday, August 05, 2016 3:29 PM
> > To: solr-user <solr-user@lucene.apache.org>
> > Subject: [Non-DoD Source] Re: Solr 6.1.0 issue (UNCLASSIFIED)
> >
> > What that error is telling you is that you have an unanalyzed term that
> > is, well, huge (i.e. > 32K). Is your "content" field by chance a "string"
> > type? It's very rare that a term > 32K is actually useful.
> > You can't search on it except with, say, wildcards; there's no stemming,
> > etc. So the first question is whether the "content" field is appropriately
> > defined in your schema for your use case.
> >
> > If your content field is some kind of text-based field (i.e.
> > solr.TextField), then the second issue may be that you just have wonky
> > data coming in, say a base-64 encoded image or something scraped from
> > somewhere. In that case you need to NOT index it. Or you can try
> > LengthFilterFactory, see:
> > https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.LengthFilterFactory
> >
> > This is a fundamental limitation enforced at the Lucene layer, so if
> > that doesn't work, the only real solution is "don't do that". You'll have
> > to intercept the doc and omit that data, perhaps write a custom update
> > processor to throw out huge fields or the like.
> >
> > Best,
> > Erick
> >
> >
> > On Fri, Aug 5, 2016 at 10:59 AM, Musshorn, Kris T CTR USARMY RDECOM ARL
> > (US) <kris.t.musshorn....@mail.mil> wrote:
> >>
> >> I am trying to index from nutch 1.12 to SOLR 6.1.0.
> >> Got this error.
> >> java.lang.Exception:
> >> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
> >> Error from server at http://localhost:8983/solr/ARLInside:
> >> Exception writing document id
> >> https://emcstage.arl.army.mil/inside/fellows/corner/research.vol.3.2/index.cfm
> >> to the index; possible analysis error: Document
> >> contains at least one immense term in field="content" (whose UTF8
> >> encoding is longer than the max length 32766
> >>
> >> How to correct?
> >>
> >> Thanks,
> >> Kris
