Ron,

Have a look at solr.HTMLStripCharFilterFactory:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
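
Note that a char filter runs in the analysis chain, so it changes what gets indexed; the stored value of the field still contains the original markup. If the tags also have to be gone from the stored text, one option is the "write it myself" route: strip them before the data reaches Solr. Below is a minimal sketch of that idea in Python, not anything that ships with Solr. It assumes the database column holds well-formed XML; the function name `xml_to_text` and the sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET


def xml_to_text(xml_string: str) -> str:
    """Collapse an XML document into one blob of text, dropping all tags.

    Hypothetical helper for illustration; assumes well-formed XML input.
    """
    root = ET.fromstring(xml_string)
    # itertext() walks the tree in document order and yields every
    # text fragment (element text and tail text), with no tags.
    pieces = (piece.strip() for piece in root.itertext())
    # Drop whitespace-only fragments and join the rest with single spaces.
    return " ".join(piece for piece in pieces if piece)


if __name__ == "__main__":
    sample = "<doc><title>Some title</title><body>Some <b>bold</b> text.</body></doc>"
    print(xml_to_text(sample))
```

The result ignores document structure entirely, which matches the "one blob of text" requirement. Malformed XML raises `xml.etree.ElementTree.ParseError`, so real data may need a fallback.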

Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

----- Original Message ----
> From: "Olson, Ron" <rol...@lbpc.com>
> To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>
> Sent: Fri, February 18, 2011 4:05:15 PM
> Subject: XML Stripping from DIH
>
> Hi all-
>
> I have some XML in a database that I am trying to index and store; I am
> interested in the various pieces of text, but none of the tags. I've been
> trying to figure out a way to strip all the tags out, but haven't found
> anything within Solr to do so; the XML parser seems to want XPath to get the
> various element values, when all I want is to turn the whole thing into one
> blob of text, regardless of whether it makes any "contextual" sense.
>
> Is there something in Solr to do this, or is it something I'd have to write
> myself (which I'm willing to do if necessary)?
>
> Thanks for any info,
>
> Ron