Re: XML Stripping from DIH

Otis Gospodnetic Sun, 20 Feb 2011 03:59:29 -0800

Ron,

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
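
For reference, here is a rough sketch of the "one blob of text" result the
question asks for, in case writing it yourself turns out to be the simpler
route. It is plain Python run outside Solr, not the char filter linked above.
The function name strip_tags and the sample document are made up for
illustration, and it assumes the stored XML is well-formed. Note that a char
filter works at analysis time, so it changes what is indexed but the stored
value still contains the tags.

```python
import xml.etree.ElementTree as ET


def strip_tags(xml_text: str) -> str:
    """Return the text content of xml_text as a single string, tags removed.

    Hypothetical helper for illustration; assumes xml_text is well-formed XML.
    """
    root = ET.fromstring(xml_text)
    # itertext() yields every text fragment in document order,
    # including the text that follows nested child elements.
    fragments = root.itertext()
    # Join fragments with a space, then collapse runs of whitespace.
    return " ".join(" ".join(fragments).split())


if __name__ == "__main__":
    sample = "<doc><title>Hello</title><body>big <b>blue</b> world</body></doc>"
    print(strip_tags(sample))
```

Joining with a space keeps text from adjacent elements from running together,
at the cost of splitting a word that has markup in the middle of it.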



Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



----- Original Message ----
> From: "Olson, Ron" <rol...@lbpc.com>
> To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>
> Sent: Fri, February 18, 2011 4:05:15 PM
> Subject: XML Stripping from DIH
> 
> Hi all-
> 
> I have some XML in a database that I am trying to index and store; I am
> interested in the various pieces of text, but none of the tags. I've been
> trying to figure out a way to strip all the tags out, but haven't found
> anything within Solr to do so; the XML parser seems to want XPath to get the
> various element values, when all I want is to turn the whole thing into one
> blob of text, regardless of whether it makes any "contextual" sense.
> 
> Is there something in Solr to do this, or is it something I'd have to write
> myself (which I'm willing to do if necessary)?
> 
> Thanks for any info,
> 
> Ron
> 
>
