OK, I contributed it at: https://issues.apache.org/jira/browse/SOLR-887

I changed it to use the Solr class org.apache.solr.analysis.HTMLStripReader.

Thank you all.

Ahmed

On Tue, Nov 18, 2008 at 5:49 AM, Noble Paul നോബിള് नोब्ळ् <[EMAIL PROTECTED]> wrote:

> On Tue, Nov 18, 2008 at 2:49 AM, Ahmed Hammad <[EMAIL PROTECTED]> wrote:
> > Hi All,
> >
> > Although the HTMLStripStandardTokenizerFactory will remove HTML tags, they
> > will still be stored in the index and need to be removed while searching.
> > In my case the HTML tags are not needed at all. So I created an
> > HTMLStripTransformer for the DIH to remove the HTML tags and save space
> > in the index. I used the HTML parser included with Lucene
> > (org.apache.lucene.demo.html). It performs well and worked for me (while
> > working with Lucene before moving to Solr).
> >
> > What do you think? Is it worth contributing?
> Yes. You can contribute this new transformer as an enhancement.
> >
> > My best wishes,
> >
> > Regards,
> > Ahmed
> >
> > On Thu, Nov 6, 2008 at 2:39 AM, Norskog, Lance <[EMAIL PROTECTED]> wrote:
> >
> >> There is a nice HTML stripper inside Solr.
> >> "solr.HTMLStripStandardTokenizerFactory"
> >>
> >> -----Original Message-----
> >> From: Ahmed Hammad [mailto:[EMAIL PROTECTED]
> >> Sent: Wednesday, November 05, 2008 10:43 AM
> >> To: solr-user@lucene.apache.org
> >> Subject: Re: Regex Transformer Error
> >>
> >> Hi,
> >>
> >> It works with the attribute regex="&lt;(.|\n)*?&gt;"
> >>
> >> Sorry for the disturbance.
> >>
> >> Regards,
> >>
> >> ahmd
> >>
> >>
> >> On Wed, Nov 5, 2008 at 8:18 PM, Ahmed Hammad <[EMAIL PROTECTED]> wrote:
> >>
> >> > Hi,
> >> >
> >> > I am using the Solr 1.3 data import handler. One of my table fields has
> >> > HTML tags, and I want to strip them from the field text. So obviously I
> >> > need the Regex Transformer.
> >> >
> >> > I added the transformer="RegexTransformer" attribute to my entity and a
> >> > new field with:
> >> >
> >> > <field sourceColName="content" column="content" regex="English"
> >> > replaceWith="XXXXX"/>
> >> >
> >> > Everything works fine. The text is replaced without any problem. The
> >> > problem happened with my regular expression to strip HTML tags. So I
> >> > use regex="<(.|\n)*?>". Of course the characters '<' and '>' are not
> >> > allowed in XML. I tried the following regex="<(.|\n)*?>" and
> >> > regex="C;(.|\n)*?E;" but I get the following error:
> >> >
> >> > The value of attribute "regex" associated with an element type "field"
> >> > must not contain the '<' character. at
> >> > com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown Source) ...
> >> >
> >> > The full stack trace follows:
> >> >
> >> > *FATAL: Could not create importer. DataImporter config invalid
> >> > org.apache.solr.common.SolrException: FATAL: Could not create importer. DataImporter config invalid
> >> >     at org.apache.solr.handler.dataimport.DataImportHandler.inform(DataImportHandler.java:114)
> >> >     at org.apache.solr.handler.dataimport.DataImportHandler.handleRequestBody(DataImportHandler.java:206)
> >> >     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
> >> >     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1204)
> >> >     at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
> >> >     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
> >> >     at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
> >> >     at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
> >> >     at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
> >> >     at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
> >> >     at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
> >> >     at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
> >> >     at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
> >> >     at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
> >> >     at org.apache.coyote.http11.Http11AprProcessor.process(Http11AprProcessor.java:857)
> >> >     at org.apache.coyote.http11.Http11AprProtocol$Http11ConnectionHandler.process(Http11AprProtocol.java:565)
> >> >     at org.apache.tomcat.util.net.AprEndpoint$Worker.run(AprEndpoint.java:1509)
> >> >     at java.lang.Thread.run(Unknown Source)
> >> > Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException: Exception occurred while initializing context Processing Document #
> >> >     at org.apache.solr.handler.dataimport.DataImporter.loadDataConfig(DataImporter.java:176)
> >> >     at org.apache.solr.handler.dataimport.DataImporter.<init>(DataImporter.java:93)
> >> >     at org.apache.solr.handler.dataimport.DataImportHandler.inform(DataImportHandler.java:106)
> >> >     ... 17 more
> >> > Caused by: org.xml.sax.SAXParseException: The value of attribute "regex" associated with an element type "field" must not contain the '<' character.
> >> >     at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown Source)
> >> >     at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(Unknown Source)
> >> >     at org.apache.solr.handler.dataimport.DataImporter.loadDataConfig(DataImporter.java:166)
> >> >     ... 19 more*
> >> >
> >> > *description* *The server encountered an internal error (FATAL: Could
> >> > not create importer. DataImporter config invalid) that prevented it
> >> > from fulfilling this request.*
> >> >
> >> > I appreciate your help.
> >> >
> >> > Regards,
> >> > ahmd
> >> >
> >
> --
> --Noble Paul
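
For anyone who lands on this thread with the same SAXParseException, here is a rough sketch of how the field from the original question might look with the angle brackets escaped. The attribute names (transformer, column, sourceColName, regex, replaceWith) are the ones used earlier in the thread; the entity name, the query, and the single-space replaceWith value are only assumptions for illustration, not taken from anyone's actual config.

```xml
<!-- Sketch only: entity name and query are hypothetical placeholders. -->
<entity name="article" query="select id, content from article"
        transformer="RegexTransformer">
  <!-- A literal '<' is not allowed inside an XML attribute value, so it is
       written as &lt; here. The XML parser resolves the entities before the
       transformer sees the value, so the regex it receives is <(.|\n)*?>
       The single-space replaceWith is an assumption, chosen so that words
       on either side of a removed tag do not run together. -->
  <field column="content" sourceColName="content"
         regex="&lt;(.|\n)*?&gt;" replaceWith=" "/>
</entity>
```

Strictly speaking only '<' has to be escaped in an attribute value; writing '>' as &gt; is optional but harmless and keeps the pattern symmetric.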