OK, I contributed it at:
https://issues.apache.org/jira/browse/SOLR-887

I changed it to use the Solr class org.apache.solr.analysis.HTMLStripReader.
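
For readers of this thread, here is a minimal sketch of the idea: a row-level step that replaces one column's HTML with plain text before it is indexed. It is not the code attached to SOLR-887. The class name HtmlStripRowSketch, the helper stripTags and the simple tag regex are assumptions made for illustration. A real DIH transformer would extend Solr's Transformer class and could delegate to HTMLStripReader instead of a regex.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class HtmlStripRowSketch {

    // Simplistic tag matcher (an assumption for this sketch). A real HTML
    // stripper also handles comments, scripts and character entities.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Hypothetical helper: remove anything that looks like a tag.
    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    // Row in, row out: replace the named column's value with its tag-free
    // text and leave every other column untouched.
    static Map<String, Object> transformRow(Map<String, Object> row, String column) {
        Object value = row.get(column);
        if (value instanceof String) {
            row.put(column, stripTags((String) value));
        }
        return row;
    }

    public static void main(String[] args) {
        Map<String, Object> row = new HashMap<>();
        row.put("id", 1);
        row.put("content", "<p>Hello <b>Solr</b></p>");
        System.out.println(transformRow(row, "content").get("content"));
    }
}
```

Because the stripping happens on the row before indexing, the stored value no longer carries the markup, which is the space saving discussed below.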

Thank you all.

Ahmed



On Tue, Nov 18, 2008 at 5:49 AM, Noble Paul നോബിള്‍ नोब्ळ् <
[EMAIL PROTECTED]> wrote:

> On Tue, Nov 18, 2008 at 2:49 AM, Ahmed Hammad <[EMAIL PROTECTED]> wrote:
> > Hi All,
> >
> > Although HTMLStripStandardTokenizerFactory removes HTML tags during
> > analysis, the tags are still stored in the index and have to be removed
> > at search time. In my case the HTML tags are not needed at all, so I
> > created an HTMLStripTransformer for the DIH to remove the HTML tags and
> > save space in the index. I used the HTML parser included with Lucene
> > (org.apache.lucene.demo.html). It performs well and worked for me when
> > I was using Lucene, before moving to Solr.
> >
> > What do you think? Is it worth contributing?
> Yes. You can contribute this new transformer as an enhancement.
> >
> > My best wishes,
> >
> > Regards,
> > Ahmed
> >
> > On Thu, Nov 6, 2008 at 2:39 AM, Norskog, Lance <[EMAIL PROTECTED]> wrote:
> >
> >> There is a nice HTML stripper inside Solr.
> >> "solr.HTMLStripStandardTokenizerFactory"
> >>
> >> -----Original Message-----
> >> From: Ahmed Hammad [mailto:[EMAIL PROTECTED]
> >> Sent: Wednesday, November 05, 2008 10:43 AM
> >> To: solr-user@lucene.apache.org
> >> Subject: Re: Regex Transformer Error
> >>
> >> Hi,
> >>
> >> It works with the attribute regex="&lt;(.|\n)*?&gt;"
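
A short sketch of why the escaped form works, assuming only the JDK's built-in XML parser and java.util.regex: the parser turns &lt; and &gt; back into '<' and '>' before the application sees the attribute, so the regex engine receives the original pattern. The class name RegexAttributeDemo and its two helper methods are hypothetical. The field element is a cut-down version of the one quoted further down.

```java
import java.io.StringReader;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class RegexAttributeDemo {

    // Parse a one-element XML snippet and return its "regex" attribute
    // exactly as the XML parser hands it to the application.
    static String readRegexAttribute(String xml) {
        try {
            Element field = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            return field.getAttribute("regex");
        } catch (Exception e) {
            // Malformed XML (for example a raw '<' in the attribute) ends up here.
            throw new IllegalStateException("config is not well-formed XML", e);
        }
    }

    // Remove every match of the pattern from the text.
    static String strip(String regex, String text) {
        return Pattern.compile(regex).matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // In the XML text the pattern is written with &lt; and &gt;.
        String xml = "<field column=\"content\" regex=\"&lt;(.|\\n)*?&gt;\"/>";
        String regex = readRegexAttribute(xml);
        System.out.println(regex);
        System.out.println(strip(regex, "<p>Hello <b>Solr</b></p>"));
    }
}
```

Passing the same element with a raw '<' inside the attribute makes readRegexAttribute fail, which corresponds to the SAXParseException in the original report below.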
> >>
> >> Sorry for the disturbance.
> >>
> >> Regards,
> >>
> >> ahmd
> >>
> >>
> >> On Wed, Nov 5, 2008 at 8:18 PM, Ahmed Hammad <[EMAIL PROTECTED]> wrote:
> >>
> >> > Hi,
> >> >
> >> > I am using the Solr 1.3 data import handler. One of my table fields
> >> > contains HTML tags, and I want to strip them from the field text, so
> >> > I need the Regex Transformer.
> >> >
> >> > I added the transformer="RegexTransformer" attribute to my entity and
> >> > a new field with:
> >> >
> >> > <field sourceColName="content" column="content" regex="English"
> >> > replaceWith="XXXXX"/>
> >> >
> >> > Everything works fine; the text is replaced without any problem. The
> >> > problem happens with my regular expression to strip HTML tags, for
> >> > which I use regex="<(.|\n)*?>". Of course a literal '<' is not
> >> > allowed in an XML attribute value, so I tried regex="&lt;(.|\n)*?&gt;"
> >> > and regex="&#3C;(.|\n)*?&#3E;", but I get the following error:
> >> >
> >> > The value of attribute "regex" associated with an element type "field"
> >> > must not contain the '<' character. at
> >> > com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown
> >> > Source) ...
> >> >
> >> > The full stack trace follows:
> >> >
> >> > *FATAL: Could not create importer. DataImporter config invalid
> >> > org.apache.solr.common.SolrException: FATAL: Could not create importer. DataImporter config invalid
> >> >     at org.apache.solr.handler.dataimport.DataImportHandler.inform(DataImportHandler.java:114)
> >> >     at org.apache.solr.handler.dataimport.DataImportHandler.handleRequestBody(DataImportHandler.java:206)
> >> >     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
> >> >     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1204)
> >> >     at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
> >> >     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
> >> >     at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
> >> >     at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
> >> >     at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
> >> >     at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
> >> >     at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
> >> >     at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
> >> >     at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
> >> >     at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
> >> >     at org.apache.coyote.http11.Http11AprProcessor.process(Http11AprProcessor.java:857)
> >> >     at org.apache.coyote.http11.Http11AprProtocol$Http11ConnectionHandler.process(Http11AprProtocol.java:565)
> >> >     at org.apache.tomcat.util.net.AprEndpoint$Worker.run(AprEndpoint.java:1509)
> >> >     at java.lang.Thread.run(Unknown Source)
> >> > Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException: Exception occurred while initializing context Processing Document #
> >> >     at org.apache.solr.handler.dataimport.DataImporter.loadDataConfig(DataImporter.java:176)
> >> >     at org.apache.solr.handler.dataimport.DataImporter.<init>(DataImporter.java:93)
> >> >     at org.apache.solr.handler.dataimport.DataImportHandler.inform(DataImportHandler.java:106)
> >> >     ... 17 more
> >> > Caused by: org.xml.sax.SAXParseException: The value of attribute "regex" associated with an element type "field" must not contain the '<' character.
> >> >     at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown Source)
> >> >     at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(Unknown Source)
> >> >     at org.apache.solr.handler.dataimport.DataImporter.loadDataConfig(DataImporter.java:166)
> >> >     ... 19 more *
> >> >
> >> > *description* *The server encountered an internal error (FATAL: Could
> >> > not create importer. DataImporter config invalid, followed by the same
> >> > stack trace as above) that prevented it from fulfilling this request.*
> >> >
> >> > I appreciate your help.
> >> >
> >> > Regards,
> >> > ahmd
> >> >
> >> >
> >>
> >
>
>
>
> --
> --Noble Paul
>
