Sounds like "solr.HTMLStripCharFilter" may work... except, I'm getting a couple of problems:
1) HTML still seems to be getting into my content field All I did was add <charFilter class="solr.HTMLStripCharFilterFactory" /> to the index analyzer for the my "text" fieldType. 2) Some it seems to have broken my highlighting, I get this error: 'org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token wrong exceeds length of provided text sized 3862' Any ideas how I can fix this? -----Original Message----- From: Lance Norskog [mailto:goks...@gmail.com] Sent: 09 March 2010 04:36 To: solr-user@lucene.apache.org Subject: Re: HTML encode extracted docs A Tika integration with the DataImportHandler is in the Solr trunk. With this, you can copy the raw HTML into different fields and process one copy with Tika. If it's just straight HTML, would the HTMLStripCharFilter be good enough? http://www.lucidimagination.com/search/document/CDRG_ch05_5.7.2 On Mon, Mar 8, 2010 at 5:50 AM, Mark Roberts <mark.robe...@red-gate.com> wrote: > I'm uploading .htm files to be extracted - some of these files are "include" > files that have snippets of HTML rather than fully formed html documents. > > solr-cell stores the raw HTML for these items, rather than extracting the > text. Is there any way I can get solr to encode this content prior to storing > it? > > At the moment, I have the problem that when the highlighted snippets are > retrieved via search, I need to parse the snippet and HTML encode the bits of > HTML that where indexed, whilst *not* encoding the bits that where added by > the highlighter, which is messy and time consuming. > > Thanks! Mark, > -- Lance Norskog goks...@gmail.com