RE: HTML encode extracted docs - Problems with solr.HTMLStripCharFilter

Mark Roberts Tue, 09 Mar 2010 07:24:05 -0800

Sounds like "solr.HTMLStripCharFilter" may work... except, I'm getting a couple 
of problems:


1) HTML still seems to be getting into my content field

All I did was add <charFilter class="solr.HTMLStripCharFilterFactory" /> to the 
index analyzer for the my "text" fieldType.


2) Some it seems to have broken my highlighting, I get this error:

'org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token wrong 
exceeds length of provided text sized 3862'



Any ideas how I can fix this?





-----Original Message-----
From: Lance Norskog [mailto:goks...@gmail.com] 
Sent: 09 March 2010 04:36
To: solr-user@lucene.apache.org
Subject: Re: HTML encode extracted docs

A Tika integration with the DataImportHandler is in the Solr trunk.
With this, you can copy the raw HTML into different fields and process
one copy with Tika.

If it's just straight HTML, would the HTMLStripCharFilter be good enough?

http://www.lucidimagination.com/search/document/CDRG_ch05_5.7.2

On Mon, Mar 8, 2010 at 5:50 AM, Mark Roberts <mark.robe...@red-gate.com> wrote:
> I'm uploading .htm files to be extracted - some of these files are "include" 
> files that have snippets of HTML rather than fully formed html documents.
>
> solr-cell stores the raw HTML for these items, rather than extracting the 
> text. Is there any way I can get solr to encode this content prior to storing 
> it?
>
> At the moment, I have the problem that when the highlighted snippets are  
> retrieved via search, I need to parse the snippet and HTML encode the bits of 
> HTML that where indexed, whilst *not* encoding the bits that where added by 
> the highlighter, which is messy and time consuming.
>
> Thanks! Mark,
>



-- 
Lance Norskog
goks...@gmail.com

RE: HTML encode extracted docs - Problems with solr.HTMLStripCharFilter

Reply via email to