Did anybody find a way to fix this other than removing the HTMLStripCharFilter from the analyzer used during indexing?
Thanks

On Sat, Mar 13, 2010 at 7:55 PM, Lance Norskog <goks...@gmail.com> wrote:

> HTMLStripCharFilter is only in the analyzer: it creates searchable
> terms from the HTML input. The raw HTML is stored and fetched.
>
> There are some bugs in term positions and highlighting. An
> EntityProcessor wrapping the HTMLStripCharFilter would be really
> useful.
>
> On Tue, Mar 9, 2010 at 5:31 AM, Mark Roberts <mark.robe...@red-gate.com> wrote:
> > Sounds like "solr.HTMLStripCharFilter" may work... except, I'm getting a
> > couple of problems:
> >
> > 1) HTML still seems to be getting into my content field
> >
> > All I did was add <charFilter class="solr.HTMLStripCharFilterFactory" />
> > to the index analyzer for my "text" fieldType.
> >
> > 2) Somehow it seems to have broken my highlighting; I get this error:
> >
> > 'org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token
> > wrong exceeds length of provided text sized 3862'
> >
> > Any ideas how I can fix this?
> >
> > -----Original Message-----
> > From: Lance Norskog [mailto:goks...@gmail.com]
> > Sent: 09 March 2010 04:36
> > To: solr-user@lucene.apache.org
> > Subject: Re: HTML encode extracted docs
> >
> > A Tika integration with the DataImportHandler is in the Solr trunk.
> > With this, you can copy the raw HTML into different fields and process
> > one copy with Tika.
> >
> > If it's just straight HTML, would the HTMLStripCharFilter be good enough?
> >
> > http://www.lucidimagination.com/search/document/CDRG_ch05_5.7.2
> >
> > On Mon, Mar 8, 2010 at 5:50 AM, Mark Roberts <mark.robe...@red-gate.com> wrote:
> >> I'm uploading .htm files to be extracted - some of these files are
> >> "include" files that have snippets of HTML rather than fully formed
> >> HTML documents.
> >>
> >> solr-cell stores the raw HTML for these items, rather than extracting
> >> the text. Is there any way I can get Solr to encode this content prior
> >> to storing it?
> >>
> >> At the moment, I have the problem that when the highlighted snippets are
> >> retrieved via search, I need to parse the snippet and HTML encode the bits
> >> of HTML that were indexed, whilst *not* encoding the bits that were added
> >> by the highlighter, which is messy and time consuming.
> >>
> >> Thanks!
> >> Mark
> >>
> >
> > --
> > Lance Norskog
> > goks...@gmail.com
>
> --
> Lance Norskog
> goks...@gmail.com

--
"A person who never made a mistake never tried anything new."
Albert Einstein
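
As noted in the quoted thread, the char filter only changes the indexed terms; the stored value stays raw HTML. One workaround that avoids the analyzer entirely is to clean the content on the client before posting it to Solr. A minimal sketch, assuming a Python client and only the standard library; `strip_html` and `escape_for_storage` are hypothetical helper names, not part of Solr or solr-cell:

```python
import html
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes; drops tags, comments, and script/style bodies."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse whitespace runs left behind by the removed markup.
        # Note: text from adjacent elements is joined without a separator.
        return " ".join("".join(self._chunks).split())


def strip_html(fragment):
    """Return the plain text of an HTML document or partial snippet."""
    parser = _TextExtractor()
    parser.feed(fragment)
    parser.close()  # flush any text still buffered (unclosed fragments)
    return parser.text()


def escape_for_storage(fragment):
    """Alternative: keep the markup visible but inert by HTML-encoding it."""
    return html.escape(fragment)


if __name__ == "__main__":
    snippet = "<div>An include <b>snippet</b> &amp; nothing else"
    print(strip_html(snippet))
    print(escape_for_storage("<b>bold</b>"))
```

With either helper applied before indexing, the stored field no longer contains live markup, so the only tags in a highlighted snippet should be the ones the highlighter added. Whether this also avoids the InvalidTokenOffsetsException has not been tested here.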