No problem: wrapping and unwrapping escaped text can be very confusing.

On Fri, Mar 26, 2010 at 6:31 AM, Niraj Aswani <n.asw...@dcs.shef.ac.uk> wrote:
> Hi Lance,
>
> Apologies, please ignore my previous mail. I'll have a look at the
> PatternReplaceFilter.
>
> Thanks,
> Niraj
>
> Niraj Aswani wrote:
>>
>> Hi Lance,
>>
>> Yes, that is one solution, but wouldn't it stop people searching for
>> something like "<choice" in the first place? I mean, if I encode such
>> characters at index time, one would have to write a query like
>> "&lt;choice". Am I right?
>>
>> Thanks,
>> Niraj
>>
>> Lance Norskog wrote:
>>>
>>> To display html-markup in an html page, it has to be in entity-encoded
>>> form. So, encode the <> as entities in your input application, and
>>> have it indexed and stored in this format. Then, the <b><u> are
>>> inserted as normal. This gives you the html text displayable in an
>>> html page, with all words highlightable. And add gt/lt etc. as
>>> stopwords.
>>>
>>> At this point you have the element names, attribute names and values,
>>> and text parts searchable and highlightable. If you only want the HTML
>>> syntax parts shown, the PatternReplaceFilter is your friend: with
>>> regex patterns you can pull out those values and ignore the text
>>> parts.
>>>
>>> The analysis.jsp page will make it much, much easier to debug this.
>>>
>>> Good luck!
>>>
>>> On Thu, Mar 25, 2010 at 8:21 AM, Niraj Aswani <n.asw...@dcs.shef.ac.uk>
>>> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I am using the following two parameters to highlight the hits.
>>>>
>>>> "hl.simple.pre=" + URLEncoder.encode("<b><u>")
>>>> "hl.simple.post=" + URLEncoder.encode("</u></b>")
>>>>
>>>> This seems to work. However, there is a bit of trouble when the text
>>>> itself contains html markup.
>>>>
>>>> For example, I have indexed a document with the following text in it.
>>>> =======
>>>> something here...
>>>> <choice minOccurs="1" maxOccurs="unbounded">xyz</choice>
>>>> something here..
>>>> =======
>>>>
>>>> When I search for the keyword choice, it inserts "<b><u>" just
>>>> before the word choice and "</u></b>" immediately after it. The
>>>> result looks like this:
>>>>
>>>> <<b><u>choice</u></b> minOccurs="1"
>>>> maxOccurs="unbounded">xyz</<b><u>choice</u></b>>
>>>>
>>>> I would like it to be something like:
>>>>
>>>> &lt;<b><u>choice</u></b> minOccurs="1"
>>>> maxOccurs="unbounded"&gt;xyz&lt;/<b><u>choice</u></b>&gt;
>>>>
>>>> Is there any way to do it such that the highlighted content is
>>>> encoded as HTML but the prefix and suffix are not?
>>>>
>>>> Thanks,
>>>> Niraj
>>>>
>>>> When I issue a query, it returns all the correct
>>>>
-- 
Lance Norskog
goks...@gmail.com
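
A rough illustration of the approach discussed in this thread, offered as a sketch and not as code from either poster: entity-encode the text before it is indexed and stored, then let the highlighter wrap hits with the unencoded prefix and suffix. The class and method names (HighlightEscapeSketch, escapeHtml, wrapTerm, highlightParams) are hypothetical, and wrapTerm is only a naive stand-in for what Solr's highlighter does on the server side; it is not a Solr API.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class HighlightEscapeSketch {

    // Entity-encode the characters that would otherwise be read as markup.
    // This would run in the input application, before the text is indexed.
    public static String escapeHtml(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    // Build the two highlighting parameters from the original mail,
    // using the two-argument URLEncoder.encode with an explicit charset.
    public static String highlightParams(String pre, String post) {
        try {
            return "hl.simple.pre=" + URLEncoder.encode(pre, "UTF-8")
                 + "&hl.simple.post=" + URLEncoder.encode(post, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(e);
        }
    }

    // Hypothetical stand-in for the highlighter: wrap every literal
    // occurrence of the term in the stored (already encoded) text.
    public static String wrapTerm(String stored, String term, String pre, String post) {
        return stored.replace(term, pre + term + post);
    }

    public static void main(String[] args) {
        String raw = "<choice minOccurs=\"1\" maxOccurs=\"unbounded\">xyz</choice>";
        String stored = escapeHtml(raw);
        System.out.println(stored);
        System.out.println(wrapTerm(stored, "choice", "<b><u>", "</u></b>"));
        System.out.println(highlightParams("<b><u>", "</u></b>"));
    }
}
```

Because the stored text now contains the literal tokens lt, gt and amp, a search for one of those would highlight inside an entity and break it, which is presumably why adding them as stopwords was suggested.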