No problem: wrapping and unwrapping escaped text can be very confusing.
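
For anyone reading this in the archive, here is a minimal sketch of the idea from earlier in the thread: entity-encode the text before it is indexed and stored, so that the only real markup in the displayed snippet is the highlighter's own prefix and suffix. The class and method names are hypothetical, and highlight() is only a regex stand-in for what the Solr highlighter does with hl.simple.pre and hl.simple.post; it is not Solr code.

```java
// Sketch only: HighlightEscapeSketch, escapeHtml and highlight are
// hypothetical names used for illustration, not part of Solr.
final class HighlightEscapeSketch {

    // Entity-encode the characters that a browser would read as markup.
    // This covers element text only; attribute values would also need
    // their quote characters encoded.
    static String escapeHtml(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    // Stand-in for the highlighter: wraps every whole-word match of term
    // in the given prefix and suffix, leaving the rest of the text alone.
    static String highlight(String escaped, String term, String pre, String post) {
        String pattern = "\\b" + java.util.regex.Pattern.quote(term) + "\\b";
        String replacement = java.util.regex.Matcher.quoteReplacement(pre)
                + "$0"
                + java.util.regex.Matcher.quoteReplacement(post);
        return escaped.replaceAll(pattern, replacement);
    }

    public static void main(String[] args) {
        String raw = "<choice minOccurs=\"1\" maxOccurs=\"unbounded\">xyz</choice>";
        // Escape first (at index time), highlight second (at query time).
        String stored = escapeHtml(raw);
        System.out.println(highlight(stored, "choice", "<b><u>", "</u></b>"));
    }
}
```

Because the escaping happens before the highlighter runs, the inserted <b><u> tags stay as real markup while the document's own angle brackets arrive as entities. The catch is that a search for "lt", "gt" or "amp" would now match inside the entities, which is why those belong in the stopword list.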

On Fri, Mar 26, 2010 at 6:31 AM, Niraj Aswani <n.asw...@dcs.shef.ac.uk> wrote:
> Hi Lance,
>
> apologies.. please ignore my previous mail.  I'll have a look at the
> PatternReplaceFilter.
>
> Thanks,
> Niraj
>
> Niraj Aswani wrote:
>>
>> Hi Lance,
>>
>> Yes, that is one solution, but wouldn't it stop people searching for
>> something like "<choice" in the first place?  I mean, if I encode such
>> characters at the index time, one would have to write a query like
>> "&lt;choice".  Am I right?
>>
>> Thanks,
>> Niraj
>>
>> Lance Norskog wrote:
>>>
>>> To display html-markup in an html page, it has to be in entity-encoded
>>> form. So, encode the <> as entities in your input application, and
>>> have it indexed and stored in this format. Then, the <b><u> are
>>> inserted as normal. This gives you the html text displayable in an
>>> html page, with all words highlightable. And add gt/lt etc. as
>>> stopwords.
>>>
>>> At this point you have the element names, attribute names and values,
>>> and text parts searchable and highlightable. If you only want the HTML
>>> syntax parts shown, the PatternReplaceFilter is your friend: with
>>> regex patterns you can pull out those values and ignore the text
>>> parts.
>>>
>>> The analysis.jsp page will make it much much easier to debug this.
>>>
>>> Good luck!
>>>
>>> On Thu, Mar 25, 2010 at 8:21 AM, Niraj Aswani <n.asw...@dcs.shef.ac.uk> wrote:
>>>
>>>>
>>>> Hi,
>>>>
>>>> I am using the following two parameters to highlight the hits.
>>>>
>>>> "hl.simple.pre=" + URLEncoder.encode("<b><u>", "UTF-8")
>>>> "hl.simple.post=" + URLEncoder.encode("</u></b>", "UTF-8")
>>>>
>>>> This seems to work.  However, there is a bit of trouble when the text
>>>> itself contains html markup.
>>>>
>>>> For example, I have indexed a document with the following text in it.
>>>> =======
>>>> something here...
>>>> <choice minOccurs="1" maxOccurs="unbounded">xyz</choice>
>>>> something here..
>>>> =======
>>>>
>>>> When I search for the keyword choice, the highlighter inserts "<b><u>"
>>>> just before the word choice and "</u></b>" immediately after it.
>>>> The result looks like this:
>>>>
>>>> <<b><u>choice</u></b> minOccurs="1"
>>>> maxOccurs="unbounded">xyz</<b><u>choice</u></b>>
>>>>
>>>>
>>>> I would like it to be something like:
>>>>
>>>> &lt;<b><u>choice</u></b> minOccurs="1"
>>>> maxOccurs="unbounded"&gt;xyz&lt;/<b><u>choice</u></b>&gt;
>>>>
>>>> Is there any way to do it such that the highlight content is encoded as
>>>> HTML but the prefix and suffix are not?
>>>>
>>>> Thanks,
>>>> Niraj
>>>>
>>>>
>>>>
>>>> When I issue a query, it returns all the correct
>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>
>



-- 
Lance Norskog
goks...@gmail.com
