Indexing link targets in HTML fragments

Andrew Clegg Sun, 06 Jun 2010 12:16:32 -0700

Hi Solr gurus,

I'm wondering whether there is an easy way to keep the hyperlink targets in
a field that may contain HTML fragments, while stripping the rest of the HTML.


e.g. if I had a field that looked like this:

"This is the entire content of my field, but <a href="http://example.com/">some of
the words</a> are a hyperlink."

Then I'd like to keep "http://example.com/" as a single token (along with
all of the actual words) but not the "a" and "href", giving me:

"This is the entire content of my field but http://example.com/ some of the
words are a hyperlink"

I'm thinking that since we're dealing with individual fragments rather than
entire HTML pages, Tika/SolrCell may be poorly suited and/or too heavyweight
-- but please correct me if I'm wrong.

Maybe something using regular expressions? Does anyone have a code snippet
they could share?
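
To make the intent concrete, here is a rough sketch of the transformation I
have in mind, written in plain Python rather than as a Solr analyzer. The
function name strip_html_keep_links and both regexes are made up for
illustration, and the sample fragment assumes the link was an ordinary
<a href="..."> tag around "some of the words". A regex like this will not cope
with every kind of malformed HTML, so treat it as a starting point only:

```python
import re

# Hypothetical pattern: an opening <a ...> tag, capturing the quoted href value.
A_HREF = re.compile(
    r'<a\b[^>]*?\bhref\s*=\s*["\']([^"\']*)["\'][^>]*>',
    re.IGNORECASE,
)
# Any remaining tag, opening or closing.
ANY_TAG = re.compile(r'<[^>]+>')


def strip_html_keep_links(text):
    """Strip HTML tags but leave each link target in place as plain text."""
    # Replace each opening anchor tag with its target URL, padded with spaces
    # so the URL stays a separate whitespace-delimited token.
    text = A_HREF.sub(lambda m: ' ' + m.group(1) + ' ', text)
    # Drop every other tag, including the closing </a>.
    text = ANY_TAG.sub(' ', text)
    # Collapse the runs of whitespace left behind.
    return ' '.join(text.split())


fragment = ('This is the entire content of my field, but '
            '<a href="http://example.com/">some of the words</a> '
            'are a hyperlink.')
print(strip_html_keep_links(fragment))
```

Punctuation is left alone here, on the assumption that whatever tokenizer runs
afterwards would deal with it, and the URL would still need a tokenizer that
does not split it apart.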

Many thanks,

Andrew.

-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Indexing-link-targets-in-HTML-fragments-tp874547p874547.html
Sent from the Solr - User mailing list archive at Nabble.com.
