Hi Solr gurus, I'm wondering if there is an easy way to keep the targets of hyperlinks from a field which may contain HTML fragments, while stripping the HTML.
For example, if I had a field that looked like this:

"This is the entire content of my field, but <a href="http://example.com/">some of the words</a> are a hyperlink."

then I'd like to keep "http://example.com/" as a single token (along with all of the actual words) but not the "a" and "href", giving me:

"This is the entire content of my field but http://example.com/ some of the words are a hyperlink"

I'm thinking that since we're dealing with individual fragments rather than entire HTML pages, Tika/SolrCell may be poorly suited and/or too heavyweight, but please correct me if I'm wrong. Maybe something using regular expressions? Does anyone have a code snippet they could share?

Many thanks,
Andrew.
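
To make the regular-expression idea concrete, here is a minimal sketch in plain Python, run as a preprocessing step before the text is sent to Solr. It is only an illustration under stated assumptions: the function name keep_link_targets and both patterns are hypothetical, the regexes handle only simple, well-formed anchor tags with quoted href values, and whether the URL survives as a single token still depends on the tokenizer configured for the field (a whitespace-based tokenizer is assumed here).

```python
import html
import re

# Hypothetical pattern: an opening <a ...> tag whose href value is quoted
# with either " or '. Group 2 captures the link target.
ANCHOR_OPEN = re.compile(
    r'<a\b[^>]*?\bhref\s*=\s*(["\'])(.*?)\1[^>]*>',
    re.IGNORECASE | re.DOTALL,
)

# Hypothetical pattern: any remaining tag, such as </a>, <b>, or <br/>.
ANY_TAG = re.compile(r'<[^>]+>')


def keep_link_targets(fragment: str) -> str:
    """Strip HTML tags from a fragment but keep each hyperlink's target."""
    # Replace every opening anchor tag with its target, padded with spaces
    # so the URL does not fuse with the neighbouring words.
    text = ANCHOR_OPEN.sub(lambda m: ' ' + m.group(2) + ' ', fragment)
    # Drop all other tags, including the closing </a>.
    text = ANY_TAG.sub(' ', text)
    # Decode entities such as &amp; that may appear in the text or the URL.
    text = html.unescape(text)
    # Collapse the runs of whitespace left behind by the substitutions.
    return ' '.join(text.split())


fragment = ('This is the entire content of my field, but '
            '<a href="http://example.com/">some of the words</a> '
            'are a hyperlink.')
print(keep_link_targets(fragment))
```

The same two-step idea (rewrite the anchor tag to its target first, then strip whatever tags remain) might also be expressible inside Solr's analysis chain with a pattern-replace char filter placed ahead of an HTML-stripping char filter, but that is an assumption to check against the documentation for the Solr version in use.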