/HTMLStripWhitespaceTokenizerFactory.java
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
----- Original Message -----
> From: "McBride, John" <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Thursday, May 22, 2008 4:44:23 AM
> Subject: Inde
Actually, it's very easy: http://us2.php.net/strip_tags
I also store the data in a separate field with the html intact for
display. In that case, I use urlencode on the string.
David
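
For readers indexing from Java rather than PHP, the same idea (strip the tags before indexing, keep the raw HTML in a separate field for display) can be sketched roughly as below. This is only a naive regex-based illustration, not an equivalent of PHP's strip_tags: the class name NaiveTagStripper is hypothetical, and it does not handle comments, script/style bodies, or character entities. It replaces each tag with a space, so inline markup in the middle of a word would split that word.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: crude tag removal for text that will be indexed. */
public final class NaiveTagStripper {
    // Anything between '<' and the next '>' is treated as a tag (an assumption
    // that breaks on '>' inside attribute values, comments, or scripts).
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    private NaiveTagStripper() {}

    public static String strip(String html) {
        // Replace each tag with a space so adjacent words do not get glued together.
        String withoutTags = TAG.matcher(html).replaceAll(" ");
        // Collapse the runs of whitespace left behind and trim the ends.
        return WHITESPACE.matcher(withoutTags).replaceAll(" ").trim();
    }
}
```

A real HTML parser is the safer choice for anything beyond simple, well-formed article markup.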
McBride, John wrote:
Hello,
In my application I wish to index articles which are stored in HTML
format.
Up
Hi,
Maybe this one?
http://htmlparser.sourceforge.net/
/Jimi
Quoting "McBride, John" <[EMAIL PROTECTED]>:
Hello,
In my application I wish to index articles which are stored in HTML
format.
Upon indexing these the html gets stored along with the content of the
article, which is undesirable.
Hello,
In my application I wish to index articles which are stored in HTML
format.
Upon indexing these the html gets stored along with the content of the
article, which is undesirable.
Do you know of any common way of parsing the text content from HTML
before adding to SOLR? I understand SOLR 1
Thanks Jérôme!
It seems to work now. I just hope the provided
HTMLStripWhitespaceTokenizerFactory will strip the right tags now.
I use Java and used HtmlEncoder provided in
http://itext.ugent.be/library/api/ for encoding with success. (just
in case someone happens to search this thread)
Ravi
You need to encode your HTML content so it can be included as a normal
'string' value in your XML element.
As far as I remember, the only unsafe characters you have to encode as
entities are:
< -> &lt;
> -> &gt;
" -> &quot;
& -> &amp;
(google xml entities to be sure).
I don't know what language you use, but fo
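
In Java, the escaping described above could look roughly like the following. This is a minimal sketch, not a library API: the class XmlFieldEscaper and its methods are hypothetical names, and the field name passed in is assumed to be a plain identifier that itself needs no escaping. It handles only the four characters listed in the message.

```java
/** Hypothetical helper: escapes HTML so it can sit inside a Solr <field> element. */
public final class XmlFieldEscaper {
    private XmlFieldEscaper() {}

    /** Replaces &, <, > and " with their XML entities. */
    public static String escape(String raw) {
        StringBuilder out = new StringBuilder(raw.length() + 16);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    /** Builds a field element for an XML update message; fieldName is assumed safe. */
    public static String toFieldElement(String fieldName, String rawHtml) {
        return "<field name=\"" + fieldName + "\">" + escape(rawHtml) + "</field>";
    }
}
```

For example, XmlFieldEscaper.toFieldElement("body", "<p>A & B</p>") yields a field whose content is &lt;p&gt;A &amp; B&lt;/p&gt;, where "body" is just an illustrative field name.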
Hello,
Sorry for the stupid question. I'm trying to index an HTML file as one of
the fields in Solr. I've set up an appropriate analyzer in the schema, but I'm
not sure how to add HTML content to Solr. Encapsulating HTML content
within a field tag is obviously not valid. How do I add HTML content?
Hope the query