RE: Indexing HTML Content

2008-05-22 Thread Lance Norskog
/HTMLStripWhitespaceTokenizerFactory.java Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: "McBride, John" <[EMAIL PROTECTED]> > To: solr-user@lucene.apache.org > Sent: Thursday, May 22, 2008 4:44:23 AM > Su

Re: Indexing HTML Content

2008-05-22 Thread Otis Gospodnetic
/HTMLStripWhitespaceTokenizerFactory.java Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: "McBride, John" <[EMAIL PROTECTED]> > To: solr-user@lucene.apache.org > Sent: Thursday, May 22, 2008 4:44:23 AM > Subject: Inde

Re: Indexing HTML Content

2008-05-22 Thread David Arpad Geller
Actually, it's very easy: http://us2.php.net/strip_tags I also store the data in a separate field with the html intact for display. In that case, I use urlencode on the string. David McBride, John wrote: Hello, In my application I wish to index articles which are stored in HTML format. Up

Re: Indexing HTML Content

2008-05-22 Thread solr
Hi, Maybe this one? http://htmlparser.sourceforge.net/ /Jimi Quoting "McBride, John" <[EMAIL PROTECTED]>: Hello, In my application I wish to index articles which are stored in HTML format. Upon indexing these the html gets stored along with the content of the article, which is undesirable.

Indexing HTML Content

2008-05-22 Thread McBride, John
Hello, In my application I wish to index articles which are stored in HTML format. Upon indexing these the html gets stored along with the content of the article, which is undesirable. Do you know of any common way of parsing the text content from HTML before adding to SOLR? I understand SOLR 1

Re: Indexing HTML content... (Embed HTML into XML?)

2007-08-22 Thread Ravish Bhagdev
Thanks Jérôme! It seems to work now. I just hope the provided HTMLStripWhitespaceTokenizerFactory will strip the right tags now. I use Java and used HtmlEncoder provided in http://itext.ugent.be/library/api/ for encoding with success. (just in case someone happens to search this thread) Ravi

Re: Indexing HTML content... (Embed HTML into XML?)

2007-08-22 Thread Jérôme Etévé
You need to encode your HTML content so it can be included as a normal 'string' value in your XML element. As far as I remember, the only unsafe characters you have to encode as entities are: < -> &lt; > -> &gt; " -> &quot; & -> &amp; (google xml entities to be sure). I don't know what language you use, but fo
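
For readers landing on this thread: the encoding step described above (replacing <, >, " and & with XML entities before placing HTML inside a field element) might look roughly like the following in Java. This is only a minimal sketch, not the code anyone in the thread actually used; the class name XmlEscapeSketch, the helper toField, and the field name "content" are hypothetical and chosen purely for illustration.

```java
final class XmlEscapeSketch {
    private XmlEscapeSketch() {}

    /** Replaces the four characters named in the thread with their XML entities. */
    static String escapeXml(String raw) {
        StringBuilder out = new StringBuilder(raw.length() + 16);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                case '&': out.append("&amp;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    /** Wraps escaped HTML in a field element; the field name is an assumption. */
    static String toField(String fieldName, String html) {
        return "<field name=\"" + escapeXml(fieldName) + "\">"
                + escapeXml(html) + "</field>";
    }

    public static void main(String[] args) {
        // Hypothetical input; prints the escaped field element.
        System.out.println(toField("content", "<p>Tom & \"Jerry\"</p>"));
    }
}
```

Escaping character by character in a single pass avoids double-encoding the & that the other replacements introduce, which is easy to get wrong with chained string replaces. The HTML arrives at the indexer intact once the XML is parsed, so any tag stripping still has to happen in the analyzer or before submission.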

Indexing HTML content... (Embed HTML into XML?)

2007-08-22 Thread Ravish Bhagdev
Hello, Sorry for the stupid question. I'm trying to index an HTML file as one of the fields in Solr. I've set up an appropriate analyzer in the schema, but I'm not sure how to add HTML content to Solr. Encapsulating HTML content within a field tag is obviously not valid. How do I add HTML content? Hope the query