Re: Indexing HTML files in SOLR

2010-06-20 Thread seesiddharth
Thank you so much for your help... I will try it...

Re: Indexing HTML files in SOLR

2010-06-19 Thread Lance Norskog
Ah! You need a SolrJ program that uses Tika to parse the files and upload the text. I think there is such a program already but do not know where it is. Lance On Thu, Jun 17, 2010 at 6:13 AM, seesiddharth wrote: > > Thank you so much for the reply...The link suggested by you is helpful but > the

Re: Indexing HTML files in SOLR

2010-06-17 Thread seesiddharth
Thank you so much for the reply... The link you suggested is helpful, but they have explained everything using the curl command, which I don't want to use. I was more interested in uploading the .html documents using an HTTP web request. So I have stored all .html files at one location & then create
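For anyone who wants the same thing without curl, here is a minimal sketch of posting HTML files to the ExtractingRequestHandler over plain HTTP. The base URL, the use of the file name as the document id, and the helper names are assumptions for illustration; it also assumes the handler is mapped at /update/extract and accepts the raw file as the request body.

```python
from pathlib import Path
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed location of a local Solr instance; adjust to your installation.
SOLR_EXTRACT_URL = "http://localhost:8983/solr/update/extract"


def build_extract_request(html_bytes: bytes, doc_id: str) -> Request:
    """Build a POST request carrying one HTML document as the raw body."""
    # literal.id sets the unique key; commit=true makes the document visible.
    params = urlencode({"literal.id": doc_id, "commit": "true"})
    return Request(
        f"{SOLR_EXTRACT_URL}?{params}",
        data=html_bytes,
        headers={"Content-Type": "text/html"},
        method="POST",
    )


def index_directory(directory: str) -> None:
    """Send every .html file in `directory` to Solr (hypothetical helper)."""
    for path in sorted(Path(directory).glob("*.html")):
        request = build_extract_request(path.read_bytes(), path.name)
        with urlopen(request) as response:  # raises on HTTP error codes
            print(path.name, response.status)
```

Calling index_directory on the folder holding the files sends one request per file; committing on every request is simple but slow for large batches.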

Re: Indexing HTML files in SOLR

2010-06-16 Thread Lance Norskog
This is the tool in Solr for indexing various kinds of content. After you learn the basics of indexing (see solr/example/exampledocs for samples), the ExtractingRequestHandler will make sense: http://wiki.apache.org/solr/ExtractingRequestHandler On Tue, Jun 15, 2010 at 12:35 AM, seesiddharth wro

Re: Indexing HTML

2010-06-10 Thread Lance Norskog
Looking at it again, there appears to be only one HTML stripper. Your alternative is to use the regex PatternReplace stuff with some custom patterns. Or make a stopword list of all HTML keywords. On Thu, Jun 10, 2010 at 8:00 AM, Blargy wrote: > > Do I even need to tidy/clean up the html if I use

Re: Indexing HTML

2010-06-10 Thread Blargy
Do I even need to tidy/clean up the html if I use the HTMLStripCharFilterFactory?

Re: Indexing HTML

2010-06-09 Thread Ken Krugler
On Jun 9, 2010, at 8:38pm, Blargy wrote: What is the preferred way to index html using DIH (my html is stored in a blob field in our database)? I know there is the built in HTMLStripTransformer but that doesn't seem to work well with malformed/incomplete HTML. I've created a custom tra

Re: Indexing HTML

2010-06-09 Thread Blargy
Wait... do you mean I should try the HTMLStripCharFilterFactory analyzer at index time? http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory

Re: Indexing HTML

2010-06-09 Thread Blargy
Does the HTMLStripChar apply at index time or query time? Would it matter to use one over the other? As a side question, if I want to perform highlighter summaries against this field do I need to store the whole field or just index it with TermVector.WITH_POSITIONS_OFFSETS?

Re: Indexing HTML

2010-06-09 Thread Lance Norskog
The HTMLStripChar variants are newer and might work better. On Wed, Jun 9, 2010 at 8:38 PM, Blargy wrote: > > What is the preferred way to index html using DIH (my html is stored in a > blob field in our database)? > > I know there is the built in HTMLStripTransformer but that doesn't seem to > w

Re: Indexing HTML document

2010-03-03 Thread György Frivolt
Thank you! That's even more than I wanted to know. ;) Georg On Tue, Mar 2, 2010 at 10:05 PM, Walter Underwood wrote: > You are in luck, because Avi Rappoport has just written a tutorial about > how to do this. It is available from Lucid Imagination: > > > http://www.lucidimagination.com/solutions/wh

Re: Indexing HTML document

2010-03-02 Thread Walter Underwood
You are in luck, because Avi Rappoport has just written a tutorial about how to do this. It is available from Lucid Imagination: http://www.lucidimagination.com/solutions/whitepapers/Indexing-Text-and-HTML-Files-with-Solr I've just started reviewing it, but knowing Avi, I expect it to be very he

Re: Indexing HTML document

2010-03-02 Thread Siddhant Goel
There is an HTML filter documented here, which might be of some help - http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory Control characters can be eliminated using code like this - http://bitbucket.org/cogtree/python-solr/src/tip/pythonsolr/pysolr.py#cl-44
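As a rough illustration of the control-character step, here is a minimal sketch. It is not the code behind the link above, and the function name is made up. It assumes the goal is to drop the characters below U+0020 that XML 1.0 forbids, keeping tab, line feed and carriage return.

```python
import re

# Characters below U+0020 that XML 1.0 does not allow.
# Tab (\x09), line feed (\x0a) and carriage return (\x0d) are legal and kept.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def strip_control_chars(text: str) -> str:
    """Return `text` with XML-illegal control characters removed."""
    return _CONTROL_CHARS.sub("", text)
```

Running this over each field value before building the update message avoids parser errors caused by stray bytes in scraped HTML.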

RE: Indexing HTML Content

2008-05-22 Thread Lance Norskog
amount of string processing it does, the fact that it is a Reader probably does not affect its performance. Cheers, Lance -Original Message- From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] Sent: Thursday, May 22, 2008 10:14 AM To: solr-user@lucene.apache.org Subject: Re: Indexing HTML

Re: Indexing HTML Content

2008-05-22 Thread Otis Gospodnetic
John, Solr already has some of this stuff: $ ff \*HTML\*java ./src/test/org/apache/solr/analysis/HTMLStripReaderTest.java ./src/java/org/apache/solr/analysis/HTMLStripStandardTokenizerFactory.java ./src/java/org/apache/solr/analysis/HTMLStripReader.java ./src/java/org/apache/solr/analysis/HTMLStr

Re: Indexing HTML Content

2008-05-22 Thread David Arpad Geller
Actually, it's very easy: http://us2.php.net/strip_tags I also store the data in a separate field with the html intact for display. In that case, I use urlencode on the string. David McBride, John wrote: Hello, In my application I wish to index articles which are stored in HTML format. Up
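For readers not using PHP, a similar effect can be sketched with Python's standard library. This is only an approximation of strip_tags, not a port of it, and the names are hypothetical. It keeps all text nodes, so the contents of script and style elements would also come through unless filtered separately.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect the text nodes of a document and ignore the markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # turns &amp; into & etc.
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html: str) -> str:
    """Return the text content of `html` with all tags removed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

The stripped text would go into the searchable field, while the original HTML is kept in a separate stored field for display, as described above.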

Re: Indexing HTML Content

2008-05-22 Thread solr
Hi, Maybe this one? http://htmlparser.sourceforge.net/ /Jimi Quoting "McBride, John" <[EMAIL PROTECTED]>: Hello, In my application I wish to index articles which are stored in HTML format. Upon indexing these the html gets stored along with the content of the article, which is undesirable.

Re: Indexing HTML

2007-10-04 Thread Mike Klaas
On 3-Oct-07, at 3:26 AM, Ravish Bhagdev wrote: Because of this I cannot present the resulting html in a webpage. Is it possible to strip out all HTML tags completely in result set? Would you recommend sending stripped out text to solr instead? But doesn't Solr use HTML features while searchin

Re: Indexing HTML

2007-10-03 Thread Ravish Bhagdev
Hi Erik, All, I escaped HTML text into entities before sending to Solr and indexing went fine. The problem now is that when I get back a snippet with highlighted text for the html field, it's not well formed, as the highlighting sometimes doesn't include the entire tag if present. For example:

Re: Indexing HTML

2007-08-27 Thread Erik Hatcher
On Aug 27, 2007, at 10:00 AM, Michael Kimsal wrote: What's odd about this is that the error seems to indicate that I did. Actually the error message looks like you escaped too much. You should _not_ escape , only the contents of it. Erik The full text (minus the stack trace)

Re: Indexing HTML

2007-08-27 Thread Michael Kimsal
What's odd about this is that the error seems to indicate that I did. The full text (minus the stack trace) was org.xmlpull.v1.XmlPullParserException: parser must be on START_TAG or TEXT to read text (position: START_TAG seen ...... @4:37) Or is that just a by

Re: Indexing HTML

2007-08-27 Thread Erik Hatcher
Michael, I think the issue is that you're not escaping the values. Send something like this to Solr instead: linktext Erik On Aug 27, 2007, at 9:29 AM, Michael Kimsal wrote: Hello I'm trying to index individual lines of an HTML file, a

Re: Indexing HTML

2007-08-27 Thread Thierry Collogne
I think you can use the HTMLStripWhitespaceTokenizerFactory. Look here : http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-031d5d370010955fdcc529d208395cd556f4a73e I hope this helps On 27/08/07, Michael Kimsal <[EMAIL PROTECTED]> wrote: > > Hello > > I'm trying to index individu

Re: Indexing HTML content... (Embed HTML into XML?)

2007-08-22 Thread Ravish Bhagdev
Thanks Jérôme! It seems to work now. I just hope the provided HTMLStripWhitespaceTokenizerFactory will strip the right tags now. I use Java and used HtmlEncoder provided in http://itext.ugent.be/library/api/ for encoding with success. (just in case someone happens to search this thread) Ravi

Re: Indexing HTML content... (Embed HTML into XML?)

2007-08-22 Thread Jérôme Etévé
You need to encode your HTML content so it can be included as a normal 'string' value in your XML element. As far as I remember, the only unsafe characters you have to encode as entities are: < -> &lt; > -> &gt; " -> &quot; & -> &amp; (google xml entities to be sure). I don't know what language you use, but fo
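The entity encoding described above can be sketched in Python as follows. The field name "body" and the helper name are assumptions for illustration. Only the field value is escaped; the surrounding field element stays literal XML.

```python
from xml.sax.saxutils import escape


def solr_field(name: str, html: str) -> str:
    """Wrap `html` in a Solr <field> element, escaping only the value."""
    # escape() handles &, < and > itself; the extra mapping covers quotes.
    value = escape(html, {'"': "&quot;"})
    # `name` is assumed to be a trusted, plain field name.
    return f'<field name="{name}">{value}</field>'


print(solr_field("body", '<a href="x">link & text</a>'))
```

Escaping the whole message, including the field tags, would make Solr see plain text where it expects elements, which is why only the value is passed through escape().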

Re: Indexing HTML and other doc types

2007-07-06 Thread Otis Gospodnetic
he.org Sent: Friday, July 6, 2007 2:19:21 AM Subject: Re: Indexing HTML and other doc types I guess I misread your original question. I believe Nutch would be the choice for crawling, however I do not know about its abilities for indexing other document types. If you needed to index multiple do

RE: Indexing HTML and other doc types

2007-07-06 Thread Teruhiko Kurosaka
Peter, I was playing with Nutch for quite some time before Solr, so I know Nutch better than Solr. Nutch has a plugin mechanism so that you can add a parser for a document type. It comes with parser plugins for most popular doc types (with varying degrees of international text support). My que

Re: Indexing HTML and other doc types

2007-07-05 Thread Peter Manis
I guess I misread your original question. I believe Nutch would be the choice for crawling, however I do not know about its abilities for indexing other document types. If you needed to index multiple document types such as PDF, DOC, etc and Nutch does not provide functionality to do so you woul

RE: Indexing HTML and other doc types

2007-07-05 Thread Teruhiko Kurosaka
Thank you, Otis and Peter, for your replies. > From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] > doc of some type -> parse content into various fields -> post to Solr I understand this part, but the question is who should do this. I was under the assumption that it's the Solr client's job to crawl the

Re: Indexing HTML and other doc types

2007-07-04 Thread Peter Manis
A coworker of mine posted the code that we used for adding pdf, doc, xls, etc documents into solr. You can find the files at the following location. https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel Just apply the patch and put the

Re: Indexing HTML and other doc types

2007-07-03 Thread Otis Gospodnetic
Kuro, doc of some type -> parse content into various fields -> post to Solr Even Nutch does the same - there is a title field, a content field, and so on (the exact names may be different). Of course, you can always just combine everything into a single content field. Otis . . . . . . . . . .
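The "parse content into various fields" step might look like the sketch below for an HTML document. The field names (id, title, content) and helper names are assumptions, not a fixed Solr schema, and the actual HTTP post is left out. The payload is a JSON array of documents, one of the formats current Solr versions accept at the update handler.

```python
import json
from html.parser import HTMLParser


class _FieldParser(HTMLParser):
    """Collect <title> text and all remaining text into separate buckets."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.in_title = False
        self.title = []
        self.content = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        (self.title if self.in_title else self.content).append(data)


def parse_into_fields(doc_id: str, html: str) -> dict:
    """Turn one HTML document into a flat dict of hypothetical Solr fields."""
    parser = _FieldParser()
    parser.feed(html)
    parser.close()
    return {
        "id": doc_id,
        "title": "".join(parser.title).strip(),
        # Collapse runs of whitespace left behind by the removed markup.
        "content": " ".join("".join(parser.content).split()),
    }


def to_update_payload(docs: list) -> str:
    """Serialise parsed documents as a JSON array for an update request."""
    return json.dumps(docs)
```

Combining everything into a single content field, as suggested above, would simply mean concatenating the title and content values before building the dict.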