Re: unable to figure out nutch type highlighting in solr....

2007-10-05 Thread Adrian Sutton
Named entity references are valid in XML. They just need to be declared before they are used[1], unless they are one of the built-in named entities &lt; &gt; &apos; &quot; or &amp; -- these are always valid when parsing with an XML parser. Correct, it was an offhand comment and I skipped over all the details
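The point about declared versus built-in entities can be checked with a short sketch. It assumes Python's standard-library ElementTree parser as a stand-in for whatever XML parser sits in front of Solr; the sample markup and variable names are illustrative only and do not come from the thread.

```python
import xml.etree.ElementTree as ET

# The five predefined entities parse without any declaration.
ok = ET.fromstring("<p>&lt; &gt; &apos; &quot; &amp;</p>")
predefined_text = ok.text

# An undeclared named entity such as &nbsp; is rejected by the parser.
try:
    ET.fromstring("<p>a&nbsp;b</p>")
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False

# Declaring the entity in an internal DTD subset makes the same reference valid.
declared = ET.fromstring(
    '<!DOCTYPE p [<!ENTITY nbsp "&#160;">]><p>a&nbsp;b</p>'
)
declared_text = declared.text
```

Numeric references such as &#160; sidestep the question entirely, which is consistent with the editor output described elsewhere in the thread.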

Re: unable to figure out nutch type highlighting in solr....

2007-10-05 Thread Mike Klaas
On 5-Oct-07, at 11:59 AM, Ravish Bhagdev wrote: But a different use-case might be for the highlighting to encompass the markup rather than just the text, e.g. Paris</span> which would have to be accomplished some other way. Yes, exactly. And I think nutch handles this somehow as I remember

Re: unable to figure out nutch type highlighting in solr....

2007-10-05 Thread Ravish Bhagdev
Thanks all for very valuable contributions, I understand these aspects of Solr much better now but...
> But a different use-case might be for the highlighting to encompass the markup rather than just the text, e.g.
> Paris
> which would have to be accomplished some other way.
Yes, exactly. And

Re: unable to figure out nutch type highlighting in solr....

2007-10-05 Thread J.J. Larrea
At 9:32 PM +1000 10/5/07, Adrian Sutton wrote:
> From what people are suggesting though you'd be better off converting to plain text before indexing it with Solr. Something like JTidy (http://jtidy.sf.net) can parse most HTML that's around and you can iterate over the DOM to extract the text f
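A rough sketch of the "convert to plain text before indexing" idea follows. It is not JTidy: it swaps in Python's standard-library html.parser, and the TextExtractor class and html_to_text helper are hypothetical names introduced only for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and ignores the contents of script/style."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML fragment with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Joining with a space keeps words in adjacent blocks apart, at the cost
    # of splitting a word that is interrupted by an inline tag.
    return " ".join(" ".join(parser.parts).split())


plain = html_to_text("<p>Hello <b>Paris</b> &amp; Rome</p><script>x=1</script>")
```

The resulting string is what would be sent to the index. Highlighting then operates on text alone, so offsets cannot land inside a tag.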

Re: unable to figure out nutch type highlighting in solr....

2007-10-05 Thread Steven Rowe
Adrian Sutton wrote:
> We didn't do anything at all to the HTML, the editor returns valid XHTML (using numeric entities, never named entities which aren't valid in XML and don't tend to work in XHTML) [...]
Named entity references are valid in XML. They just need to be declared before they ar

Re: unable to figure out nutch type highlighting in solr....

2007-10-05 Thread Walter Underwood
That is one seriously manly regex, but I'd recommend using the Tag Soup parser instead: http://ccil.org/~cowan/XML/tagsoup/
wunder
On 10/4/07 10:11 PM, "J.J. Larrea" <[EMAIL PROTECTED]> wrote:
> It uses a PatternTokenizerFactory with a RegEx that swallows runs of HTML- or XML-like tags:
>
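The regex from the quoted message is cut off in this listing, so the following is only an illustration of the general idea of a pattern that swallows runs of tags. It is deliberately simplistic and is not J.J. Larrea's expression. TAG_RUN and split_on_tags are hypothetical names, and the code uses Python's re module rather than Solr's PatternTokenizerFactory.

```python
import re

# One or more consecutive tag-like spans, together with surrounding whitespace.
# Assumption: tags contain no literal '>' inside attribute values.
TAG_RUN = re.compile(r"(?:\s*<[^>]+>\s*)+")


def split_on_tags(markup):
    """Split markup into text segments, treating each run of tags as a delimiter."""
    return [segment for segment in TAG_RUN.split(markup) if segment]


segments = split_on_tags("<p>Hello <b>Paris</b> in spring</p>")
```

A pattern like this breaks on comments, CDATA sections, and malformed tags, which is the practical argument for a tolerant parser such as Tag Soup.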

Re: unable to figure out nutch type highlighting in solr....

2007-10-05 Thread Adrian Sutton
One last one, when you send HTML to Solr, do you too replace special chars and tags with named entities? I did this and HTMLStripper doesn't seem to recognise the tags :-S While if I try and input HTML as is, the indexer throws exceptions (as having tags within XML tags is obviously not valid.
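The escaping question can be made concrete with a small sketch. It assumes the document is posted as a Solr XML update message (add/doc/field); the field names "id" and "content" and the solr_add_doc helper are hypothetical. Escaping only protects the markup while it travels inside the XML envelope, and the XML parser on the receiving side turns it back into the original tags.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET


def solr_add_doc(doc_id, html):
    """Build an XML update message with the HTML escaped as character data."""
    return (
        "<add><doc>"
        f'<field name="id">{escape(doc_id)}</field>'
        f'<field name="content">{escape(html)}</field>'
        "</doc></add>"
    )


payload = solr_add_doc("1", "<p>Paris &amp; Rome</p>")

# Parsing the message, as a receiving XML parser would, recovers the original HTML.
fields = ET.fromstring(payload).findall("./doc/field")
round_tripped = fields[1].text
```

Under this assumption the stored field value contains real tags again, so any HTML stripping has to happen at analysis time rather than depend on the escaped form.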

Re: unable to figure out nutch type highlighting in solr....

2007-10-05 Thread Ravish Bhagdev
Thanks Adrian, I'm very new to Solr myself so struggling a bit in initial stages... One last one, when you send HTML to Solr, do you too replace special chars and tags with named entities? I did this and HTMLStripper doesn't seem to recognise the tags :-S While if I try and input HTML as i

Re: unable to figure out nutch type highlighting in solr....

2007-10-05 Thread Adrian Sutton
On 05/10/2007, at 4:07 PM, Ravish Bhagdev wrote: (Query esp. Adrian): If you are indexing XHTML, do you replace tags with entities before giving it to Solr? If so, when you get back snippets, do you get tags or entities, or do you convert again to tags for presentation? What's the best way out?

Re: unable to figure out nutch type highlighting in solr....

2007-10-04 Thread Ravish Bhagdev
Thanks all for help. Just to make sure I understand correctly, am I right in summarizing this way then?: No significance of using HTML: Unlike Nutch, Solr doesn't parse HTML, so it ignores the anchors, titles etc and is not good for PageRank-esque indexing. HTMLAnalyser (by which you probably mean

Re: unable to figure out nutch type highlighting in solr....

2007-10-04 Thread J.J. Larrea
At 3:45 PM -0700 10/4/07, Mike Klaas wrote:
> I'm actually somewhat surprised that several people are interested in this but none have been sufficiently interested to implement a solution to contribute:
>
> http://issues.apache.org/jira/browse/SOLR-42
I just devised a workaround earlier in

Re: unable to figure out nutch type highlighting in solr....

2007-10-04 Thread Walter Underwood
Wow, well-formed HTML. That's a rare beast.
--wunder
On 10/4/07 7:08 PM, "Chris Hostetter" <[EMAIL PROTECTED]> wrote:
> if you have well-formed HTML documents, use an HTML parser to extract the real content.

Re: unable to figure out nutch type highlighting in solr....

2007-10-04 Thread Chris Hostetter
: In general, I don't recommend indexing HTML content straight to Solr. None of
: the Solr contributors do this so the use case hasn't received a lot of love.
I second that comment ... the HTML Stripping code was never intended to be an "HTML Parser" it was designed to be a workaround for deali

Re: unable to figure out nutch type highlighting in solr....

2007-10-04 Thread Adrian Sutton
On 05/10/2007, at 8:45 AM, Mike Klaas wrote: In general, I don't recommend indexing HTML content straight to Solr. None of the Solr contributors do this so the use case hasn't received a lot of love. We're indexing XHTML straight to Solr and it's working great so far. I'm actually somewhat

Re: unable to figure out nutch type highlighting in solr....

2007-10-04 Thread Mike Klaas
On 4-Oct-07, at 3:19 PM, Adrian Sutton wrote: I see that you're using the HTML analyzer. Unfortunately that does not play very well with highlighting at the moment. You may get garbled output. Is it the HTML analyzer or the fact that it's HTML content? If it's just the analyzer you could

Re: unable to figure out nutch type highlighting in solr....

2007-10-04 Thread Adrian Sutton
I see that you're using the HTML analyzer. Unfortunately that does not play very well with highlighting at the moment. You may get garbled output. Is it the HTML analyzer or the fact that it's HTML content? If it's just the analyzer you could always just copy the HTML content to another

Re: unable to figure out nutch type highlighting in solr....

2007-10-04 Thread Mike Klaas
On 2-Oct-07, at 12:52 AM, Ravish Bhagdev wrote: I see that you're using the HTML analyzer. Unfortunately that does not play very well with highlighting at the moment. You may get garbled output. -Mike

Re: unable to figure out nutch type highlighting in solr....

2007-10-04 Thread Mike Klaas
On 2-Oct-07, at 12:52 AM, Ravish Bhagdev wrote: I have tried very hard to follow documentation and forums that try to answer questions about how to return snippets with highlights for relevant searched term using Solr (as nutch does with such ease). I will be really grateful if someone can guid