Named entity references are valid in XML. They just need to be declared
before they are used[1], unless they are one of the builtin named
entities &lt; &gt; &apos; &quot; or &amp; -- these are always valid when
parsing with an XML parser.
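A minimal sketch of the point above, using Python's standard-library XML parser. The entity name `solr` and the sample documents are made up for illustration. It shows that the five predefined entities parse without any declaration, that a named entity declared in the internal DTD subset is expanded, and that an undeclared one is rejected.

```python
import xml.etree.ElementTree as ET

# The five predefined entities need no declaration.
builtin = ET.fromstring("<p>&lt;b&gt; &amp; &apos;x&apos; &quot;y&quot;</p>")

# A named entity declared in the internal DTD subset is expanded by the parser.
# "solr" is a hypothetical entity name chosen for this example.
declared = ET.fromstring(
    '<!DOCTYPE p [<!ENTITY solr "Apache Solr">]><p>indexed by &solr;</p>'
)


def parses(xml_text):
    """Return True if xml_text is accepted by the parser, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# The same reference without a declaration is an error.
undeclared_ok = parses("<p>indexed by &solr;</p>")

print(builtin.text)
print(declared.text)
print(undeclared_ok)
```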
Correct, it was an offhand comment and I skipped over all the details
On 5-Oct-07, at 11:59 AM, Ravish Bhagdev wrote:
But a different use-case might be for the highlighting to encompass
the markup rather than just the text, e.g.
Parisspan>
which would have to be accomplished some other way.
Yes, exactly. And I think nutch handles this somehow as I remember
Thanks all for very valuable contributions, I understand these aspects
of Solr much better now
but...
>But a different use-case might be for the highlighting to encompass the markup rather than
>just the text, e.g.
> Paris
>which would have to be accomplished some other way.
Yes, exactly. And
At 9:32 PM +1000 10/5/07, Adrian Sutton wrote:
>From what people are suggesting though you'd be better off converting to plain
>text before indexing it with Solr. Something like JTidy (http://jtidy.sf.net)
>can parse most HTML that's around and you can iterate over the DOM to extract
>the text f
Adrian Sutton wrote:
> We didn't do anything at all to the HTML, the editor returns valid XHTML
> (using numeric entities, never named entities which aren't valid in XML
> and don't tend to work in XHTML) [...]
Named entity references are valid in XML. They just need to be declared
before they ar
That is one seriously manly regex, but I'd recommend using the Tag Soup
parser instead:
http://ccil.org/~cowan/XML/tagsoup/
wunder
On 10/4/07 10:11 PM, "J.J. Larrea" <[EMAIL PROTECTED]> wrote:
> It uses a PatternTokenizerFactory with a RegEx that swallows runs of HTML- or
> XML-like tags:
>
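The regex from the quoted message is not shown here, so the following is only a sketch of the general idea in Python: split on a pattern that swallows runs of tag-like text (plus whitespace) and keep what lies between. The pattern `TAG_RUN` and the function `tokenize` are hypothetical, not the expression from the original post.

```python
import re

# Hypothetical pattern: one or more tag-like chunks with surrounding
# whitespace, or plain whitespace on its own. Everything it matches is
# treated as a token separator.
TAG_RUN = re.compile(r"(?:\s*<[^>]*>\s*)+|\s+")


def tokenize(text):
    """Split text on tag runs and whitespace, dropping empty tokens."""
    return [token for token in TAG_RUN.split(text) if token]


tokens = tokenize("<p>Hello <b>big</b> world</p>")
print(tokens)
```

A pattern like this is easily fooled by a `>` inside an attribute value or a comment, which is one reason a real parser is the safer choice.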
One last one, when you send HTML to solr, do you too replace special
chars and tags with named entities? I did this and HTMLStripper
doesn't seem to recognise the tags :-S While if I try and input
HTML as is, the indexer throws exceptions (as having tags within XML tags
is obviously not valid).
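One way to avoid the exceptions described above is to let an XML library do the escaping when the update message is built, rather than replacing characters by hand. The sketch below uses Python's standard library and assumes Solr's XML update format (`<add><doc><field name="...">`). The field names `id` and `content` and the helper `build_add_document` are hypothetical and depend on your schema. It only shows that the markup survives the XML round trip unchanged. How HTMLStripper then treats the decoded value is a separate question this example does not answer.

```python
import xml.etree.ElementTree as ET


def build_add_document(doc_id, html):
    """Build a Solr-style <add> message; the serializer escapes the HTML."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    # "id" and "content" are assumed field names, not taken from the thread.
    ET.SubElement(doc, "field", name="id").text = doc_id
    ET.SubElement(doc, "field", name="content").text = html
    return ET.tostring(add, encoding="unicode")


html = '<p>Visit <a href="/paris">Paris</a> &amp; Rome</p>'
payload = build_add_document("doc-1", html)

# An XML parser on the receiving side decodes the field back to the
# original markup, so no manual entity replacement is needed.
round_trip = ET.fromstring(payload).find("doc/field[@name='content']").text

print(payload)
```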
Thanks Adrian, I'm very new to Solr myself so struggling a bit in
initial stages...
One last one, when you send HTML to solr, do you too replace special
chars and tags with named entities? I did this and HTMLStripper
doesn't seem to recognise the tags :-S While if I try and input
HTML as i
On 05/10/2007, at 4:07 PM, Ravish Bhagdev wrote:
(Query esp. Adrian):
If you are indexing XHTML, do you replace tags with entities before
giving it to Solr? If so, when you get back snippets, do you get tags
or entities, or do you convert again to tags for presentation? What's
the best way out?
Thanks all for help.
Just to make sure I understand correctly, am I right in summarizing
this way, then?
No significance of using HTML: Unlike Nutch, Solr doesn't parse HTML,
so it ignores the anchors, titles etc. and is not good for
PageRank-esque indexing.
HTMLAnalyser (by which you probably mean
At 3:45 PM -0700 10/4/07, Mike Klaas wrote:
>I'm actually somewhat surprised that several people are interested in this but
>none have been sufficiently interested to implement a solution to
>contribute:
>
>http://issues.apache.org/jira/browse/SOLR-42
I just devised a workaround earlier in
Wow, well-formed HTML. That's a rare beast. --wunder
On 10/4/07 7:08 PM, "Chris Hostetter" <[EMAIL PROTECTED]> wrote:
> if you have wellformed HTML documents, use an HTML parser to extract the
> real content.
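As a rough illustration of extracting the text before indexing, here is a sketch using Python's standard-library `html.parser` instead of the Java tools named elsewhere in this thread. The class `TextExtractor` and the function `html_to_text` are hypothetical names. The sketch skips `script` and `style` content and collapses whitespace. It is not a substitute for a parser that repairs badly broken markup.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect character data, ignoring tags and script/style content."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Chunks are joined with a space, so a word split by an inline tag
        # (e.g. "Par<b>is</b>") would come out as two words.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    "<html><head><style>p{}</style></head>"
    "<body><p>Hello <b>Paris</b> &amp; Rome</p></body></html>"
)
print(html_to_text(sample))
```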
: In general, I don't recommend indexing HTML content straight to Solr. None of
: the Solr contributors do this so the use case hasn't received a lot of love.
I second that comment ... the HTML Stripping code was never intended to be
an "HTML Parser"; it was designed to be a workaround for deali
On 05/10/2007, at 8:45 AM, Mike Klaas wrote:
In general, I don't recommend indexing HTML content straight to
Solr. None of the Solr contributors do this so the use case hasn't
received a lot of love.
We're indexing XHTML straight to Solr and it's working great so far.
I'm actually somewhat
On 4-Oct-07, at 3:19 PM, Adrian Sutton wrote:
I see that you're using the HTML analyzer. Unfortunately that
does not play very well with highlighting at the moment. You may
get garbled output.
Is it the HTML analyzer or the fact that it's HTML content? If it's
just the analyzer you could
I see that you're using the HTML analyzer. Unfortunately that does
not play very well with highlighting at the moment. You may get
garbled output.
Is it the HTML analyzer or the fact that it's HTML content? If it's
just the analyzer you could always just copy the HTML content to
another
On 2-Oct-07, at 12:52 AM, Ravish Bhagdev wrote:
I see that you're using the HTML analyzer. Unfortunately that does
not play very well with highlighting at the moment. You may get
garbled output.
-Mike
On 2-Oct-07, at 12:52 AM, Ravish Bhagdev wrote:
I have tried very hard to follow documentation and forums that try to
answer questions about how to return snippets with highlights for
relevant search terms using Solr (as Nutch does with such ease).
I will be really grateful if someone can guid