Thanks
I use this solution:
put <![CDATA[ Here my HTML code ]]> in the XML to be indexed and
it works; nothing needs to change in the XSL.
In the schema I use this fieldType:
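A minimal sketch of the CDATA approach in Python, assuming the posting file is generated by a script. The helper names `cdata` and `solr_field` are made up for illustration, and the field name `storyFullText` is the one from the schema quoted later in this thread. The one subtlety is that the HTML itself must not contain the sequence `]]>`, so the sketch splits it across two CDATA sections:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr


def cdata(text: str) -> str:
    """Wrap text in a CDATA section.

    A literal ']]>' inside the text would end the section early,
    so it is split across two adjacent CDATA sections.
    """
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


def solr_field(name: str, html: str) -> str:
    """Build one <field> element (hypothetical helper) holding raw HTML."""
    return f"<field name={quoteattr(name)}>{cdata(html)}</field>"


# Round trip: an XML parser hands back the original HTML unchanged.
sample_html = '<div class="paragraph">my text<br/></div>'
field_xml = solr_field("storyFullText", sample_html)
parsed_text = ET.fromstring(field_xml).text
```

After parsing, `parsed_text` holds the same string as `sample_html`, which is what the Solr update handler receives as the field value.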
<fieldType name="html" class="solr.TextField"
positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="1"
catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt"/>
<filter class="solr.ISOLatin1AccentFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
----------
Now my question:
I created a field to index only the text of this HTML code.
I created this field type:
<fieldType name="htmlTxt" class="solr.TextField"
positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="1"
catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt"/>
<filter class="solr.ISOLatin1AccentFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
Everything works (the div and p tags are removed), but some
<strong>nnn</strong> or <br/> tags are still in the text after
indexing.
If you have any idea how to solve this problem, that would be great.
Thanks
S. Christin
-------------
On 25 Sept. 07 at 13:14, Thorsten Scherler wrote:
On Tue, 2007-09-25 at 12:06 +0100, Jérôme Etévé wrote:
If I understand, you want to keep the raw html code in solr like that
(in your posting xml file):
<field name="storyFullText">
<html></html>
</field>
I think you should encode your content to protect these xml entities:
< -> &lt;
> -> &gt;
" -> &quot;
& -> &amp;
If you use perl, have a look at HTML::Entities.
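For readers not using Perl, the same escaping can be sketched with the Python standard library. This is only an illustration: the function name `escape_for_solr` is made up, and `xml.sax.saxutils.escape` is used here in place of HTML::Entities. It handles `&`, `<` and `>` by itself; the double quote is added through its optional mapping argument.

```python
from xml.sax.saxutils import escape


def escape_for_solr(html: str) -> str:
    """Escape the four characters listed above so the HTML can sit
    inside a <field> element as plain character data."""
    # escape() replaces & first, then > and <, then the extra mapping.
    return escape(html, {'"': "&quot;"})


escaped = escape_for_solr('<div class="paragraph">a & b</div>')
```

Because `&` is replaced before anything else, the entities produced for the other characters are not escaped a second time.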
AFAIR you cannot use tags; they always get transformed to
entities. The solution is to have an XSL transformation after the
response that transforms the entities back to tags.
Have a look at the thread
http://marc.info/?t=116775837900001&r=1&w=2
and especially at
http://marc.info/?l=solr-user&m=116782664828926&w=2
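As an alternative to XSL on the client side (not what the threads above describe), any XML parser decodes the entities back to tags when it reads the field value. A minimal Python sketch, assuming a response shaped like Solr's XML output for a multiValued field; `SAMPLE_RESPONSE` is invented sample data and `stored_values` is a hypothetical helper:

```python
import xml.etree.ElementTree as ET

# Invented sample: the stored HTML appears entity-encoded in the raw response.
SAMPLE_RESPONSE = """<response>
  <result name="response" numFound="1" start="0">
    <doc>
      <arr name="storyFullText">
        <str>&lt;div class="paragraph"&gt;my text&lt;br/&gt;&lt;/div&gt;</str>
      </arr>
    </doc>
  </result>
</response>"""


def stored_values(response_xml: str, field: str) -> list:
    """Return the decoded values of a multiValued field from a response."""
    root = ET.fromstring(response_xml)
    # The parser turns &lt; and &gt; back into < and > in .text
    return [s.text for s in root.findall(f".//arr[@name='{field}']/str")]


values = stored_values(SAMPLE_RESPONSE, "storyFullText")
```

Here `values` contains the HTML with real tags again, so no extra decoding step is needed once the response has been parsed rather than viewed as raw source.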
HTH
salu2
On 9/25/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]>
wrote:
Hello,
I've got a problem with HTML code that is embedded in an XML file.
Sample source:
<content>
<stories>
<div class="storyTitle">
Les débats
</div>
<div class="storyIntroductionText">
Le premier tour des élections fédérales
se déroulera le 21
octobre prochain. D'ici là, La 1ère vous propose plusieurs rendez-
vous, dont plusieurs grands débats à l'enseigne de Forums.
</div>
<div class="paragraph">
<div class="paragraphTitle"/>
<div class="paragraphText">
my para textehere
<br/>
<br/>
Vous trouverez sur cette page
toutes les dates et les heures de
ces différents rendez-vous ainsi que le nom et les partis des
débatteurs. De plus, vous pourrez également écouter ou
réécouter
l'ensemble de ces émissions.
</div>
</div>
....
---------
When I make a query on Solr, I get something like this in the
source code of the XML result:
<td xmlns="http://www.w3.org/1999/xhtml">
<span class="markup"><</span>
<span class="start-tag">div</span>
<span class="attribute-name">class</span>
<span class="markup">=</span>
<span class="attribute-value">"paragraph"</span>
<span class="markup">></span><div class="expander-content">
<div class="indent"><span class="markup"><</span>
<span class="start-tag">div</span>
<span class="attribute-name">class</span>
<span class="markup">=</span>
<span class="attribute-value">"paragraphTitle"</span>
<span class="markup">/></span></div><table><tr>
<td class="expander">−<div class="spacer"/>
</td><td><span class="markup"><</span>
...
It is not exactly what I want. I want to keep the HTML tags as they
are, without any formatting.
The br and a tags are well formed in the XML and JSON results, but
the div tags are not kept.
---------
In schema.xml I've got this for the HTML content:
<fieldType name="html" class="solr.TextField" />
<field name="storyFullText" type="html" indexed="true"
stored="true" multiValued="true"/>
---------
Any help would be appreciated.
Thanks in advance.
S. Christin
--
Thorsten Scherler
thorsten.at.apache.org
Open Source Java consulting, training and
solutions