Thanks

I use this solution:

I wrap the HTML in <![CDATA[ my HTML code here ]]> in the XML to be indexed, and it works; nothing needs to change in the XSL.
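
For illustration, a posting file using this approach might look like the sketch below. The "id" field and its value are hypothetical (use whatever unique key your schema defines); storyFullText is the field from the original question.

```xml
<add>
  <doc>
    <!-- "id" and "story-1" are hypothetical; use your schema's unique key -->
    <field name="id">story-1</field>
    <!-- CDATA keeps the markup as literal text, so no entity escaping is needed.
         The content must not itself contain the CDATA end marker. -->
    <field name="storyFullText"><![CDATA[<div class="paragraph"><div class="paragraphText">my para textehere<br/><br/></div></div>]]></field>
  </doc>
</add>
```

The XML parser hands the CDATA content to Solr as plain character data, so the stored value is the HTML string itself.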

In the schema I use this fieldType:

<fieldType name="html" class="solr.TextField" positionIncrementGap="100">
        <analyzer>
                <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
                <filter class="solr.ISOLatin1AccentFilterFactory"/>
                <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
        </analyzer>
</fieldType>

----------
Now a question:
I created a field to index only the text of this HTML code.

I created a field type:

<fieldType name="htmlTxt" class="solr.TextField" positionIncrementGap="100">
        <analyzer>
                <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
                <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
                <filter class="solr.ISOLatin1AccentFilterFactory"/>
                <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
        </analyzer>
</fieldType>

Everything works (the div and p tags are removed), but some <strong>nnn</strong> or <br/> tags are still in the text after indexing.

If you have any idea how to solve this problem, that would be great.

Thanks

S. Christin



-------------


On 25 Sept. 2007 at 13:14, Thorsten Scherler wrote:

On Tue, 2007-09-25 at 12:06 +0100, Jérôme Etévé wrote:
If I understand correctly, you want to keep the raw HTML code in Solr, like this
(in your posting XML file):

<field name="storyFullText">
  <html></html>
</field>

I think you should encode your content to protect these XML special characters:
<  ->  &lt;
>  ->  &gt;
"  ->  &quot;
&  ->  &amp;

If you use perl, have a look at HTML::Entities.
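
As a sketch of what the encoded posting file could look like, assuming the same storyFullText field and a hypothetical "id" unique key:

```xml
<add>
  <doc>
    <!-- "id" and "story-1" are hypothetical; use your schema's unique key -->
    <field name="id">story-1</field>
    <!-- Every < and > of the embedded HTML is written as an entity, so the
         XML parser sees text rather than child elements. Quotes only need
         escaping inside attribute values, not in element content. -->
    <field name="storyFullText">&lt;div class="paragraph"&gt;my para textehere&lt;br/&gt;&lt;/div&gt;</field>
  </doc>
</add>
```

After parsing, the field value is the original HTML string; the entities exist only in the serialized XML.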

AFAIR you cannot use tags; they always get transformed to
entities. The solution is to have an XSL transformation after the
response that transforms the entities back to tags.
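
A minimal sketch of such a stylesheet, assuming the escaped HTML is stored in storyFullText and that the response uses Solr's standard XML layout (response/result/doc). Because the field is declared multiValued, it is assumed to come back as an arr of str elements. Note that disable-output-escaping is an optional XSLT 1.0 feature and only takes effect when the processor serializes the output itself, so check it against the processor you use.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html" encoding="UTF-8"/>

  <!-- Wrap all returned documents in a simple HTML page -->
  <xsl:template match="/">
    <html>
      <body>
        <xsl:apply-templates select="response/result/doc"/>
      </body>
    </html>
  </xsl:template>

  <xsl:template match="doc">
    <div class="story">
      <!-- Assumed layout for a multiValued field: <arr name="..."><str>...</str></arr> -->
      <xsl:for-each select="arr[@name='storyFullText']/str">
        <!-- Emit the stored text without re-escaping < and >,
             so the browser receives real tags again -->
        <xsl:value-of select="." disable-output-escaping="yes"/>
      </xsl:for-each>
    </div>
  </xsl:template>
</xsl:stylesheet>
```

Since the stored markup is emitted unescaped, this is only appropriate when the indexed HTML comes from a trusted source.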

Have a look at the thread
http://marc.info/?t=116775837900001&r=1&w=2
and especially at
http://marc.info/?l=solr-user&m=116782664828926&w=2

HTH

salu2



On 9/25/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
Hello,

I have a problem with HTML code that is embedded in an XML file:

Sample source:

<content>
        <stories>
                <div class="storyTitle">
                         Les débats
                </div>
                <div class="storyIntroductionText">
Le premier tour des élections fédérales se déroulera le 21
octobre prochain. D'ici là, La 1ère vous propose plusieurs rendez-
vous, dont plusieurs grands débats à l'enseigne de Forums.
                </div>
                <div class="paragraph">
                        <div class="paragraphTitle"/>
                        <div class="paragraphText">
                                my para textehere
                                <br/>
                                <br/>
Vous trouverez sur cette page toutes les dates et les heures de
ces différents rendez-vous ainsi que le nom et les partis des
débatteurs. De plus, vous pourrez également écouter ou réécouter
l'ensemble de ces émissions.
                        </div>
                </div>
....
---------
When I make a query on Solr, I get something like this in the
source of the XML result:

<td xmlns="http://www.w3.org/1999/xhtml">
<span class="markup">&lt;</span>
<span class="start-tag">div</span>
<span class="attribute-name">class</span>
<span class="markup">=</span>
<span class="attribute-value">"paragraph"</span>
<span class="markup">&gt;</span><div class="expander-content">
<div class="indent"><span class="markup">&lt;</span>
<span class="start-tag">div</span>
<span class="attribute-name">class</span>
<span class="markup">=</span>
<span class="attribute-value">"paragraphTitle"</span>
<span class="markup">/&gt;</span></div><table><tr>
<td class="expander">−<div class="spacer"/>
</td><td><span class="markup">&lt;</span>
...

This is not exactly what I want. I want to keep the HTML tags as they
are, without any added formatting.

The br and a tags are well formed in the XML and JSON results, but
the div tags are not kept.
---------
In schema.xml I have this for the HTML content:

<fieldType name="html" class="solr.TextField" />

  <field name="storyFullText" type="html" indexed="true"
stored="true" multiValued="true"/>

---------

Any help would be appreciated.

Thanks in advance.

S. Christin








--
Thorsten Scherler thorsten.at.apache.org Open Source Java consulting, training and solutions

