Re: Problem with html code inside xml

[EMAIL PROTECTED] Tue, 02 Oct 2007 07:16:08 -0700

Thanks

I use this solution:

Put <![CDATA[ Here my html code ]]> in the XML to be indexed and it works; nothing needs to change in the XSL.
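To illustrate why the CDATA approach works, here is a minimal Python sketch using only the standard library. The field name storyFullText is taken from the original post quoted further down; the <add><doc> layout and the helper name cdata_field are assumptions made for illustration. An XML parser returns the content of a CDATA section verbatim, tags included, so the HTML reaches the field value unchanged.

```python
import xml.etree.ElementTree as ET


def cdata_field(name, html):
    """Wrap raw HTML in a CDATA section inside a <field> element (hypothetical helper)."""
    # A CDATA section cannot contain "]]>", so split that sequence if it occurs.
    safe = html.replace("]]>", "]]]]><![CDATA[>")
    return '<field name="%s"><![CDATA[%s]]></field>' % (name, safe)


html = '<div class="paragraph">my text<br/>more</div>'
doc = "<add><doc>%s</doc></add>" % cdata_field("storyFullText", html)

# Parsing the document gives back the original HTML string as the field text.
parsed = ET.fromstring(doc).find("doc/field").text
print(parsed == html)
```

The only special case is the sequence "]]>" inside the HTML itself, which would end the CDATA section early; the sketch splits it across two adjacent sections.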


In the schema I use this fieldType

<fieldType name="html" class="solr.TextField" positionIncrementGap="100">
        <analyzer>
                <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
                <filter class="solr.ISOLatin1AccentFilterFactory"/>
                <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
        </analyzer>
</fieldType>

----------
Now question:
I created a field to index only the text for this html code.

I created a field type:

<fieldType name="htmlTxt" class="solr.TextField" positionIncrementGap="100">
        <analyzer>
                <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
                <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
                <filter class="solr.ISOLatin1AccentFilterFactory"/>
                <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
        </analyzer>
</fieldType>

Everything works (the div and p tags are removed), but some <strong>nnn</strong> or <br/> tags are still in the text after indexing.


If you have any idea how to solve this problem, that would be great.
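One possible workaround, not confirmed anywhere in this thread, is to strip the markup on the client before posting and send the plain text to the text-only field. The sketch below uses Python's standard html.parser module; the class name TextOnly and the function strip_html are hypothetical, and replacing each tag with a space is a design choice so that a word directly followed by an inline tag does not fuse with the text inside it.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect text nodes and drop every tag (illustrative helper, not part of Solr)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Replace each tag with a space so "word<strong>bold</strong>" does not fuse words.
        self.parts.append(" ")

    def handle_endtag(self, tag):
        self.parts.append(" ")

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> also become a single space.
        self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(html):
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    # Collapse the runs of whitespace left behind by the removed tags.
    return " ".join("".join(parser.parts).split())


print(strip_html("<p>some<strong>nnn</strong> text<br/>next line</p>"))
```

This moves the stripping out of the analyzer chain, so the htmlTxt field type could then use a plain whitespace tokenizer; whether that trade-off is acceptable depends on the indexing setup.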

Thanks

S. Christin



-------------


On 25 Sept. 2007, at 13:14, Thorsten Scherler wrote:

On Tue, 2007-09-25 at 12:06 +0100, Jérôme Etévé wrote:

If I understand, you want to keep the raw html code in solr like that
(in your posting xml file):

<field name="storyFullText">
  <html></html>
</field>

I think you should encode your content to protect these xml entities:
<  ->  &lt;
>  ->  &gt;
"  ->  &quot;
&  ->  &amp;

If you use perl, have a look at HTML::Entities.
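For readers not using Perl, the same encoding can be sketched in Python with xml.sax.saxutils.escape from the standard library. The sample HTML string and the <add><doc> wrapper are assumptions for illustration; only the field name storyFullText comes from the thread.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

html = '<div class="storyTitle">A & B</div>'

# escape() handles &, < and > by default; the extra mapping adds the double quote.
encoded = escape(html, {'"': "&quot;"})
doc = '<add><doc><field name="storyFullText">%s</field></doc></add>' % encoded

print(encoded)

# An XML parser decodes the entities again, so the field value is the original HTML.
decoded = ET.fromstring(doc).find("doc/field").text
print(decoded == html)
```

The ampersand is replaced first, so the entities produced for the other characters are not escaped a second time.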


AFAIR you cannot use tags; they are always transformed to
entities. The solution is to have an XSL transformation after the
response that transforms the entities back to tags.
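If the response is consumed by a program rather than by an XSL stylesheet, the same decoding happens when the client parses the XML. The following Python sketch shows that alternative, not the XSL approach itself; the response fragment is a hypothetical, shortened example of what a multiValued stored field might look like.

```python
import xml.etree.ElementTree as ET

# Hypothetical response fragment: the stored HTML comes back entity-encoded.
response = (
    '<response><result name="response" numFound="1" start="0"><doc>'
    '<arr name="storyFullText">'
    '<str>&lt;div class="paragraph"&gt;text&lt;br/&gt;&lt;/div&gt;</str>'
    '</arr></doc></result></response>'
)

root = ET.fromstring(response)

# Parsing turns &lt; and &gt; back into < and >, giving the stored HTML string.
values = [s.text for s in root.findall("result/doc/arr[@name='storyFullText']/str")]
print(values[0])
```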

Have a look at the thread
http://marc.info/?t=116775837900001&r=1&w=2
and especially at
http://marc.info/?l=solr-user&m=116782664828926&w=2

HTH

salu2

On 9/25/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:

Hello,

I have a problem with HTML code that is embedded in an XML file.

Sample source:

<content>
        <stories>
                <div class="storyTitle">
                         Les débats
                </div>
                <div class="storyIntroductionText">

                        Le premier tour des élections fédérales se déroulera le 21
                        octobre prochain. D'ici là, La 1ère vous propose plusieurs
                        rendez-vous, dont plusieurs grands débats à l'enseigne de Forums.
                </div>
                <div class="paragraph">
                        <div class="paragraphTitle"/>
                        <div class="paragraphText">
                                my para texte here
                                <br/>
                                <br/>

                                Vous trouverez sur cette page toutes les dates et les heures de
                                ces différents rendez-vous ainsi que le nom et les partis des
                                débatteurs. De plus, vous pourrez également écouter ou réécouter
                                l'ensemble de ces émissions.
                        </div>
                </div>
....
---------
When I make a query on Solr, I get something like this in the
source code of the XML result:

<td xmlns="http://www.w3.org/1999/xhtml">
<span class="markup">&lt;</span>
<span class="start-tag">div</span>
<span class="attribute-name">class</span>
<span class="markup">=</span>
<span class="attribute-value">"paragraph"</span>
<span class="markup">&gt;</span><div class="expander-content">
<div class="indent"><span class="markup">&lt;</span>
<span class="start-tag">div</span>
<span class="attribute-name">class</span>
<span class="markup">=</span>
<span class="attribute-value">"paragraphTitle"</span>
<span class="markup">/&gt;</span></div><table><tr>
<td class="expander">−<div class="spacer"/>
</td><td><span class="markup">&lt;</span>
...

This is not exactly what I want. I want to keep the HTML tags, that's all,
without formatting.

The br tags and a tags are well formed in the XML and JSON results, but
the div tags are not kept.
---------
In the schema.xml I've got this for the html content

<fieldType name="html" class="solr.TextField" />

  <field name="storyFullText" type="html" indexed="true"
stored="true" multiValued="true"/>

---------

Any help would be appreciated.

Thanks in advance.

S. Christin

--

Thorsten Scherler
thorsten.at.apache.org
Open Source Java consulting, training and solutions
