Re: CDATA removal from attr_content

Shawn Heisey Sun, 20 Mar 2016 04:26:26 -0700

On 3/16/2016 12:28 AM, Laszlo Kiss wrote:
> When I index an HTML page the attr_content field shows "//<![CDATA..." stuff 
> (part of a <script> tag in the original HTML page). I'm sure the problem 
> is with my solrconfig.xml. Here is the section I think I'm looking to adjust.
>
>   <requestHandler name="/update/extract"
>                   startup="lazy"
>                   class="solr.extraction.ExtractingRequestHandler" >
>     <lst name="defaults">
>       <str name="lowernames">true</str>
>       <str name="fmap.meta">ignored_</str>
>       <str name="fmap.content">_text_</str>
>       <str name="fmap.script">ignored_</str>  <!-- my change -->
>       <str name="captureAttr">true</str>              <!-- my change -->
>     </lst>
>   </requestHandler>
>
> The reference manual also mentions <script> and CDATA in connection with 
> HTMLStripCharFilterFactory but the page does not explain how to apply it in 
> the configuration file.


HTMLStripCharFilterFactory is an analysis component.  It goes in the
fieldType in your schema.

https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory

Usually a filter like this needs to be in both the index and query
analysis to work as expected, but since your users are not likely to
enter a bunch of XML/HTML as their query, it might work if only placed
on the index side.
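
As a rough sketch of what that could look like in the schema, the fragment
below puts the char filter on the index-side analyzer only. The fieldType
name "text_html_stripped" is made up for illustration, the tokenizer and
lowercase filter are just common choices, and the attr_* dynamicField is an
assumption about how attr_content is defined in your schema, so adjust all
of these to match what you actually have.

```xml
<!-- Hypothetical field type; the name is an assumption, not from the original config. -->
<fieldType name="text_html_stripped" class="solr.TextField" positionIncrementGap="100">
  <!-- Index side: char filters run before the tokenizer and strip markup. -->
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- Query side: same chain without the HTML strip char filter. -->
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Assumed mapping: point the attr_* fields at the new type. -->
<dynamicField name="attr_*" type="text_html_stripped"
              indexed="true" stored="true" multiValued="true"/>
```

Keep in mind that analysis only changes the indexed terms used for
searching. The stored value that comes back in search results is the
original input, so a change like this requires a reindex and will not by
itself alter what the field displays.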

After you click on the link above and read the section on the HTML strip
filter, scroll up to the top of the page and read the general information.

Thanks,
Shawn
