When I index an HTML page, the attr_content field shows "//<![CDATA..." text 
(part of a <script> tag in the original HTML page). I'm sure the problem is 
with my solrconfig.xml. Here is the section I think I need to adjust.

  <requestHandler name="/update/extract"
                  startup="lazy"
                  class="solr.extraction.ExtractingRequestHandler" >
    <lst name="defaults">
      <str name="lowernames">true</str>
      <str name="fmap.meta">ignored_</str>
      <str name="fmap.content">_text_</str>
      <str name="fmap.script">ignored_</str>    <!-- my change -->
      <str name="captureAttr">true</str>                <!-- my change -->
    </lst>
  </requestHandler>

The reference manual also mentions <script> and CDATA in connection with 
HTMLStripCharFilterFactory, but the page does not explain how to apply it in 
the configuration file.
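
For what it's worth, my understanding from other examples is that a char 
filter goes inside the analyzer of a field type in the schema, not in 
solrconfig.xml. A minimal sketch of what I assume that would look like (the 
field type name "text_html_stripped" is made up, and I have not tested this):

  <fieldType name="text_html_stripped" class="solr.TextField">
    <analyzer>
      <!-- strips HTML markup from the text before it is tokenized -->
      <charFilter class="solr.HTMLStripCharFilterFactory"/>
      <tokenizer class="solr.StandardTokenizerFactory"/>
    </analyzer>
  </fieldType>

If I read it correctly, this would only change the indexed terms, not the 
stored value that comes back in results, so it may not be what removes the 
CDATA text from attr_content.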

Google led me to a class called "HTMLStripReader". I think I'd like to apply 
that to the "attr_content" field, but I don't know how to apply it either.

Version: Solr 5.5 (the previous version behaves the same way)

Any insights would be appreciated.