When I index an HTML page, the attr_content field contains "//<![CDATA..." text
(part of a <script> tag in the original HTML page). I'm sure the problem is
with my solrconfig.xml. Here is the section I think I need to adjust:
<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler" >
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="fmap.meta">ignored_</str>
    <str name="fmap.content">_text_</str>
    <str name="fmap.script">ignored_</str> <!-- my change -->
    <str name="captureAttr">true</str> <!-- my change -->
  </lst>
</requestHandler>
The reference manual also mentions <script> and CDATA in connection with
HTMLStripCharFilterFactory, but the page does not explain how to apply it in
the configuration file.
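From what I can tell, a char filter is declared inside the analyzer of a field
type in the schema rather than in solrconfig.xml. Below is a minimal, untested
sketch of what I think that would look like. The field type name
"text_html_stripped" is made up for illustration, and I am assuming attr_* is a
dynamic field that could be pointed at this type:

```xml
<!-- Hypothetical field type; the name is illustrative, not from my schema -->
<fieldType name="text_html_stripped" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Char filters run on the raw text before the tokenizer sees it -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Assumption: captured attribute fields are routed to the type above -->
<dynamicField name="attr_*" type="text_html_stripped"
              indexed="true" stored="true" multiValued="true"/>
```

As far as I understand, analysis only affects the indexed tokens, so the stored
value returned in search results would presumably still contain the raw text.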
Google led me to a class called "HTMLStripReader". I think I'd like to apply
that to the "attr_content" field, but I don't know how to apply it either.
Version: Solr 5.5 (previous version also behaves the same)
Any insights would be appreciated.