On 3/16/2016 12:28 AM, Laszlo Kiss wrote:
> When I index an HTML page the attr_content field shows "//<![CDATA..." stuff
> (part of a <script> tag in the original HTML page). I'm sure the problem
> is with my solrconfig.xml. Here is the section I think I'm looking to adjust.
>
> <requestHandler name="/update/extract"
>                 startup="lazy"
>                 class="solr.extraction.ExtractingRequestHandler" >
>   <lst name="defaults">
>     <str name="lowernames">true</str>
>     <str name="fmap.meta">ignored_</str>
>     <str name="fmap.content">_text_</str>
>     <str name="fmap.script">ignored_</str> <!-- my change -->
>     <str name="captureAttr">true</str> <!-- my change -->
>   </lst>
> </requestHandler>
>
> The reference manual also mentions <script> and CDATA in connection with
> HTMLStripCharFilterFactory but the page does not explain how to apply it in
> the configuration file.

HTMLStripCharFilterFactory is an analysis component. It goes in the fieldType in your schema.

https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory

Usually a filter like this needs to be in both the index and query analysis to work as expected, but since your users are not likely to enter a bunch of XML/HTML as their query, it might work if only placed on the index side.

After you click on the link above and read the section on the HTML strip filter, scroll up to the top of the page and read the general information.

Thanks,
Shawn
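
For illustration, here is a minimal sketch of a fieldType with the char filter on the index side only, as described above. The field type name "text_html" is hypothetical, and the tokenizer and lowercase filter are just assumed choices; adapt them to whatever your schema already uses.

```xml
<!-- Hypothetical field type; the name "text_html" is an assumption. -->
<fieldType name="text_html" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Char filters run before the tokenizer and strip HTML markup. -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!-- No HTML stripping here, since queries are unlikely to contain markup. -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The field that receives the HTML content would then need to use this type. Keep in mind that analysis only changes the indexed terms; the stored value returned in search results is not altered by a char filter.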