OK, I have HTML pages of the form <html>.....<!--body-->content I 
want....<!--/body-->.....</html>. I want to extract (index, store) only the part 
between the body comments. I thought the RegexTransformer would be the best fit because 
XPath doesn't work in Tika and I can't nest an XPathEntityProcessor to use XPath. 
I have also found out that the HTML parser in Tika cuts my 
body comments out and tries to produce well-formed HTML, which I would like to 
switch off.
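
To make the goal concrete, here is a minimal sketch in plain Java (outside of Solr and Tika) of the extraction I am after. The class and method names are hypothetical, and it assumes the body comments are still present in the string, which is not the case once Tika's HTML parser has removed them. It uses the DOTALL flag with a reluctant `.*?` instead of the `(.|\n|\r)+` alternation, since as far as I know that alternation is what makes Java's regex engine recurse deeply on long input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BodyCommentExtractor {
    // DOTALL lets '.' match line terminators, so the content may span lines.
    // The reluctant '.*?' stops at the first closing marker.
    private static final Pattern BODY =
            Pattern.compile("<!--body-->(.*?)<!--/body-->", Pattern.DOTALL);

    // Returns the text between the body comments, or null if they are missing.
    public static String extract(String html) {
        Matcher m = BODY.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<html>\n<head></head>\n"
                + "<!--body-->line one\nline two<!--/body-->\n</html>";
        System.out.println(extract(html));
    }
}
```

Whether the same pattern behaves identically inside the RegexTransformer is something I have not verified.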

On 6. Sep 2013, at 5:04 PM, Shawn Heisey wrote:

> On 9/6/2013 7:09 AM, Andreas Owen wrote:
>> I've managed to get it working with the RegexTransformer when the string is 
>> on a single line in my tika entity. But when the string spans multiple lines it 
>> doesn't work, even though I tried ?s to set the DOTALL flag.
>> 
>> <entity name="tika" processor="TikaEntityProcessor" url="${rec.url}" 
>> dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html" 
>> transformer="RegexTransformer">
>>      <field column="text_html" regex="&lt;body&gt;(.+)&lt;/body&gt;" 
>> replaceWith="QQQQQQQQQQQQQQQ" sourceColName="text"  />
>> </entity>
>>                      
>> Then I tried it like this, and I get a stack overflow:
>> 
>> <field column="text_html" regex="&lt;body&gt;((.|\n|\r)+)&lt;/body&gt;" 
>> replaceWith="QQQQQQQQQQQQQQQ" sourceColName="text"  />
>> 
>> In JavaScript this works, but maybe only because I used a small string.
> 
> Sounds like we've got an XY problem here.
> 
> http://people.apache.org/~hossman/#xyproblem
> 
> How about you tell us *exactly* what you'd actually like to have happen
> and then we can find a solution for you?
> 
> It sounds a little bit like you're interested in stripping all the HTML
> tags out.  Perhaps the HTMLStripCharFilter?
> 
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
> 
> Something that I already said: By using the KeywordTokenizer, you won't
> be able to search for individual words on your HTML input.  The entire
> input string is treated as a single token, and therefore ONLY exact
> entire-field matches (or certain wildcard matches) will be possible.
> 
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.KeywordTokenizerFactory
> 
> Note that no matter what you do to your data with the analysis chain,
> Solr will always return the text that was originally indexed in search
> results.  If you need to affect what gets stored as well, perhaps you
> need an Update Processor.
> 
> Thanks,
> Shawn
