On 9/6/2013 7:09 AM, Andreas Owen wrote:
> i've managed to get it working if i use the regexTransformer and string is on
> the same line in my tika entity. but when the string is multilined it isn't
> working even though i tried ?s to set the flag dotall.
>
> <entity name="tika" processor="TikaEntityProcessor" url="${rec.url}"
> dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html"
> transformer="RegexTransformer">
> <field column="text_html" regex="<body>(.+)</body>"
> replaceWith="QQQQQQQQQQQQQQQ" sourceColName="text" />
> </entity>
>
> then i tried it like this and i get a stackoverflow
>
> <field column="text_html" regex="<body>((.|\n|\r)+)</body>"
> replaceWith="QQQQQQQQQQQQQQQ" sourceColName="text" />
>
> in javascript this works but maybe because i only used a small string.

Sounds like we've got an XY problem here.

http://people.apache.org/~hossman/#xyproblem

How about you tell us *exactly* what you'd actually like to have happen,
and then we can find a solution for you?

It sounds a little bit like you're interested in stripping all the HTML
tags out. If so, perhaps the HTMLStripCharFilter would do it?

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory

As I said before: if you use the KeywordTokenizer, you won't be able to
search for individual words in your HTML input. The entire input string
is treated as a single token, so ONLY exact entire-field matches (or
certain wildcard matches) will be possible.

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.KeywordTokenizerFactory
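
To make the two points above concrete, here is a minimal sketch of a
schema.xml field type, assuming that stripping tags and searching on
individual words is really what you want. The type name
"text_html_stripped" is made up for illustration, and the choice of
StandardTokenizerFactory plus lowercasing is my assumption, not something
taken from your config:

```xml
<!-- Hypothetical field type; the name "text_html_stripped" is an assumption. -->
<fieldType name="text_html_stripped" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Removes HTML markup from the character stream before tokenization. -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <!-- Splits the text into individual words, unlike KeywordTokenizerFactory,
         which would emit the whole field value as one token. -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Assumed for illustration: case-insensitive matching. -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

A field using a type like this should be searchable word by word, with
the markup ignored at analysis time.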

Note that no matter what you do to your data with the analysis chain,
Solr will always return the original text that was sent for indexing in
search results. Analysis only affects the indexed terms, not the stored
value. If you need to change what gets stored as well, perhaps you need
an Update Processor.
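
As a rough sketch of the update processor route, and only if removing
all markup from the stored value is acceptable, something like the
following in solrconfig.xml might be a starting point. The chain name
"strip-html" is made up, and I am assuming the field to clean is the
text_html column from your entity:

```xml
<!-- Hypothetical chain; the name "strip-html" is an assumption. -->
<updateRequestProcessorChain name="strip-html">
  <!-- Strips HTML markup from the field value before it is indexed and
       stored. The field name is assumed from the DIH config quoted above. -->
  <processor class="solr.HTMLStripFieldUpdateProcessorFactory">
    <str name="fieldName">text_html</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <!-- Must come last: this is the processor that actually performs the update. -->
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

The chain only runs if the request handler doing the import is told to
use it, typically through an update.chain parameter in that handler's
defaults. Whether this fits depends on what you actually want to end up
with in the field.
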
Thanks,
Shawn