I've managed to get it working with the RegexTransformer when the string is on
a single line in my Tika entity. But when the string spans multiple lines it
doesn't work, even though I tried ?s to set the DOTALL flag.
<entity name="tika" processor="TikaEntityProcessor" url="${rec.url}"
dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html"
transformer="RegexTransformer">
<field column="text_html" regex="<body>(.+)</body>"
replaceWith="QQQQQQQQQQQQQQQ" sourceColName="text" />
</entity>
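For comparison, here is a minimal standalone sketch of how the inline DOTALL flag behaves in plain Java. It assumes the RegexTransformer hands the pattern to java.util.regex and replaces every match; the class and method names are made up for the illustration. In Java the flag has to be written as (?s) at the start of the pattern, with the parentheses.

```java
import java.util.regex.Pattern;

public class DotallCheck {
    // Hypothetical helper: replaces everything from <body> to </body>.
    // The leading (?s) turns on DOTALL so that '.' also matches \n and \r.
    static String replaceBody(String html, String replacement) {
        return Pattern.compile("(?s)<body>(.+)</body>")
                .matcher(html)
                .replaceAll(replacement);
    }

    // Same pattern without the flag, to show the difference on multi-line input.
    static boolean matchesWithoutFlag(String html) {
        return Pattern.compile("<body>(.+)</body>").matcher(html).find();
    }

    public static void main(String[] args) {
        String html = "<html><body>line one\nline two\r\nline three</body></html>";
        System.out.println(matchesWithoutFlag(html));              // no match across line breaks
        System.out.println(replaceBody(html, "QQQQQQQQQQQQQQQ"));  // body replaced as a whole
    }
}
```

If the same pattern behaves differently inside the data import config, the difference is more likely in how the attribute value reaches the regex engine than in the flag itself.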
Then I tried it like this and got a stack overflow:
<field column="text_html" regex="<body>((.|\n|\r)+)</body>"
replaceWith="QQQQQQQQQQQQQQQ" sourceColName="text" />
In JavaScript this works, but maybe only because I used a small string.
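A possible explanation, offered as a sketch rather than a confirmed diagnosis: Java's regex engine is known to recurse when it repeats a group that contains an alternation, such as (.|\n|\r)+, so a full HTML page can exhaust the stack where a short test string does not. A character class like [\s\S] matches any character, line breaks included, without the alternation. The class and method names below are made up for the illustration, and the input size is arbitrary.

```java
import java.util.regex.Pattern;

public class BodyReplace {
    // [\s\S] matches any character, including \n and \r,
    // so no alternation is needed inside the repeated part.
    static final Pattern BODY = Pattern.compile("<body>([\\s\\S]+)</body>");

    // Hypothetical helper: replaces the whole <body>...</body> span.
    static String replaceBody(String html, String replacement) {
        return BODY.matcher(html).replaceAll(replacement);
    }

    public static void main(String[] args) {
        // Build a large multi-line document to mimic a real HTML page.
        StringBuilder sb = new StringBuilder("<html><body>");
        for (int i = 0; i < 50000; i++) {
            sb.append("line ").append(i).append("\r\n");
        }
        sb.append("</body></html>");
        System.out.println(replaceBody(sb.toString(), "QQQQQQQQQQQQQQQ"));
    }
}
```

The pattern (?s)<body>(.+)</body> should be equivalent here; the character class simply avoids depending on the flag.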
On 6. Sep 2013, at 2:55 PM, Jack Krupansky wrote:
> Is there any chance that you changed your schema since you indexed the data?
> If so, re-index the data.
>
> If a "*" query finds nothing, that implies that the default field is empty.
> Are you sure the "df" parameter is set to the field containing your data?
> Show us your request handler definition and a sample of your actual Solr
> input (Solr XML or JSON?) so that we can see what fields are being populated.
>
> -- Jack Krupansky
>
> -----Original Message----- From: Andreas Owen
> Sent: Friday, September 06, 2013 4:01 AM
> To: [email protected]
> Subject: Re: charfilter doesn't do anything
>
> the input string is a normal html page with the word Zahlungsverkehr in it
> and my query is ...solr/collection1/select?q=*
>
> On 5. Sep 2013, at 9:57 PM, Jack Krupansky wrote:
>
>> And show us an input string and a query that fail.
>>
>> -- Jack Krupansky
>>
>> -----Original Message----- From: Shawn Heisey
>> Sent: Thursday, September 05, 2013 2:41 PM
>> To: [email protected]
>> Subject: Re: charfilter doesn't do anything
>>
>> On 9/5/2013 10:03 AM, Andreas Owen wrote:
>>> I would like to filter / replace a word during indexing, but it doesn't do
>>> anything and I don't get an error.
>>>
>>> In schema.xml I have the following:
>>>
>>> <field name="text_html" type="text_cutHtml" indexed="true" stored="true"
>>> multiValued="true"/>
>>>
>>> <fieldType name="text_cutHtml" class="solr.TextField">
>>> <analyzer>
>>> <!-- <tokenizer class="solr.StandardTokenizerFactory"/> -->
>>> <charFilter class="solr.PatternReplaceCharFilterFactory"
>>> pattern="Zahlungsverkehr" replacement="ASDFGHJK" />
>>> <tokenizer class="solr.KeywordTokenizerFactory"/>
>>> </analyzer>
>>> </fieldType>
>>>
>>> My second question: where can I specify that the expression is multiline, the
>>> way I can use /m at the end of the pattern in JavaScript?
>>
>> I don't know about your second question. I don't know if that will be
>> possible, but I'll leave that to someone who's more expert than I.
>>
>> As for the first question, here's what I have. Did you reindex? That
>> will be required.
>>
>> http://wiki.apache.org/solr/HowToReindex
>>
>> Assuming that you did reindex, are you trying to search for ASDFGHJK in
>> a field that contains more than just "Zahlungsverkehr"? The keyword
>> tokenizer might not do what you expect - it tokenizes the entire input
>> string as a single token, which means that you won't be able to search
>> for single words in a multi-word field without wildcards, which are
>> pretty slow.
>>
>> Note that both the pattern and replacement are case sensitive. This is
>> how regex works. You haven't used a lowercase filter, which means that
>> you won't be able to search for asdfghjk.
>>
>> Use the analysis tab in the UI on your core to see what Solr does to
>> your field text.
>>
>> Thanks,
>> Shawn