RE: problem with RegexTransformer and delimited data

Steven A Rowe Tue, 13 Apr 2010 10:47:38 -0700

Hi Gerald,

Looking at the source for RegexTransformer.process(), which is called for each 
source row, I can see that there are three mutually exclusive processing cases 
(warning - (extremely) pseudo code):


1. if (splitBy) then return row.split(splitBy)
2. else if (replaceWith) then return row.replaceAll(regex, replaceWith)
3. else return row.groups(regex, groupNames)

In other words, when you use splitBy, the other attributes (regex, replaceWith, 
etc.) are ignored.
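
To make the three cases concrete, here is a minimal Java sketch of the same dispatch. It is not the Solr source: the class name, the method signature, and the use of null to mean "attribute not set" are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexCasesSketch {
    // Hypothetical stand-in for the per-field logic; NOT the actual
    // RegexTransformer code. A null argument means "attribute not set".
    static Object process(String value, String splitBy,
                          String regex, String replaceWith) {
        if (splitBy != null) {
            // Case 1: splitBy wins; regex and replaceWith are never consulted.
            return Arrays.asList(value.split(splitBy));
        } else if (replaceWith != null) {
            // Case 2: replace every match of regex.
            return value.replaceAll(regex, replaceWith);
        } else {
            // Case 3: collect the capture groups of the first match.
            Matcher m = Pattern.compile(regex).matcher(value);
            List<String> groups = new ArrayList<>();
            if (m.find()) {
                for (int i = 1; i <= m.groupCount(); i++) {
                    groups.add(m.group(i));
                }
            }
            return groups;
        }
    }
}
```

Note that in this sketch, passing both splitBy and regex behaves exactly as if regex had been omitted.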

You wrote that for data "dataA1|^dataA2|?dataB1|^dataB2|?dataC1|^dataC2", you 
want to split out and save dataA1, dataB1, and dataC1 to a multivalue
field and ignore the rest.  The following might do the trick:

<field sourceColName="myfield" column="mydata" 
       splitBy="\|\^[^|]*(?:\|\?)?"/>

The splitBy regex above would match "|^dataA2|?", "|^dataB2|?", and "|^dataC2", 
leaving { "dataA1", "dataB1", "dataC1" }.
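
If you want to check the pattern outside of Solr first, a small standalone Java program along these lines should do. It assumes splitBy ends up as a plain java.util.regex split, which is what my reading of the source suggests; the class and method names are made up for the example.

```java
import java.util.Arrays;

public class SplitByDemo {
    // Same pattern as in the splitBy attribute above, with backslashes
    // doubled for a Java string literal.
    static final String SPLIT_BY = "\\|\\^[^|]*(?:\\|\\?)?";

    // Hypothetical helper: keeps only the first item of each "|?" record.
    static String[] firstFields(String raw) {
        return raw.split(SPLIT_BY);
    }

    public static void main(String[] args) {
        String raw = "dataA1|^dataA2|?dataB1|^dataB2|?dataC1|^dataC2";
        // prints [dataA1, dataB1, dataC1]
        System.out.println(Arrays.toString(firstFields(raw)));
    }
}
```

String.split drops trailing empty strings, so the final "|^dataC2" match at the end of the input does not leave an empty last element.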

Steve

On 04/13/2010 at 10:53 AM, Gerald wrote:
> 
> Thanks guys. Unfortunately, neither pattern works.
> 
> I tried various combos including these:
> 
> ([^|]*)\|([^|]*)  with replaceWith="$1"
> (.*?)(\|.*) with replaceWith="$1"
> (.*?)\|.* with and without replaceWith="$1"
> (.*?)\|  with and without replaceWith="$1"
> 
> As previously mentioned, I have tried many other ways without success.
> 
> I did notice that if I don't do the stripBy, it removes
> everything from the last "|^" onwards, leaving me something like
> "dataA1|^dataA2|?dataB1|^dataB2|?dataC1".
> 
> To me this doesn't look like a regex pattern issue; instead it looks
> more like a solr/lucene issue with regex.
> 
> any other suggestions welcome.  otherwise, will have to create custom
> transformer
