follow-up:

regex="([^\|]+)\|\d+,\d+,\d+,(.+)"

is the version I settled on after running into the following problem with
regex="([^\|]+)\|\d+,\d+,\d+,(.*)"
(I changed * to + in the second group):

The role field contained empty values even though I had added a TrimFilterFactory with a minimum length of 1. So I changed the regular expression to match only non-empty values. That works now, but if no value is found for the second group, the value for the first group is not added either.
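To illustrate the difference between the two expressions outside of Solr, here is a minimal sketch using plain java.util.regex. It assumes the transformer applies the pattern with find()-style matching, which I have not verified; the class and method names (RegexGroupDemo, describe) are made up for this example. The inputs are the two sample lines from my original message quoted below.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Solr.
class RegexGroupDemo {
    // Second group must contain at least one character.
    static final Pattern PLUS = Pattern.compile("([^\\|]+)\\|\\d+,\\d+,\\d+,(.+)");
    // Second group may be empty.
    static final Pattern STAR = Pattern.compile("([^\\|]+)\\|\\d+,\\d+,\\d+,(.*)");

    // Describes what the pattern captures for the given input.
    // Assumption: find() semantics (match anywhere in the input).
    static String describe(Pattern p, String input) {
        Matcher m = p.matcher(input);
        if (!m.find()) {
            return "no match";
        }
        return "participant=[" + m.group(1) + "] role=[" + m.group(2) + "]";
    }

    public static void main(String[] args) {
        String a = "Daniel Radcliffe|24897,1,1,Harry Potter";
        String b = "Daniel Radcliffe|24897,1,1,";
        System.out.println(describe(PLUS, a)); // both groups captured
        System.out.println(describe(PLUS, b)); // no match at all
        System.out.println(describe(STAR, a)); // both groups captured
        System.out.println(describe(STAR, b)); // match, role is the empty string
    }
}
```

With (.+) the whole expression fails on an input without a role, so there is no first group to read either. With (.*) the expression matches, but the second group is an empty string.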

Any help on getting this solved is greatly appreciated.
It boils down to this question:

- How can I get the RegexTransformer to add a value for a group only if
that value is non-empty, without it adding values only when all of the groups contain values?
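The behaviour I am after could be sketched like this in plain Java. This is not the RegexTransformer code, only an illustration of the intended semantics: the second group is optional (.*), and a group name is mapped to a value only when that group's value is non-empty after trimming. OptionalGroupSketch, extract and GROUP_NAMES are hypothetical names chosen for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the desired semantics, not Solr code.
class OptionalGroupSketch {
    // (.*) lets the expression match even when the role is missing.
    static final Pattern PERSON = Pattern.compile("([^\\|]+)\\|\\d+,\\d+,\\d+,(.*)");
    // Mirrors groupNames="participant,role" from the DIH config.
    static final String[] GROUP_NAMES = {"participant", "role"};

    // Returns one entry per group that captured a non-empty value.
    static Map<String, String> extract(String input) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = PERSON.matcher(input);
        if (!m.find()) {
            return fields; // nothing matched, nothing to add
        }
        for (int i = 0; i < GROUP_NAMES.length; i++) {
            String value = m.group(i + 1);
            // Skip empty groups individually instead of dropping the whole row.
            if (value != null && !value.trim().isEmpty()) {
                fields.put(GROUP_NAMES[i], value.trim());
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(extract("Daniel Radcliffe|24897,1,1,Harry Potter"));
        System.out.println(extract("Daniel Radcliffe|24897,1,1,"));
    }
}
```

For the input without a role, this yields only the participant entry, which is what I would expect to see in the index.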

Maybe the configuration with groupNames is meant to work like that. If that is the case, it's probably worth adding this information to the Wiki. I will switch back to using the sourceCol attribute, since
https://issues.apache.org/jira/browse/SOLR-1498
should be fixed in this 1.4.0 RC version.

Thanks!
Chantal

Chantal Ackermann wrote:
Dear all,

my DIH config contains the following directive for the RegexTransformer:

<field column="person" groupNames="participant,role"
regex="([^\|]+)\|\d+,\d+,\d+,(.+)" />

(this is SOLR 1.4.0 RC downloaded yesterday from Grant's URL)

It expects input of the kind (version A):
Daniel Radcliffe|24897,1,1,Harry Potter

It should also work with (version B):
Daniel Radcliffe|24897,1,1,

In my index, however, I can only find documents that either contain
participant and role or neither. Of course, I didn't check all
documents. But for both fields, Luke shows the same number of documents:
Docs:  47015

(There are definitely datasets that contain participants without role.)

I'll check the code and try with a different configuration (using
sourceCol). But I thought I'd spread the news before the release is final.

Thanks,
Chantal

