I've recently started using an updateRequestProcessorChain to ensure that
certain fields are present in our Solr records. We have records from several
different sources that are processed in different ways, and adding the
fields via the updateRequestProcessorChain means I don't have to duplicate
the field-creation logic in several different places.
At first it seemed that I might be able to accomplish what I needed with the
TemplateUpdateProcessorFactory, the CloneFieldUpdateProcessorFactory, and
the RegexReplaceProcessorFactory, but I quickly went beyond what they can
easily accomplish.
Example 1:
A document will have one or more pool_f_stored values and a
full_title_tsearch_stored value.
For each pool_f_stored value, generate a field whose name is derived from
that value and whose value is the value of the full_title_tsearch_stored
field (i.e. add a pool-specific title browse field).
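To make the intent of example 1 concrete, here is a minimal sketch of the mapping as a plain function with no Solr dependency. The function name poolTitleFields and the pool names "catalog" and "video" are hypothetical and only for illustration; the "full_<pool>_title_f" naming pattern is assumed.

```javascript
// Hypothetical helper (not part of Solr): build the pool-specific title
// browse fields for one document.
//   pools: array of pool_f_stored values
//   title: the full title value, or null/undefined if absent
function poolTitleFields(pools, title) {
    var fields = {};
    if (title === null || title === undefined) {
        return fields; // no title, so nothing to add
    }
    for (var i = 0; i < pools.length; i++) {
        // Assumed naming pattern: full_<pool>_title_f
        fields["full_" + pools[i] + "_title_f"] = title;
    }
    return fields;
}

// Illustrative input only; these pool names are made up.
var example = poolTitleFields(["catalog", "video"], "Moby Dick");
console.log(JSON.stringify(example));
```

With that input, the result holds two fields, full_catalog_title_f and full_video_title_f, both set to the title.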
Example 2:
A document will have one or more values in a field named
uva_availability_f_stored, drawn from the following set of strings:
{ Online, On shelf, Request, <anything else> }. These strings should be
mapped to the integer values { 3, 2, 1, 0 } respectively, and a field named
uva_availability_isort should be added containing only the largest of those
values.
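The ranking rule in example 2 can likewise be sketched as a plain function, which is handy for checking the mapping outside of Solr. The names AVAILABILITY_RANK and availabilityRank are hypothetical, and "Checked out" below is just a stand-in for an unrecognized value.

```javascript
// Hypothetical lookup table: availability string -> rank.
var AVAILABILITY_RANK = { "Online": 3, "On shelf": 2, "Request": 1 };

// Hypothetical helper: return the largest rank among the given values.
// Any string not in the table counts as 0.
function availabilityRank(values) {
    var best = 0;
    for (var i = 0; i < values.length; i++) {
        var known = AVAILABILITY_RANK.hasOwnProperty(values[i]);
        var rank = known ? AVAILABILITY_RANK[values[i]] : 0;
        best = Math.max(best, rank);
    }
    return best;
}

console.log(availabilityRank(["Request", "Online"]));       // 3
console.log(availabilityRank(["On shelf", "Checked out"])); // 2
console.log(availabilityRank(["Checked out"]));             // 0
```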
So I tried the StatelessScriptUpdateProcessorFactory: I wrote short
JavaScript implementations of the above, called the scripts from the
updateRequestProcessorChain, and tested. Everything seemed great.
However, when I ran the bulk of our 9 million records through the indexing
process, Solr would repeatedly and unceremoniously throw an OOM error and
terminate, usually citing "# java.lang.OutOfMemoryError: Metaspace" as the
reason.
The only change to the indexing setup is that I am now calling the three
JavaScript scripts from the updateRequestProcessorChain.
If I comment out those steps in the updateRequestProcessorChain, I can index
all 9 million items with no problem.
Any thoughts on why this would be the case? Any suggestions on how to track
this down? Any known "gotchas" with using JavaScript scripts from within the
updateRequestProcessorChain?
Java version:
Java(TM) SE Runtime Environment (build 1.8.0-b132)
Java HotSpot(TM) 64-Bit Server VM (build 25.0-b70, mixed mode)
Solr version:
solr-spec 7.3.0
solr-impl 7.3.0 98a6b3d642928b1ac9076c6c5a369472581f7633 - woody -
2018-03-28 14:37:45
JavaScript for example 1:
    function processAdd(cmd) {
        var doc = cmd.solrDoc; // org.apache.solr.common.SolrInputDocument
        // "field_value" names the source field that holds the title.
        var field_value_name = params.get("field_value");
        var field_value = doc.getFieldValue(field_value_name);
        logger.debug("update-script#processAdd: field_value=" + field_value);
        if (field_value != null) {
            // "field_name" names the field whose values become part of
            // each generated field name.
            var field_name_name = params.get("field_name");
            var field_names_response = doc.getFieldValues(field_name_name);
            var field_names = (field_names_response != null) ?
                field_names_response.toArray() : null;
            for (var i = 0; field_names != null && i < field_names.length; i++) {
                var field_name = "full_" + field_names[i] + "_title_f";
                doc.setField(field_name, field_value);
            }
        }
    }
solrconfig.xml entry to call the example 1 script:
    <processor class="solr.StatelessScriptUpdateProcessorFactory">
      <str name="script">title_browse.js</str>
      <lst name="params">
        <str name="field_name">pool_f_stored</str>
        <str name="field_value">full_title_tsearchf_stored</str>
      </lst>
    </processor>
JavaScript for example 2:
    function processAdd(cmd) {
        var doc = cmd.solrDoc; // org.apache.solr.common.SolrInputDocument
        var field_name = params.get("field_name");        // field to create
        var field_value_name = params.get("field_value"); // field to read
        logger.debug("update-script#processAdd: field_value_name=" +
            field_value_name);
        var field_values_result = doc.getFieldValues(field_value_name);
        var field_values = (field_values_result != null) ?
            field_values_result.toArray() : null;
        logger.debug("update-script#processAdd: field_value count=" +
            (field_values == null ? "null" : " " + field_values.length));
        if (field_name != null && field_values != null && field_values.length > 0) {
            // logger.debug("update-script#processAdd: field_value=" + field_value);
            // Keep the largest rank; unrecognized strings leave it at 0.
            var value = 0;
            for (var i = 0; i < field_values.length; i++) {
                var field_value = field_values[i];
                if (field_value.equals("Request")) value = Math.max(value, 1);
                else if (field_value.equals("On shelf")) value = Math.max(value, 2);
                else if (field_value.equals("Online")) value = Math.max(value, 3);
            }
            doc.setField(field_name, value);
        }
    }
solrconfig.xml entry to call the example 2 script:
    <processor class="solr.StatelessScriptUpdateProcessorFactory">
      <str name="script">availability_rank.js</str>
      <lst name="params">
        <str name="field_name">uva_availability_isort</str>
        <str name="field_value">uva_availability_f_stored</str>
      </lst>
    </processor>