I've recently started using the updateRequestProcessorChain to ensure the
presence of certain fields in our Solr records. We have records from several
different sources that are processed in different ways, and by adding the
fields via the updateRequestProcessorChain I don't have to duplicate the
field-creation logic in several different places.

At first it seemed that I might be able to accomplish what I needed with the
TemplateUpdateProcessorFactory, the CloneFieldUpdateProcessorFactory and the
RegexReplaceProcessorFactory, but I quickly went beyond what they can easily
accomplish.

Example 1:
A document will have one or more pool_f_stored values and a
full_title_tsearch_stored value. For each pool_f_stored value, generate a
field whose name is derived from that value and whose value is equal to the
value of the full_title_tsearch_stored field. (This adds a pool-specific title
browse field.)
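
To make the naming rule concrete, here is a minimal sketch in plain JavaScript,
independent of Solr. The helper name titleBrowseFieldNames and the pool values
"catalog" and "video" are made up for illustration; the "full_..._title_f"
pattern is the one my script uses.

```javascript
// Hypothetical helper (not part of Solr): derive one browse field name
// per pool value, following the pattern "full_<pool>_title_f".
function titleBrowseFieldNames(pools) {
  var names = [];
  for (var i = 0; i < pools.length; i++) {
    names.push("full_" + pools[i] + "_title_f");
  }
  return names;
}

// Illustrative pool values only; real values come from pool_f_stored.
var fields = titleBrowseFieldNames(["catalog", "video"]);
// fields is ["full_catalog_title_f", "full_video_title_f"]
```

Each of those fields would then receive the same title value.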

Example 2:
A document will have one or more values in a field named
uva_availability_f_stored. These values come from the following set of
strings: { Online, On shelf, Request, <anything else> }. These strings should
be mapped to the integer values { 3, 2, 1, 0 } respectively, and a field named
uva_availability_isort should be added containing only the largest of those
values.
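
As a minimal sketch of that mapping in plain JavaScript, independent of Solr
(the function name availabilityRank is hypothetical and only for
illustration):

```javascript
// Hypothetical helper (not part of Solr): map availability strings to
// ranks and return the largest one. Unrecognised strings count as 0.
function availabilityRank(values) {
  var ranks = { "Online": 3, "On shelf": 2, "Request": 1 };
  var best = 0;
  for (var i = 0; i < values.length; i++) {
    var known = Object.prototype.hasOwnProperty.call(ranks, values[i]);
    best = Math.max(best, known ? ranks[values[i]] : 0);
  }
  return best;
}

var rank = availabilityRank(["Request", "Online"]);
// rank is 3, because "Online" outranks "Request"
```

A document whose only value is some other string would get 0.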

So I tried the StatelessScriptUpdateProcessorFactory: I wrote short JavaScript
implementations to accomplish the above, called the scripts from the
updateRequestProcessorChain, and tested them. Everything seemed great.

However, when I ran the bulk of our 9 million records through the indexing
process, Solr would repeatedly and unceremoniously throw an OOM error and
terminate, usually citing "# java.lang.OutOfMemoryError: Metaspace" as the
reason.
The only difference is that I am now calling the three JavaScript scripts
during the updateRequestProcessorChain.

If I comment out those steps in the updateRequestProcessorChain, I can index
all 9 million items with no problem.

Any thoughts on why this would be the case? Any suggestions on how to track
this down? Any known "gotchas" with using JavaScript scripts from within the
updateRequestProcessorChain?

Java version:
Java(TM) SE Runtime Environment (build 1.8.0-b132)
Java HotSpot(TM) 64-Bit Server VM (build 25.0-b70, mixed mode)
Solr version:
solr-spec    7.3.0
solr-impl    7.3.0 98a6b3d642928b1ac9076c6c5a369472581f7633 - woody - 2018-03-28 14:37:45

JavaScript for example 1:

function processAdd(cmd) {

  var doc = cmd.solrDoc;  // org.apache.solr.common.SolrInputDocument
  // "field_value" param: name of the field whose value is copied
  var field_value_name = params.get("field_value");
  var field_value = doc.getFieldValue(field_value_name);
  logger.debug("update-script#processAdd: field_value=" + field_value);

  if (field_value != null)
  {
      // "field_name" param: name of the field whose values become field names
      var field_name_name = params.get("field_name");
      var field_names_response = doc.getFieldValues(field_name_name);
      var field_names = (field_names_response != null) ? field_names_response.toArray() : null;
      // Declared with var so nothing leaks into the script's global scope
      for (var i = 0; field_names != null && i < field_names.length; i++)
      {
          var field_name = "full_" + field_names[i] + "_title_f";
          doc.setField(field_name, field_value);
      }
  }
}

solrconfig.xml to call example 1 script:

     <processor class="solr.StatelessScriptUpdateProcessorFactory">
         <str name="script">title_browse.js</str>
         <lst name="params">
            <str name="field_name">pool_f_stored</str>
            <str name="field_value">full_title_tsearchf_stored</str>
         </lst>
     </processor>

JavaScript for example 2:

function processAdd(cmd) {

  var doc = cmd.solrDoc;  // org.apache.solr.common.SolrInputDocument
  var field_name = params.get("field_name");         // field that receives the rank
  var field_value_name = params.get("field_value");  // field holding the availability strings
  logger.debug("update-script#processAdd: field_value_name=" + field_value_name);
  var field_values_result = doc.getFieldValues(field_value_name);
  var field_values = (field_values_result != null) ? field_values_result.toArray() : null;
  logger.debug("update-script#processAdd: field_value count=" +
               (field_values == null ? "null" : " " + field_values.length));

  if (field_name != null && field_values != null && field_values.length > 0)
  {
      // Keep the largest rank seen; unrecognised strings leave the value at 0
      var value = 0;
      for (var i = 0; i < field_values.length; i++)
      {
          var field_value = field_values[i];
          if (field_value.equals("Request"))  value = Math.max(value, 1);
          else if (field_value.equals("On shelf")) value = Math.max(value, 2);
          else if (field_value.equals("Online"))   value = Math.max(value, 3);
      }
      doc.setField(field_name, value);
  }
}

solrconfig.xml to call example 2 script:

     <processor class="solr.StatelessScriptUpdateProcessorFactory">
         <str name="script">availability_rank.js</str>
         <lst name="params">
            <str name="field_name">uva_availability_isort</str>
            <str name="field_value">uva_availability_f_stored</str>
         </lst>
     </processor>

