Re: UpdateProcessor and copyField

Jan Høydahl Thu, 24 Feb 2011 00:47:46 -0800

Hi,

I'd also like a more powerful/generic CopyField.

Today <copyField> always copies after the UpdateChain and before analysis.
Refactoring it as an UpdateProcessor (using SOLR-2370 to include it in the
default chain) would also let us copy before the UpdateChain. But how could we
get it to copy after analysis?

Imagine these lines in schema.xml:
<copyField source="my_raw_keywords" dest="keywords" when="preUpdate" append="true" />
<copyField source="my_raw_keywords2" dest="keywords" when="preUpdate" append="true" />
<copyField source="keywords" dest="keywords_facet" /> <!-- default: when="preAnalysis" -->
<copyField source="keywords" dest="keywords_stemmed" />
<copyField source="keywords_stemmed" dest="all_stemmed" when="postAnalysis" append="true" />

This would read in two source fields and merge them into the "keywords" field 
before the UpdateChain is run. The UpdateChain may do various magic with the 
field, and then before analysis it is copied to two fields, one for faceting 
and one for a stemmed version. After analysis we copy the stemmed field to 
another stemmed field (which must have the same field class and multiValued 
setting, of course). The postAnalysis copying would also allow for some 
advanced hacking by copying the results of different fieldTypes into one field, 
enabling the use case of lemmatization by expansion on the index side and thus 
querying multiple languages in one and the same field.
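
To make the ordering concrete, here is a rough Python simulation of the three proposed copy phases. It is only a sketch under assumptions: the when/append attributes are the proposal above rather than existing schema syntax, update_chain() and stem() are hypothetical stand-ins, and analysis is modelled as rewriting field values, which is simpler than what really happens at index time.

```python
# Toy model of the proposed copyField phases. Nothing here is Solr API.
COPY_FIELDS = [
    {"source": "my_raw_keywords", "dest": "keywords", "when": "preUpdate", "append": True},
    {"source": "my_raw_keywords2", "dest": "keywords", "when": "preUpdate", "append": True},
    {"source": "keywords", "dest": "keywords_facet", "when": "preAnalysis"},
    {"source": "keywords", "dest": "keywords_stemmed", "when": "preAnalysis"},
    {"source": "keywords_stemmed", "dest": "all_stemmed", "when": "postAnalysis", "append": True},
]


def apply_copy_fields(doc, phase):
    """Apply every copy rule registered for the given phase, in order."""
    for rule in COPY_FIELDS:
        if rule["when"] != phase or rule["source"] not in doc:
            continue
        values = list(doc[rule["source"]])
        if rule.get("append"):
            doc.setdefault(rule["dest"], []).extend(values)
        else:
            doc[rule["dest"]] = values


def update_chain(doc):
    """Hypothetical chain 'magic': lowercase the merged keywords."""
    if "keywords" in doc:
        doc["keywords"] = [v.lower() for v in doc["keywords"]]


def stem(token):
    """Hypothetical stemmer: strip a trailing 's'."""
    return token[:-1] if token.endswith("s") else token


# Only the stemmed field has an analyzer in this toy schema.
ANALYZERS = {"keywords_stemmed": stem}


def analyze(doc):
    for field, analyzer in ANALYZERS.items():
        if field in doc:
            doc[field] = [analyzer(v) for v in doc[field]]


def index_document(doc):
    doc = {name: list(values) for name, values in doc.items()}
    apply_copy_fields(doc, "preUpdate")     # merge raw fields into "keywords"
    update_chain(doc)
    apply_copy_fields(doc, "preAnalysis")   # where <copyField> runs today
    analyze(doc)
    apply_copy_fields(doc, "postAnalysis")  # copies already-analyzed values
    return doc


result = index_document({"my_raw_keywords": ["Cats"], "my_raw_keywords2": ["Dogs"]})
```

In this model "keywords_facet" receives the values as the UpdateChain left them, while "all_stemmed" receives what the stemmer produced, which is the difference the postAnalysis phase is meant to expose.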

From my understanding, the RunUpdateProcessor is one monolithic beast passing 
the doc along for analysis and indexing. Would it be possible to split it in 
two, one AnalysisUpdateProcessor and one IndexUpdateProcessor?
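
As a rough illustration of what such a split could buy us, here is a small Python model of a processor chain. The class names AnalysisUpdateProcessor and IndexUpdateProcessor are the hypothetical ones from the question above, PostAnalysisCopy is invented for the example, and none of this mirrors the real Solr classes or their signatures.

```python
class Processor:
    """Minimal chain link: do some work, then hand the doc to the next link."""

    def __init__(self, next_processor=None):
        self.next = next_processor

    def process(self, doc):
        if self.next is not None:
            self.next.process(doc)


class AnalysisUpdateProcessor(Processor):
    """Hypothetical first half of the split: run analysis only."""

    def process(self, doc):
        # Stand-in for real analysis: lowercase and split on whitespace.
        doc["tokens"] = doc["text"].lower().split()
        super().process(doc)


class PostAnalysisCopy(Processor):
    """Hypothetical processor that can only exist once analysis is its own step."""

    def process(self, doc):
        doc["all_tokens"] = list(doc["tokens"])
        super().process(doc)


class IndexUpdateProcessor(Processor):
    """Hypothetical second half of the split: write the doc to the index."""

    def __init__(self, index, next_processor=None):
        super().__init__(next_processor)
        self.index = index

    def process(self, doc):
        self.index.append(dict(doc))
        super().process(doc)


index = []
chain = AnalysisUpdateProcessor(PostAnalysisCopy(IndexUpdateProcessor(index)))
chain.process({"text": "Hello Solr"})
```

With a single monolithic processor there is no seam where PostAnalysisCopy could be placed; once analysis and indexing are separate links, anything can be inserted between them.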

Chris, for the custom field manipulations in custom UpdateChains, a 
"FieldManipulator" UpdateProcessor that can be inserted wherever you like, 
depending on the use case, makes sense. I believe this can/should exist 
independently of a refactoring of <copyField>.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 24. feb. 2011, at 03.16, Chris Hostetter wrote:

> 
> : > Maybe copy fields should be refactored to happen in a new, core, 
> : update processor, so there is nothing special/awkward about them?  It 
> : seems they fit as part of what an update processor is all about, 
> : augmenting/modifying incoming documents.
> : 
> : Seems reasonable.
> : By default, the copyFields could be read from the schema for back
> : compat (and the fact that copyField does feel more natural in the
> : schema)
> 
> As someone who has written special case UpdateProcessors that clone field 
> values, I agree that it would be handy to have a new generic 
> "CopyFieldUpdateProcessor" but i'm not really on board with the idea of it 
> reading <copyField .. /> declarations by default.  the ideas really serve 
> different purposes...
> 
> * as an UpdateProcessor it's something that can be 
> adjusted/configured/overridden on a per-use-case basis - some request 
> handlers could be configured to use a processor chain that includes the 
> CopyFieldUpdateProcessor and some could be configured not to.
> 
> * schema copyField declarations are things that happen to *every* document, 
> regardless of where it comes from.
> 
> the use cases would be very different: consider a schema with many 
> different fields specific to certain types of documents, as well as a few 
> required fields that every type of document must have: "title", 
> "description", "body", and "maintext" fields.  it might make sense 
> to use different processor chains along with a 
> CopyFieldUpdateProcessor to clone some other fields (say: a 
> "dust_jacked_text" field for books, and a "plot_summary" field for movies) 
> into the "description" field when those docs are indexed -- but if you 
> absolutely positively *always* wanted the contents of title, description, 
> and body to be copied into the "maintext" field that would make more sense 
> as a schema.xml declaration.
> 
> likewise: it would be handy to have an UpdateProcessor that rejected 
> documents that were missing some fields -- but that would not be a true 
> substitute for using required="true" on a field in the schema.xml.
> 
> a single index may have multiple valid processor chains for different 
> indexing situations -- but "rules" declared in the schema.xml are absolute 
> and can not be circumvented.
> 
> 
> -Hoss
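
The distinction drawn in the quoted mail, between copies that belong to one processor chain and rules the schema applies to every document, can be sketched in Python as follows. This is a toy model under assumptions: the chain names, the choice of "title" as the only required field, and the add_document() helper are all hypothetical, and the code does not use any Solr API.

```python
# Schema-level rules: applied to every document, whichever chain it came through.
SCHEMA_REQUIRED = {"title"}  # assumed; stands in for required="true"
SCHEMA_COPY = [("title", "maintext"), ("description", "maintext"), ("body", "maintext")]


def copy_processor(source, dest):
    """Build a chain-specific step that appends source values to dest."""
    def process(doc):
        if source in doc:
            doc.setdefault(dest, []).extend(doc[source])
    return process


# Per-chain rules: a request handler picks one of these (names are hypothetical).
CHAINS = {
    "books": [copy_processor("dust_jacked_text", "description")],
    "movies": [copy_processor("plot_summary", "description")],
    "plain": [],
}


def add_document(doc, chain):
    doc = {name: list(values) for name, values in doc.items()}
    for process in CHAINS[chain]:
        process(doc)  # optional, depends on the chosen chain
    missing = SCHEMA_REQUIRED - doc.keys()
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    for source, dest in SCHEMA_COPY:  # unconditional, cannot be bypassed
        if source in doc:
            doc.setdefault(dest, []).extend(doc[source])
    return doc


raw = {"title": ["Dune"], "dust_jacked_text": ["A desert planet."]}
book = add_document(raw, "books")
plain = add_document(raw, "plain")
```

The same document gets a "description" only through the "books" chain, yet "maintext" is filled on both paths and a document without "title" is rejected on every path.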
