By saying "dirty data" you imply that only one of the values is "good" or "clean" and that the others can be safely discarded/ignored, as opposed to true multi-valued data where each value is there for good reason and needs to be preserved. In any case, how do you know/decide which value should be used for sorting - and did you just get lucky that Solr happened to use the right one?

The preferred technique would be to preprocess and "clean" the data before it is handed to Solr or SolrJ, even if the source must remain "dirty". Barring that, a preprocessor or a custom update processor would certainly work.

Please clarify exactly how the data is being fed into Solr.

And if you really do need to preserve the multiple values, simply store them in a separate field that is not sorted. An update processor can do this as well.

-- Jack Krupansky

-----Original Message----- From: Erick Erickson
Sent: Tuesday, June 05, 2012 6:34 AM
To: solr-user@lucene.apache.org
Subject: Re: Correct way to deal with source data that may include a multivalued field that needs to be used for sorting?

Older versions of Solr didn't really sort correctly on multivalued fields; they
just didn't complain <G>.....

Hmmm. Off the top of my head, you can:
1> You don't say what the documents to be indexed are. Are they Solr-style
    documents on disk or do you process them with, say, a SolrJ program?
    If the latter, you can simply inspect them as you construct them, decide
    which of the multi-valued field values you want to use to sort, copy that
    single value into a new field, and sort on that.
2> You could write a custom UpdateRequestProcessorFactory/UpdateRequestProcessor
    pair and do the same thing in the processAdd method.
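
The "pick one value and copy it into a new field" step in option 1 could be sketched roughly as below. This is only an illustration in plain Java: the document is modeled as a Map standing in for whatever document object is actually used (for example SolrJ's SolrInputDocument), the field name sort_normalizedValue is hypothetical and would need to be declared single-valued in schema.xml (it deliberately does not match the multiValued f_* dynamic field), and "use the smallest value" is an assumed policy, since the thread does not say which value is the right one.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in: a document is modeled as field name -> list of values.
// In a SolrJ indexer the same logic would run against the real document object.
final class SortFieldSketch {
    // Field name taken from the error message in the original post.
    static final String SOURCE_FIELD = "f_normalizedValue";
    // Hypothetical single-valued field used only for sorting.
    static final String SORT_FIELD = "sort_normalizedValue";

    // Assumed policy: sort on the smallest value. Returns null if there is none.
    static Float pickSortValue(Collection<Float> values) {
        if (values == null || values.isEmpty()) {
            return null;
        }
        return Collections.min(values);
    }

    // Copies the chosen value into SORT_FIELD and leaves SOURCE_FIELD untouched,
    // so all of the original values are still preserved.
    static void addSortField(Map<String, List<Float>> doc) {
        Float chosen = pickSortValue(doc.get(SOURCE_FIELD));
        if (chosen != null) {
            doc.put(SORT_FIELD, new ArrayList<>(Collections.singletonList(chosen)));
        }
    }

    public static void main(String[] args) {
        Map<String, List<Float>> doc = new HashMap<>();
        List<Float> dirty = new ArrayList<>();
        dirty.add(3.0f);
        dirty.add(1.5f);
        doc.put(SOURCE_FIELD, dirty);

        addSortField(doc);
        System.out.println(doc.get(SORT_FIELD));
    }
}
```

Queries would then sort on the new field rather than on f_normalizedValue. The same selection logic could live in the processAdd method of option 2 instead of in the indexing client.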

Best
Erick

On Mon, Jun 4, 2012 at 10:17 PM, Aaron Daubman <daub...@gmail.com> wrote:
Greetings,

I have "dirty" source data where some documents being indexed, although
unlikely, may contain multivalued fields that are also required for
sorting. In previous versions of Solr, sorting on this field worked fine
(possibly because few or no multivalued fields were ever encountered?).
However, as of 3.6.0, thanks to
https://issues.apache.org/jira/browse/SOLR-2339, attempting to sort on this
field now throws an error:

[2012-06-04 17:20:01,691] ERROR org.apache.solr.common.SolrException
org.apache.solr.common.SolrException: can not sort on multivalued field:
f_normalizedValue

The relevant bits of the schema.xml are:
<fieldType name="sfloat" class="solr.TrieFloatField" precisionStep="0"
positionIncrementGap="0" sortMissingLast="true"/>
<dynamicField name="f_*" type="sfloat" indexed="true" stored="true"
required="false" multiValued="true"/>

Assuming that the source documents being indexed cannot be changed (which,
at least for now, they cannot), what would be the next best way to allow
for both the possibility of multiple f_normalizedValue fields appearing in
indexed documents, as well as being able to sort by f_normalizedValue?

Thank you,
Aaron
