Re: Returning and faceting on some of the field's values

Jeff Schmidt Tue, 29 Nov 2011 13:50:46 -0800

This does appear to work well.  It seems there are not many people interested 
in this particular problem now, but I figured I'd just complete the story in 
case it helps somebody in the future.


With the neighbor node ID prefixes, I'm getting the facet values and counts as 
I require.   Since I have an application standing between Solr and the client, 
I can play these games with the prefix.  When I build my content for indexing 
(Solr compatible XML), I add the prefixes.  When I return facet results to the 
client, I remove the prefixes before they pass through.  When the client 
specifies one or more values to pin down for the facet (drill down), I add the 
prefixes as I configure the Solr query via SolrJ.
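
For anyone wanting something concrete, here is a minimal sketch of the add/remove prefix step described above. It is plain Java with no Solr dependency. The class and method names (FacetPrefixes, addPrefix, stripPrefix) are made up for illustration, and the "-" separator is simply the one used in this thread. It is a sketch under those assumptions, not the code my application runs.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper; names are illustrative and not part of Solr or SolrJ. */
final class FacetPrefixes {
    // Separator between the node ID and the real field value, as in "ING:afa-D".
    private static final String SEPARATOR = "-";

    private FacetPrefixes() {}

    /** Index time and drill-down: turns "D" into "ING:afa-D" for node ID "ING:afa". */
    static String addPrefix(String nodeId, String value) {
        return nodeId + SEPARATOR + value;
    }

    /** Response time: removes the prefix from each facet value, preserving order. */
    static Map<String, Long> stripPrefix(String nodeId, Map<String, Long> prefixedCounts) {
        String prefix = nodeId + SEPARATOR;
        Map<String, Long> result = new LinkedHashMap<String, Long>();
        for (Map.Entry<String, Long> e : prefixedCounts.entrySet()) {
            String key = e.getKey();
            if (key.startsWith(prefix)) { // ignore values that belong to another node
                result.put(key.substring(prefix.length()), e.getValue());
            }
        }
        return result;
    }
}
```

Note that the sketch matches on the node ID plus the separator, so a node ID that merely begins with the same characters as the focus node ID is not picked up by accident.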

In the future, I'm sure I'll be asked to facet more of the process fields that 
are specific between two nodes.  I guess I'll just expand the use of the 
prefixes to more fields.
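
Extending this to several fields mostly means emitting one per-field facet.prefix parameter for each faceted field. A rough sketch of how the parameters could be assembled follows. The class name PrefixedFacetParams and the second field name used in the usage note (n_processFacet) are hypothetical; only n_directionalityFacet and the f.<field>.facet.prefix form come from the query shown earlier in this thread. Values are left unencoded here, on the assumption that the client library does the URL encoding.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical builder; not a SolrJ class. Produces parameter name -> values. */
final class PrefixedFacetParams {
    private PrefixedFacetParams() {}

    static Map<String, List<String>> build(String focusNodeId, List<String> facetFields) {
        Map<String, List<String>> params = new LinkedHashMap<String, List<String>>();
        params.put("facet", single("true"));
        params.put("facet.mincount", single("1"));
        // facet.field is a repeated parameter: one value per faceted field.
        params.put("facet.field", new ArrayList<String>(facetFields));
        for (String field : facetFields) {
            // Per-field override, so each field only counts this node's values.
            params.put("f." + field + ".facet.prefix", single(focusNodeId + "-"));
        }
        return params;
    }

    private static List<String> single(String value) {
        List<String> list = new ArrayList<String>();
        list.add(value);
        return list;
    }
}
```

Calling build("ING:afa", ...) with n_directionalityFacet and n_processFacet would yield two facet.field values and one facet.prefix entry per field, which could then be copied onto the query object before it is sent.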

Take it easy,

Jeff

On Nov 28, 2011, at 9:06 PM, Schmidt Jeff wrote:

> Well, here's something that might just work.  Using the Solr 3.4+ 
> facet.prefix parameter, as well as prefixing the values of the particular 
> field I want to facet based on the node neighbor ID, I get what I need.
> 
> Adding the field:
> 
>         <field name="n_directionalityFacet" type="string" indexed="true" 
> stored="false" multiValued="true" omitNorms="true" />
> 
> Then, for each value, I prefix it with {nodeId}-.  For example, using the 
> focus node ID of ING:afa, I can get as a result document set, all of the 
> neighbors of that node ID. Then, I also tell Solr to facet using that same 
> focus node ID prefix:
> 
> http://localhost:8091/solr/ing-content/select/?qt=partner-tmo&fq=type%3Anode&fq=n_neighborof_id%3AING\:afa&rows=0&facet=true&facet.mincount=1&facet.field=n_directionalityFacet&f.n_directionalityFacet.facet.prefix=ING%3Aafa
> 
> And, for that particular facet, I get only the values and counts relevant to 
> the focus node ID:
> 
> <lst name="facet_fields">
>  <lst name="n_directionalityFacet">
>    <int name="ING:afa-D">82</int>
>    <int name="ING:afa-B">2</int>
>    <int name="ING:afa-A">1</int>
>    <int name="ING:afa-U">1</int>
>  </lst>
> </lst>
> 
> My app can then take this response and remove the prefix before returning the 
> values and counts to the client.  It may inflate the size of index some, but 
> it sure beats my alternative proposals...
> 
> Cheers,
> 
> Jeff
> 
> On Nov 26, 2011, at 1:22 PM, Jeff Schmidt wrote:
> 
>> Hello:
>> 
>> I'm still not finding much joy with this issue.
>> 
>> For one, it looks like FacetComponent (via 
>> SimpleFacets.getFieldCacheCounts()) goes directly to the Lucene FieldCache 
>> (non-enum, multi-valued field, single string token) in order to get terms to 
>> count.  So, even if it were possible for me to somehow modify the 
>> ResponseBuilder in between the QueryComponent and FacetComponent, that won't 
>> do much good.
>> 
>> I'd rather not modify Solr/Lucene code and have a custom build (though 
>> that's not impossible in the short term), but QueryComponent does not 
>> provide sufficient access.  I suppose I could further investigate going the 
>> RequestHandler route.  But, let me know if this is crazy talk:
>> 
>> From what I can tell in org.apache.solr.request.SimpleFacets, line 366 
>> (sorry, no SCM info in source file, but is from the 3.4.0 source 
>> distribution):
>> 
>>   FieldCache.StringIndex si = 
>> FieldCache.DEFAULT.getStringIndex(searcher.getReader(), fieldName);
>>   final String[] terms = si.lookup;
>>   final int[] termNum = si.order;
>> 
>> SimpleFacets.getFieldCacheCounts() uses the response from the Lucene 
>> FieldCache to do its work.  My thought is to use AspectJ to place after 
>> advice on the Lucene method (org.apache.lucene.search.FieldCacheImpl), to 
>> modify the response.  I don't want to muck with the field cache itself. 
>> After all, the field values I don't want to count for this focusNodeId, I 
>> may well want to count for another.
>> 
>> Given the FieldCacheImpl method:
>> 
>> // inherit javadocs
>> public StringIndex getStringIndex(IndexReader reader, String field)
>>     throws IOException {
>>   return (StringIndex) caches.get(StringIndex.class).get(reader, new 
>> Entry(field, (Parser)null));
>> }
>> 
>> It seems I could take the returned StringIndex instance, and create a new 
>> filtered one, leaving the cached original intact. StringIndex (defined in 
>> FieldCache) is a public static class with a public constructor. Then, 
>> SimpleFacets will facet on what I provide it.
>> 
>> The other trick is to inform my aspect within Lucene just what the 
>> focusNodeId is, so it knows how to filter. This is request specific.  I'm 
>> running Solr within Tomcat. I've not looked exhaustively into how Solr 
>> threading works.  But, if the current app server request thread is used 
>> synchronously to satisfy any given SolrJ request, then I could provide a 
>> SearchComponent that looked for some special parameter that indicates the 
>> focusNodeId of interest, and then place it in a ThreadLocal which the 
>> interceptor could pick up.  If the ThreadLocal is not defined, then the 
>> interceptor does not filter (a definite scenario) and returns Lucene's 
>> StringIndex instance. If there is another thread involved in handling the 
>> request, then more investigation is needed.
>> 
>> Any inside information would be appreciated.  Or, a firm statement that I 
>> should not go there would also be appreciated. :)
>> 
>> Cheers,
>> 
>> Jeff
>> 
>> On Nov 21, 2011, at 4:31 PM, Jeff Schmidt wrote:
>> 
>>> Hello:
>>> 
>>> Solr version: 3.4.0
>>> 
>>> I'm trying to figure out if it's possible to both return (retrieval) as 
>>> well as facet on certain values of a multivalued field.  The scenario is a 
>>> life science app comprised of a graph of nodes (genes, chemicals etc.) and 
>>> each node has a "neighborhood" consisting of one or more nodes with which 
>>> it has relationships defined as "processes" ("inhibition", 
>>> "phosphorylation" etc.).
>>> 
>>> What I've done is add a number of multi-valued fields to each node 
>>> consisting of the neighbor node ID (neighbor's document ID), process, and 
>>> a couple of other related items.  For a given node, it'll have multiple 
>>> neighbors, as well as multiple processes with a single neighbor.  For 
>>> example, in schema.xml:
>>> 
>>>    <field name="id" type="string" indexed="true" stored="true" 
>>> required="true" /> 
>>> 
>>>    <!-- Network neighborhood fields -->
>>>    <field name="n_neighborof_id" type="string" indexed="true" stored="true" 
>>> multiValued="true" />
>>>    <field name="n_neighborof_name" type="text_lc_np" indexed="true" 
>>> stored="true" multiValued="true" termVectors="true" />
>>>    <field name="n_neighborof_process" type="text_lc_np" indexed="true" 
>>> stored="true" multiValued="true" termVectors="true" />
>>>    <field name="n_neighborof_processExact" type="string" indexed="true" 
>>> stored="true" multiValued="true" termVectors="true" />
>>>    <field name="n_neighborof_edge_type" type="string" indexed="true" 
>>> stored="true" multiValued="true" />
>>>    <field name="n_neighborof_is_direct" type="boolean" indexed="true" 
>>> stored="true" multiValued="true" />
>>>    <field name="n_neighborof_count" type="sint" indexed="false" 
>>> stored="true" multiValued="true" />
>>> 
>>> Note that the type text_lc_np simply lowercases and ignores punctuation.
>>> 
>>> So, when I want the neighbors of a given node, I define a filter query like 
>>> fq=n_neighborof_id:someFocusNodeId and I get all of the neighbors. 
>>> That's exactly what I want in terms of documents. There are a number of per 
>>> document fields that are returned with the search results.  This includes 
>>> the actual process information defined above. Not surprisingly, I get all 
>>> of the values for each field. But I do not want them all; I only want those 
>>> that pertain to the specified focus node ID.
>>> 
>>> For now, my workaround for the retrieval aspect of this is for my 
>>> application to chuck the irrelevant values.  That is, for a set of related 
>>> field values, if n_neighborof_id != focusNodeId, then out they go. While 
>>> this gets the job done, it is quite wasteful in terms of processing by 
>>> both Solr and my app, as well as bandwidth.
>>> 
>>> Now I need to facet on a couple of the neighbor fields. Solr returns counts 
>>> relevant to all processes defined within the document result set. Again, 
>>> that is expected, but not what I want.  I'd like Solr to compute facet 
>>> counts only for processes relevant to the specified focus node, much like 
>>> my filter query to get the document results.
>>> 
>>> Is this possible?  I've looked at grouping queries, though those are 
>>> document centric and do not work for multivalued fields. I've looked into 
>>> implementing my own SearchComponent within the Solr server.  It sounded 
>>> ideal to drop something I have control over right between the standard 
>>> query and facet components. I figured I could eliminate the undesired 
>>> fields at that point, both solving my first problem of having to toss 
>>> irrelevant processes in my app, and having Solr compute facet values using 
>>> only the desired processes.  But, there are comments in the Solr source 
>>> code that stipulate a component must not modify the document set.  For 
>>> example, in org.apache.solr.search.DocSet:
>>> 
>>> /**
>>> * <code>DocSet</code> represents an unordered set of Lucene Document Ids.
>>> *
>>> * <p>
>>> * WARNING: Any DocSet returned from SolrIndexSearcher should <b>not</b> be 
>>> modified as it may have been retrieved from
>>> * a cache and could be shared.
>>> * </p>
>>> *
>>> * @version $Id: DocSet.java 1065312 2011-01-30 16:08:25Z rmuir $
>>> * @since solr 0.9
>>> */
>>> 
>>> Perhaps I cannot use this avenue to accomplish my goals?  But, I don't need 
>>> to modify the document set itself (IDs etc.), just trim the field values 
>>> per document. Does that make sense?
>>> 
>>> I may well have to re-evaluate my data model, but I'd like to get what I 
>>> need with what I have currently defined if possible.
>>> 
>>> Thanks,
>>> 
>>> Jeff
>>> --
>>> Jeff Schmidt
>>> 535 Consulting
>>> j...@535consulting.com
>>> http://www.535consulting.com
>>> (650) 423-1068
>>> 
>> 
>> 
>> 
>> --
>> Jeff Schmidt
>> 535 Consulting
>> j...@535consulting.com
>> http://www.535consulting.com
>> (650) 423-1068
> 

--
Jeff Schmidt
535 Consulting
j...@535consulting.com
http://www.535consulting.com
(650) 423-1068
