Re: Returning and faceting on some of the field's values

Schmidt Jeff Mon, 28 Nov 2011 23:22:43 -0800

Well, here's something that might just work.  By prefixing the values of the 
field I want to facet on with the neighbor node ID, and then using Solr's 
facet.prefix parameter (I'm running 3.4.0), I get what I need.


Adding the field:

    <field name="n_directionalityFacet" type="string" indexed="true" stored="false" multiValued="true" omitNorms="true" />

Then, I prefix each value with {nodeId}-.  For example, using the focus node 
ID ING:afa, I filter the result document set down to all of the neighbors of 
that node.  I also tell Solr to facet using that same focus node ID as the 
prefix:

http://localhost:8091/solr/ing-content/select/?qt=partner-tmo&fq=type%3Anode&fq=n_neighborof_id%3AING\:afa&rows=0&facet=true&facet.mincount=1&facet.field=n_directionalityFacet&f.n_directionalityFacet.facet.prefix=ING%3Aafa

And, for that particular facet, I get only the values and counts relevant to 
the focus node ID:

<lst name="facet_fields">
  <lst name="n_directionalityFacet">
    <int name="ING:afa-D">82</int>
    <int name="ING:afa-B">2</int>
    <int name="ING:afa-A">1</int>
    <int name="ING:afa-U">1</int>
  </lst>
</lst>

My app can then take this response and remove the prefix before returning the 
values and counts to the client.  It may inflate the size of the index 
somewhat, but it sure beats my alternative proposals...
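
A minimal sketch of the two application-side steps, assuming the facet counts 
have already been read from the Solr response into a plain Map.  The class and 
method names (FacetPrefixSketch, prefixValue, stripPrefix) are hypothetical, 
and nothing here calls Solr or SolrJ:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FacetPrefixSketch {

    // Hypothetical helper: builds the value to index, "{nodeId}-{value}".
    public static String prefixValue(String nodeId, String value) {
        return nodeId + "-" + value;
    }

    // Hypothetical helper: keeps only the entries that start with "{nodeId}-"
    // and strips that prefix from their keys, preserving the original order.
    public static Map<String, Integer> stripPrefix(String nodeId, Map<String, Integer> counts) {
        String prefix = nodeId + "-";
        Map<String, Integer> result = new LinkedHashMap<String, Integer>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getKey().startsWith(prefix)) {
                result.put(e.getKey().substring(prefix.length()), e.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Facet counts as they appear in the response above.
        Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
        counts.put(prefixValue("ING:afa", "D"), 82);
        counts.put(prefixValue("ING:afa", "B"), 2);
        counts.put(prefixValue("ING:afa", "A"), 1);
        counts.put(prefixValue("ING:afa", "U"), 1);

        System.out.println(stripPrefix("ING:afa", counts)); // prints {D=82, B=2, A=1, U=1}
    }
}
```

One thing that may be worth checking: since facet.prefix is a plain string 
prefix match on the indexed terms, a prefix of ING:afa would also match values 
for any node ID that merely starts with those characters (a hypothetical 
ING:afab, say).  Including the trailing hyphen in the prefix, as stripPrefix 
does, would avoid that.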

Cheers,

Jeff

On Nov 26, 2011, at 1:22 PM, Jeff Schmidt wrote:

> Hello:
> 
> I'm still not finding much joy with this issue.
> 
> For one, it looks like FacetComponent (via 
> SimpleFacets.getFieldCacheCounts()) goes directly to the Lucene FieldCache 
> (non-enum, multi-valued field, single string token) in order to get terms to 
> count.  So, even if it were possible for me to somehow modify the 
> ResponseBuilder in between the QueryComponent and FacetComponent, that won't 
> do much good.
> 
> I'd rather not modify Solr/Lucene code and have a custom build (though that's 
> not impossible in the short term), but QueryComponent does not provide 
> sufficient access.  I suppose I could further investigate going the 
> RequestHandler route.  But, let me know if this is crazy talk:
> 
> From what I can tell in org.apache.solr.request.SimpleFacets, line 366 
> (sorry, no SCM info in the source file, but it is from the 3.4.0 source 
> distribution):
> 
>    FieldCache.StringIndex si = FieldCache.DEFAULT.getStringIndex(searcher.getReader(), fieldName);
>    final String[] terms = si.lookup;
>    final int[] termNum = si.order;
> 
> SimpleFacets.getFieldCacheCounts() uses the response from the Lucene 
> FieldCache to do its work.  My thought is to use AspectJ to place after 
> advice on the Lucene method (in org.apache.lucene.search.FieldCacheImpl) to 
> modify the response.  I don't want to muck with the field cache itself. After 
> all, the field values I don't want to count for this focusNodeId, I may well 
> want to count for another.
> 
> Given the FieldCacheImpl method:
> 
>  // inherit javadocs
>  public StringIndex getStringIndex(IndexReader reader, String field)
>      throws IOException {
>    return (StringIndex) caches.get(StringIndex.class).get(reader, new Entry(field, (Parser)null));
>  }
> 
> It seems I could take the returned StringIndex instance and create a new 
> filtered one, leaving the cached original intact. StringIndex (defined in 
> FieldCache) is a public static class with a public constructor. Then, 
> SimpleFacets will facet on what I provide it.
> 
> The other trick is to inform my aspect within Lucene just what the 
> focusNodeId is, so it knows how to filter. This is request specific.  I'm 
> running Solr within Tomcat. I've not looked exhaustively into how Solr 
> threading works.  But, if the current app server request thread is used 
> synchronously to satisfy any given SolrJ request, then I could provide a 
> SearchComponent that looks for some special parameter indicating the 
> focusNodeId of interest, and then places it in a ThreadLocal which the 
> interceptor could pick up.  If the ThreadLocal is not set, then the 
> interceptor does not filter (a definite scenario) and returns Lucene's 
> StringIndex instance. If there is another thread involved in handling the 
> request, then more investigation is needed.
> 
> Any inside information would be appreciated.  Or, a firm statement that I 
> should not go there would also be appreciated. :)
> 
> Cheers,
> 
> Jeff
> 
> On Nov 21, 2011, at 4:31 PM, Jeff Schmidt wrote:
> 
>> Hello:
>> 
>> Solr version: 3.4.0
>> 
>> I'm trying to figure out if it's possible to both return (retrieval) as well 
>> as facet on certain values of a multivalued field.  The scenario is a life 
>> science app comprised of a graph of nodes (genes, chemicals etc.) and each 
>> node has a "neighborhood" consisting of one or more nodes with which it has 
>> relationships defined as "processes" ("inhibition", "phosphorylation" 
>> etc.).
>> 
>> What I've done is add a number of multi-valued fields to each node 
>> consisting of the neighbor node ID (neighbor's document ID), process, and a 
>> couple of other related items.  A given node will have multiple 
>> neighbors, as well as multiple processes with a single neighbor.  For 
>> example, in schema.xml:
>> 
>>     <field name="id" type="string" indexed="true" stored="true" required="true" />
>> 
>>     <!-- Network neighborhood fields -->
>>     <field name="n_neighborof_id" type="string" indexed="true" stored="true" multiValued="true" />
>>     <field name="n_neighborof_name" type="text_lc_np" indexed="true" stored="true" multiValued="true" termVectors="true" />
>>     <field name="n_neighborof_process" type="text_lc_np" indexed="true" stored="true" multiValued="true" termVectors="true" />
>>     <field name="n_neighborof_processExact" type="string" indexed="true" stored="true" multiValued="true" termVectors="true" />
>>     <field name="n_neighborof_edge_type" type="string" indexed="true" stored="true" multiValued="true" />
>>     <field name="n_neighborof_is_direct" type="boolean" indexed="true" stored="true" multiValued="true" />
>>     <field name="n_neighborof_count" type="sint" indexed="false" stored="true" multiValued="true" />
>> 
>> Note that the type text_lc_np simply lowercases and ignores punctuation.
>> 
>> So, when I want the neighbors of a given node, I define a filter query like 
>> fq=n_neighborof_id:someFocusNodeId and I get all of the neighbors. 
>> That's exactly what I want in terms of documents. There are a number of per 
>> document fields that are returned with the search results.  This includes 
>> the actual process information defined above. Not surprisingly, I get 
>> all of the values for each field. But I do not want them all; I only want 
>> those that pertain to the specified focus node ID.
>> 
>> For now, my workaround for the retrieval aspect of this is for my 
>> application to chuck the irrelevant values.  That is, for a set of related 
>> field values, if n_neighborof_id != focusNodeId, then out they go. While 
>> this gets the job done, it is quite wasteful in terms of processing by 
>> both Solr and my app, as well as bandwidth.
>> 
>> Now I need to facet on a couple of the neighbor fields. Solr returns counts 
>> relevant to all processes defined within the document result set. Again, 
>> that is expected, but not what I want.  I'd like Solr to compute facet 
>> counts only for processes relevant to the specified focus node, much like my 
>> filter query to get the document results.
>> 
>> Is this possible?  I've looked at grouping queries, though those are 
>> document centric and do not work for multivalued fields. I've looked into 
>> implementing my own SearchComponent within the Solr server.  It sounded 
>> ideal to drop something I have control over right between the standard query 
>> and facet components. I figured I could eliminate the undesired fields at 
>> that point, both solving my first problem of having to toss irrelevant 
>> processes in my app, and having Solr compute facet values using only the 
>> desired processes.  But, there are comments in the Solr source code that 
>> stipulate a component must not modify the document set.  For example, in 
>> org.apache.solr.search.DocSet:
>> 
>> /**
>> * <code>DocSet</code> represents an unordered set of Lucene Document Ids.
>> *
>> * <p>
>> * WARNING: Any DocSet returned from SolrIndexSearcher should <b>not</b> be
>> * modified as it may have been retrieved from a cache and could be shared.
>> * </p>
>> *
>> * @version $Id: DocSet.java 1065312 2011-01-30 16:08:25Z rmuir $
>> * @since solr 0.9
>> */
>> 
>> Perhaps I cannot use this avenue to accomplish my goals?  But, I don't need 
>> to modify the document set itself (IDs etc.), just trim the field values per 
>> document. Does that make sense?
>> 
>> I may well have to re-evaluate my data model, but I'd like to get what I 
>> need with what I have currently defined if possible.
>> 
>> Thanks,
>> 
>> Jeff
>> --
>> Jeff Schmidt
>> 535 Consulting
>> j...@535consulting.com
>> http://www.535consulting.com
>> (650) 423-1068
>> 
