Re: Returning and faceting on some of the field's values

Jeff Schmidt Sat, 26 Nov 2011 12:22:37 -0800

Hello:

I'm still not finding much joy with this issue.


For one, it looks like FacetComponent (via SimpleFacets.getFieldCacheCounts()) 
goes directly to the Lucene FieldCache (non-enum, multi-valued field, single 
string token) in order to get terms to count.  So, even if it were possible for 
me to somehow modify the ResponseBuilder in between the QueryComponent and 
FacetComponent, that won't do much good.

I'd rather not modify Solr/Lucene code and have a custom build (though that's 
not impossible in the short term), but QueryComponent does not provide 
sufficient access.  I suppose I could further investigate going the 
RequestHandler route.  But, let me know if this is crazy talk:

From what I can tell in org.apache.solr.request.SimpleFacets, line 366 (sorry, 
no SCM info in the source file, but it is from the 3.4.0 source distribution):

    FieldCache.StringIndex si =
        FieldCache.DEFAULT.getStringIndex(searcher.getReader(), fieldName);
    final String[] terms = si.lookup;
    final int[] termNum = si.order;

SimpleFacets.getFieldCacheCounts() uses the response from the Lucene FieldCache 
to do its work.  My thought is to use AspectJ to place after advice on the 
Lucene method (in org.apache.lucene.search.FieldCacheImpl) to modify the 
response.  I don't want to muck with the field cache itself. After all, the 
field values I don't want to count for this focusNodeId, I may well want to 
count for another.

Given the FieldCacheImpl method:

  // inherit javadocs
  public StringIndex getStringIndex(IndexReader reader, String field)
      throws IOException {
    return (StringIndex) caches.get(StringIndex.class).get(reader,
        new Entry(field, (Parser)null));
  }

It seems I could take the returned StringIndex instance and create a new 
filtered one, leaving the cached original intact. StringIndex (defined in 
FieldCache) is a public static class with a public constructor. Then, 
SimpleFacets will facet on what I provide it.
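
To make the idea concrete, here is a minimal sketch of building a filtered 
copy without touching the cached instance. StringIndexStandIn is a 
hypothetical stand-in that mirrors what I understand of Lucene 3.x's 
FieldCache.StringIndex (a sorted lookup array whose slot 0 is null for "no 
term", and an order array with one ordinal per document); StringIndexFilter 
and its keep predicate are my own names, not anything in Lucene or Solr.

```java
import java.util.function.Predicate;

// Hypothetical stand-in for Lucene 3.x FieldCache.StringIndex.
// Assumption: lookup[0] is null and means "document has no term".
final class StringIndexStandIn {
    final String[] lookup; // sorted terms, slot 0 reserved for "no term"
    final int[] order;     // order[docId] is an index into lookup

    StringIndexStandIn(int[] order, String[] lookup) {
        this.order = order;
        this.lookup = lookup;
    }
}

final class StringIndexFilter {
    // Returns a new index in which every document whose term is rejected by
    // 'keep' points at slot 0. The cached arrays are only read, never written.
    static StringIndexStandIn filter(StringIndexStandIn cached, Predicate<String> keep) {
        int[] filteredOrder = new int[cached.order.length];
        for (int doc = 0; doc < cached.order.length; doc++) {
            int ord = cached.order[doc];
            String term = cached.lookup[ord];
            filteredOrder[doc] = (term != null && keep.test(term)) ? ord : 0;
        }
        // The lookup array is shared read-only; only 'order' is copied.
        return new StringIndexStandIn(filteredOrder, cached.lookup);
    }
}
```

Note that this structure holds a single ordinal per document, so the sketch 
can only hide or keep a document's one term; it does not address multiple 
values per document.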

The other trick is to inform my aspect within Lucene just what the 
focusNodeId is, so it knows how to filter. This is request specific.  I'm 
running Solr within Tomcat. I've not looked exhaustively into how Solr 
threading works.  But, if the current app server request thread is used 
synchronously to satisfy any given SolrJ request, then I could provide a 
SearchComponent that looked for some special parameter that indicates the 
focusNodeId of interest, and then place it in a ThreadLocal which the 
interceptor could pick up.  If the ThreadLocal is not defined, then the 
interceptor does not filter (a definite scenario) and returns Lucene's 
StringIndex instance. If there is another thread involved in handling the 
request, then more investigation is needed.
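
A rough sketch of the ThreadLocal hand-off, assuming the whole request really 
is served on one thread. FocusNodeContext and callWith are hypothetical names 
of mine; the SearchComponent would call something like callWith (or set and 
clear), and the interceptor would call get and skip filtering on null.

```java
import java.util.function.Supplier;

// Hypothetical per-request holder for the focus node ID.
final class FocusNodeContext {
    private static final ThreadLocal<String> FOCUS_NODE_ID = new ThreadLocal<String>();

    // Returns the ID set for the current thread, or null if none is set
    // (in which case the interceptor would not filter).
    static String get() {
        return FOCUS_NODE_ID.get();
    }

    // Runs 'work' with the ID visible on this thread, then always clears it.
    static <T> T callWith(String focusNodeId, Supplier<T> work) {
        FOCUS_NODE_ID.set(focusNodeId);
        try {
            return work.get();
        } finally {
            // App servers pool threads, so a stale value would otherwise
            // leak into whatever request this thread serves next.
            FOCUS_NODE_ID.remove();
        }
    }
}
```

The try/finally is the part I would not want to skip: without it, a request 
that fails midway would leave its focusNodeId behind on the pooled thread.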

Any inside information would be appreciated.  Or, a firm statement that I 
should not go there would also be appreciated. :)

Cheers,

Jeff

On Nov 21, 2011, at 4:31 PM, Jeff Schmidt wrote:

> Hello:
> 
> Solr version: 3.4.0
> 
> I'm trying to figure out if it's possible to both return (retrieval) as well 
> as facet on certain values of a multivalued field.  The scenario is a life 
> science app comprised of a graph of nodes (genes, chemicals etc.) and each 
> node has a "neighborhood" consisting of one or more nodes with which it has 
> relationships defined as "processes" ("inhibition", "phosphorylation" etc.).
> 
> What I've done is add a number of multi-valued fields to each node consisting 
> of the neighbor node ID (neighbor's document ID), process, and a couple of 
> other related items.  For a given node, it'll have multiple neighbors, as 
> well as multiple processes with a single neighbor.  For example, in 
> schema.xml:
> 
>      <field name="id" type="string" indexed="true" stored="true" required="true" />
> 
>      <!-- Network neighborhood fields -->
>      <field name="n_neighborof_id" type="string" indexed="true" stored="true" multiValued="true" />
>      <field name="n_neighborof_name" type="text_lc_np" indexed="true" stored="true" multiValued="true" termVectors="true" />
>      <field name="n_neighborof_process" type="text_lc_np" indexed="true" stored="true" multiValued="true" termVectors="true" />
>      <field name="n_neighborof_processExact" type="string" indexed="true" stored="true" multiValued="true" termVectors="true" />
>      <field name="n_neighborof_edge_type" type="string" indexed="true" stored="true" multiValued="true" />
>      <field name="n_neighborof_is_direct" type="boolean" indexed="true" stored="true" multiValued="true" />
>      <field name="n_neighborof_count" type="sint" indexed="false" stored="true" multiValued="true" />
> 
> Note that the type text_lc_np simply lowercases and ignores punctuation.
> 
> So, when I want the neighbors of a given node, I define a filter query like 
> fq=n_neighborof_id:someFocusNodeId and I get all of the neighbors. That's 
> exactly what I want in terms of documents. There are a number of per-document 
> fields that are returned with the search results.  This includes the actual 
> process information defined above. Not surprisingly, I get all of the 
> values for each field. But I do not want them; I only want those that pertain 
> to the specified focus node ID.
> 
> For now, my workaround for the retrieval aspect of this is for my application 
> to chuck the irrelevant values.  That is, for a set of related field values, 
> if n_neighborof_id != focusNodeId, then out they go. While this gets the job 
> done, it is quite wasteful in terms of processing by both Solr and my 
> app, as well as bandwidth.
> 
> Now I need to facet on a couple of the neighbor fields. Solr returns counts 
> relevant to all processes defined within the document result set. Again, that 
> is expected, but not what I want.  I'd like Solr to compute facet counts only 
> for processes relevant to the specified focus node, much like my filter query 
> to get the document results.
> 
> Is this possible?  I've looked at grouping queries, though those are document 
> centric and do not work for multivalued fields. I've looked into implementing 
> my own SearchComponent within the Solr server.  It sounded ideal to drop 
> something I have control over right between the standard query and facet 
> components. I figured I could eliminate the undesired fields at that point, 
> both solving my first problem of having to toss irrelevant processes in my 
> app, and having Solr compute facet values using only the desired processes.  
> But, there are comments in the Solr source code that stipulate a component 
> must not modify the document set.  For example, in 
> org.apache.solr.search.DocSet:
> 
> /**
> * <code>DocSet</code> represents an unordered set of Lucene Document Ids.
> *
> * <p>
> * WARNING: Any DocSet returned from SolrIndexSearcher should <b>not</b> be
> * modified as it may have been retrieved from a cache and could be shared.
> * </p>
> *
> * @version $Id: DocSet.java 1065312 2011-01-30 16:08:25Z rmuir $
> * @since solr 0.9
> */
> 
> Perhaps I cannot use this avenue to accomplish my goals?  But, I don't need 
> to modify the document set itself (IDs etc.), just trim the field values per 
> document. Does that make sense?
> 
> I may well have to re-evaluate my data model, but I'd like to get what I need 
> with what I have currently defined if possible.
> 
> Thanks,
> 
> Jeff
> --
> Jeff Schmidt
> 535 Consulting
> j...@535consulting.com
> http://www.535consulting.com
> (650) 423-1068



--
Jeff Schmidt
535 Consulting
j...@535consulting.com
http://www.535consulting.com
(650) 423-1068
