Re: access matched token ids in the FacetComponent?

Dmitry Kan Tue, 05 Mar 2013 06:17:27 -0800

Hello,

I spent some more time on this and used Mikhail's suggestions of which
classes would need to be implemented.


1. Since we use the SpanQuery family, we would need to modify the SpanScorer to
collect some statistics over the matched spans.
2. DelegatingCollector receives the Scorer via its setScorer() method, so the
collector will have access to the statistics collected in the SpanScorer
class.
3. This DelegatingCollector class should then be referenced in the
SolrIndexSearcher class. Some getter methods will be needed for accessing the
above statistics.
4. Make use of this modified SolrIndexSearcher in the SimpleFacets class.
5. Access the statistics that are visible in the SimpleFacets class from the
process() method of the FacetComponent.
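
To make the data flow in steps 1-3 concrete, here is a rough plain-Java
sketch. It deliberately does not use the real Lucene/Solr classes:
SpanStatsScorer, SentenceCountingCollector and the per-document sentence start
positions are hypothetical stand-ins, and how sentence boundaries would
actually be obtained (payloads, a stored field, etc.) is an assumption left
open here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Plain-Java stand-ins only: none of these types are the real Lucene/Solr classes.
class SpanStatsSketch {

    /** Hypothetical scorer view exposing the spans matched in the current doc. */
    interface SpanStatsScorer {
        List<int[]> matchedSpans(); // each entry is {startPosition, endPosition}
    }

    /** Hypothetical collector playing the role of a DelegatingCollector subclass. */
    static class SentenceCountingCollector {
        // Assumption: sorted token positions at which each sentence starts, per doc id.
        private final Map<Integer, int[]> sentenceStartsByDoc;
        private final Map<Integer, Integer> sentenceHitsByDoc = new LinkedHashMap<>();
        private SpanStatsScorer scorer;

        SentenceCountingCollector(Map<Integer, int[]> sentenceStartsByDoc) {
            this.sentenceStartsByDoc = sentenceStartsByDoc;
        }

        // Step 2: the collector is handed the scorer and keeps a reference to it.
        void setScorer(SpanStatsScorer scorer) {
            this.scorer = scorer;
        }

        // Called once per matching doc: count the distinct sentences containing a span start.
        void collect(int docId) {
            int[] starts = sentenceStartsByDoc.get(docId);
            TreeSet<Integer> sentences = new TreeSet<>();
            for (int[] span : scorer.matchedSpans()) {
                sentences.add(sentenceIndex(starts, span[0]));
            }
            sentenceHitsByDoc.put(docId, sentences.size());
        }

        // Step 3: getter through which a searcher or facet code could read the statistics.
        Map<Integer, Integer> getSentenceHitsByDoc() {
            return sentenceHitsByDoc;
        }
    }

    /** Index of the sentence containing a position: the last start <= position. */
    static int sentenceIndex(int[] sortedStarts, int position) {
        int idx = Arrays.binarySearch(sortedStarts, position);
        return idx >= 0 ? idx : -idx - 2;
    }

    public static void main(String[] args) {
        Map<Integer, int[]> starts = new LinkedHashMap<>();
        starts.put(7, new int[] {0, 10, 25}); // doc 7 has three sentences

        List<int[]> spans = new ArrayList<>();
        spans.add(new int[] {3, 5});   // falls in sentence 0
        spans.add(new int[] {12, 14}); // falls in sentence 1
        spans.add(new int[] {17, 19}); // falls in sentence 1 again

        SentenceCountingCollector collector = new SentenceCountingCollector(starts);
        collector.setScorer(() -> spans);
        collector.collect(7);
        System.out.println(collector.getSentenceHitsByDoc()); // prints {7=2}
    }
}
```

In the real code the scorer would have to refresh its matched spans for every
document before collect() is called; the sketch fixes them up front only to
keep the example short.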

Does this sound like an accurate list of classes to modify? Am I missing
something? Are there any roadblocks?

Dmitry

On Wed, Jan 23, 2013 at 12:47 PM, Dmitry Kan <solrexp...@gmail.com> wrote:

> Thanks Alexandre for correcting the link and Mikhail for sharing the ideas!
>
> Mikhail,
>
> I will need to look closer at your customization of SpansFacetComponent in
> the blog post.
> Is it the case that this component accesses and counts the
> matched spans?
>
> Thanks,
>
> Dmitry
>
>
> On Tue, Jan 22, 2013 at 9:17 PM, Mikhail Khludnev <
> mkhlud...@griddynamics.com> wrote:
>
>> Dmitry,
>>
>> Solr faceting is really fast because it uses an in-memory approach (with a
>> few notable exceptions), so spans should be slower. Reading term
>> positions/payloads always has a noticeable cost. You can estimate it by
>> comparing the time for a phrase query "foo bar" with that of a plain
>> conjunction +foo +bar.
>> It is worth mentioning that our SpansFacetComponent performed well enough,
>> even for a public site. You can find my comment about the performance numbers:
>> "64K docs with 5-20 span positions each. Search result length 100-2000
>> docs with 3-5 facet fields. It shows 100 q/sec on an average datacenter
>> box."
>>
>>
>> On Mon, Jan 21, 2013 at 5:23 PM, Dmitry Kan <solrexp...@gmail.com> wrote:
>>
>> > Mikhail,
>> >
>> > Thanks for the guidance! This indeed sounds challenging, esp. given the
>> > bonus of fighting with Solr 3.x in light of disjunction queries. That
>> > said, moving to Solr 4.0 should be OK if it makes life easier.
>> >
>> > But even before getting one's hands dirty, it would be good to know if
>> > this is going to fly performance-wise. Has your span-based implementation
>> > been fast enough? Did it come close to Solr's native faceting in terms
>> > of performance?
>> >
>> > On Mon, Jan 21, 2013 at 2:33 PM, Mikhail Khludnev <
>> > mkhlud...@griddynamics.com> wrote:
>> >
>> > > Dmitry,
>> > >
>> > > First of all, FacetComponent is Solr's out-of-the-box functionality. It
>> > > runs after the search is done and accesses the bitSet of the found
>> > > documents, i.e. there are no spans (matched term positions) there at all.
>> > >
>> > > StandardFacetsAccumulator sounds like the "brand new" Lucene faceting
>> > > library; see http://shaierera.blogspot.com/. I don't think the spans are
>> > > accessible there either, but I don't know exactly.
>> > >
>> > > Some time ago my team successfully prototyped a facet component backed by
>> > > spans:
>> > > blog.griddynamics.com/2011/10/solr-experience-search-parent-child.html
>> > > but I don't suggest you go this way.
>> > > I can suggest you start from the following:
>> > > - supply PostFilter/DelegatingCollector
>> > > http://yonik.com/posts/advanced-filter-caching-in-solr/
>> > > - the DelegatingCollector will accept the scorer instance
>> > > - if this scorer is BooleanScorer2 (but not BooleanScorer!), you can access
>> > > the SpanQueryScorer in one of the legs and try to access the matched spans
>> > > - if you are in 3.x you'll have a problem with disjunction queries.
>> > >
>> > > it seems challenging, doesn't it?
>> > >
>> > > On 18.01.2013 at 17:40, "Dmitry Kan" <solrexp...@gmail.com> wrote:
>> > >
>> > > > Mikhail,
>> > > >
>> > > > Are you saying that it is not possible to access the matched term
>> > > > positions in the FacetComponent? If that were possible (somewhere in the
>> > > > StandardFacetsAccumulator class, where doc ids are available), then by
>> > > > knowing the matched term positions I could do some simple school math to
>> > > > calculate the sentence counts per doc id.
>> > > >
>> > > > Dmitry
>> > > >
>> > > > On Fri, Jan 18, 2013 at 2:45 PM, Mikhail Khludnev <
>> > > > mkhlud...@griddynamics.com> wrote:
>> > > >
>> > > > > Dmitry,
>> > > > >
>> > > > > It definitely seems like postprocessing the highlighter's output.
>> > > > > Another approach is:
>> > > > > - limit the number of occurrences of a word in a sentence to 1
>> > > > > - play with the facet-by-function patch
>> > > > > https://issues.apache.org/jira/browse/SOLR-1581 accompanied by the tf()
>> > > > > function.
>> > > > >
>> > > > > It doesn't seem like much help.
>> > > > >
>> > > > > On Fri, Jan 18, 2013 at 12:42 PM, Dmitry Kan <solrexp...@gmail.com>
>> > > > > wrote:
>> > > > >
>> > > > > > that we actually require the count of the sentences inside
>> > > > > > each document where the hits were found.
>> > > > > >
>> > > > >
>> > > > >
>> > > > >
>> > > > >
>> > > > > --
>> > > > > Sincerely yours
>> > > > > Mikhail Khludnev
>> > > > > Principal Engineer,
>> > > > > Grid Dynamics
>> > > > >
>> > > > > <http://www.griddynamics.com>
>> > > > >  <mkhlud...@griddynamics.com>
>> > > > >
>> > > >
>> > >
>> >
>>
>>
>>
>> --
>> Sincerely yours
>> Mikhail Khludnev
>> Principal Engineer,
>> Grid Dynamics
>>
>> <http://www.griddynamics.com>
>>  <mkhlud...@griddynamics.com>
>>
>
>
