One more question here - is this topic more appropriate to a different list?


On Mon, Aug 26, 2013 at 4:38 PM, Dan Davis <dansm...@gmail.com> wrote:

> I have now come to the task of estimating man-days to add "Blended Search
> Results" to Apache Solr.   The argument has been made that this is not
> desirable (see Jonathan Rochkind's blog entries on Bento search with
> Blacklight).   But the estimate remains, and no estimate is worth much
> without a design.   So, I have come to the difficulty of estimating this
> without in-depth knowledge of the Apache Solr core.   Here is my
> design, likely imperfect, as it stands.
>
>    - Configure a core specific to each search source (local or remote)
>    - On cores that index remote content, implement a periodic delete
>    query that deletes documents whose timestamp is too old
>    - Implement a custom requestHandler for the "remote" cores that goes
>    out and queries the remote source.   For each result in the top N
>    (configurable), it computes a stable id (e.g. based on the remote
>    resource URL, DOI, or a hash of the data returned).   It uses that id
>    to look up the document in the Lucene index.   If the document is not
>    there, it adds it to the Lucene core and sets a flag that a commit is
>    required.   Once it is done, it commits if needed.
>    - Configure a core that uses a custom SearchComponent to call the
>    requestHandler that goes and gets new documents and commits them.
>    Since the cores for remote content are separate cores, they can reopen
>    their searchers at this point if any commit occurred.   The custom
>    SearchComponent will wait for the commit and reload to complete.
>    Then the search continues, using the other cores as "shards".
>    - Auto-warming will ensure that the most recently requested
>    data is present.
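The stable-id step in the third bullet can be sketched as follows.  This is
only the id logic, in Python for illustration (a real Solr requestHandler
would be a Java plugin), and the preference order URL, then DOI, then
content hash is an assumption:

```python
import hashlib

def stable_id(url=None, doi=None, raw_data=None):
    """Derive a deterministic document id from whichever identifier the
    remote source provides, so repeated fetches update rather than
    duplicate.  Assumed preference: URL, then DOI, then a data hash."""
    for prefix, value in (("url", url), ("doi", doi)):
        if value:
            return prefix + ":" + hashlib.sha256(value.encode("utf-8")).hexdigest()
    if raw_data is not None:
        return "hash:" + hashlib.sha256(raw_data).hexdigest()
    raise ValueError("no identifier available for remote result")
```

Because the id is deterministic, re-fetching the same remote result maps to
the same Lucene document and becomes an update rather than a duplicate add.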
>
> It will, of course, be very slow a good part of the time.
>
> Erick and others, I need to know whether this design has legs and what
> other alternatives I might consider.
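The periodic delete in the second bullet could be an ordinary Solr
delete-by-query posted to each remote core's /update handler, e.g. from
cron.  The field name fetched_at and the seven-day window are assumptions:

```xml
<delete>
  <query>fetched_at:[* TO NOW-7DAYS]</query>
</delete>
```

Posting that body to /solr/<corename>/update and then committing would
purge cached remote documents older than a week.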
>
>
>
> On Sun, Aug 18, 2013 at 3:14 PM, Erick Erickson 
> <erickerick...@gmail.com>wrote:
>
>> The lack of global TF/IDF has been answered in the past,
>> in the sharded case, with "usually you have similar enough
>> stats that it doesn't matter".   This presupposes a fairly
>> evenly distributed set of documents.
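The shard-local statistics problem can be made concrete: each shard computes
idf from its own document frequencies, so the same term can score very
differently per shard.  A small numeric illustration using the classic
idf = ln(N/df) (Lucene's similarity uses a smoothed variant, and the shard
numbers here are invented):

```python
import math

def idf(total_docs, doc_freq):
    # classic idf; Lucene's TFIDFSimilarity uses a smoothed variant
    return math.log(total_docs / doc_freq)

# Hypothetical shards: same term, very different local document frequencies.
shard_a = idf(total_docs=1_000_000, doc_freq=50)     # term is rare on shard A
shard_b = idf(total_docs=10_000, doc_freq=5_000)     # term is common on shard B
# The term looks far more discriminating on shard A, so raw scores from
# the two shards are not directly comparable without global stats.
```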
>>
>> But if you're talking about federated search across different
>> types of documents, then what would you "rescore" with?
>> How would you even consider scoring docs that are somewhat or
>> totally different?   Think magazine articles and metadata
>> associated with pictures.
>>
>> What I've usually found is that one can use grouping to show
>> the top N of a variety of results. Or show tabs with different
>> types. Or have the app intelligently combine the different types
>> of documents in a way that "makes sense". But I don't know
>> how you'd just get "the right thing" to happen with some kind
>> of scoring magic.
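As a concrete example of the grouping option, Solr's result grouping can
collapse a single blended core into per-type buckets; the field name
doc_type is hypothetical:

```
q=heart disease&group=true&group.field=doc_type&group.limit=3
```

This returns the top 3 documents for each doc_type value, which maps
naturally onto the tabbed or bento-box presentations mentioned above.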
>>
>> Best
>> Erick
>>
>>
>> On Fri, Aug 16, 2013 at 4:07 PM, Dan Davis <dansm...@gmail.com> wrote:
>>
>>> I've thought about it, and I have no time to really do a meta-search
>>> during
>>> evaluation.  What I need to do is to create a single core that contains
>>> both of my data sets, and then describe the architecture that would be
>>> required to do blended results, with liberal estimates.
>>>
>>> From the perspective of evaluation, I need to understand whether any
>>> of the solutions to better ranking in the absence of global IDF have
>>> been explored.   I suspect that one could retrieve a much larger than
>>> N set of results from each shard and re-score them in some way that
>>> doesn't require IDF, e.g. by placing both result sets in the same
>>> priority queue and *re-scoring* before *re-ranking*.
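The re-score-then-re-rank idea could look like the sketch below: pull the
top-K (K much larger than N) from each source, discard the incomparable
engine scores, re-score with a shared IDF-free function, and merge in one
priority queue.  The lexical-overlap score here is a placeholder assumption,
not a proposal for the real function:

```python
import heapq

def rescore(query, doc_text):
    """Placeholder IDF-free score: fraction of query terms present.
    A real deployment would substitute a comparable scoring function."""
    terms = query.lower().split()
    text = doc_text.lower()
    return sum(t in text for t in terms) / len(terms)

def blend(query, result_sets, n=10):
    """Merge per-source top-K lists, discarding each engine's own score
    and re-ranking everything on the shared rescore() value."""
    heap = []
    for source, docs in result_sets.items():
        for doc_id, text in docs:
            # negate the score because heapq is a min-heap
            heapq.heappush(heap, (-rescore(query, text), source, doc_id))
    return [heapq.heappop(heap)[1:] for _ in range(min(n, len(heap)))]
```

The key property is that every document, whatever its source, is ranked by
the same function, so no cross-shard IDF agreement is needed.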
>>>
>>> The other way to do this would be to have a custom SearchHandler that
>>> works differently - it performs the query, retrieves all results
>>> deemed relevant by another engine, adds them to the Lucene index, and
>>> then performs the query again in the standard way.   This would be
>>> quite slow, but perhaps useful as a way to evaluate my method.
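That evaluate-only handler is essentially just control flow; here is a
minimal sketch in which a dict stands in for the Lucene core, and
fetch_remote is a hypothetical callable that asks the other engine for the
documents it deems relevant:

```python
def two_pass_search(query, index, fetch_remote, search):
    """Fold the other engine's relevant documents into the local index,
    then run the normal query against the augmented index (pass 2)."""
    for doc_id, doc in fetch_remote(query):
        if doc_id not in index:       # stable ids prevent duplicates
            index[doc_id] = doc       # stands in for add + commit
    return search(query, index)
```

With real cores, the add-and-commit step is what makes this slow: every
novel query pays for remote retrieval, indexing, and a commit before the
second pass can run.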
>>>
>>> I still welcome any suggestions on how such a SearchHandler could be
>>> implemented.
>>>
>>
>>
>
