One more question here - is this topic more appropriate to a different list?
On Mon, Aug 26, 2013 at 4:38 PM, Dan Davis <dansm...@gmail.com> wrote:

> I have now come to the task of estimating man-days to add "Blended Search
> Results" to Apache Solr. The argument has been made that this is not
> desirable (see Jonathan Rochkind's blog entries on Bento search with
> Blacklight). But the estimate remains. No estimate is worth much
> without a design. So, I have come to the difficulty of estimating this
> without having an in-depth knowledge of the Apache core. Here is my
> design, likely imperfect, as it stands.
>
> - Configure a core specific to each search source (local or remote).
> - On cores that index remote content, implement a periodic delete
>   query that deletes documents whose timestamp is too old.
> - Implement a custom requestHandler for the "remote" cores that goes
>   out and queries the remote source. For each result in the top N
>   (configurable), it computes an id that is stable (e.g. it is based on
>   the remote resource URL, DOI, or a hash of the data returned). It uses
>   that id to look up the document in the Lucene database. If the data is
>   not there, it updates the Lucene core and sets a flag that a commit is
>   required. Once it is done, it commits if needed.
> - Configure a core that uses a custom SearchComponent to call the
>   requestHandler that goes and gets new documents and commits them. Since
>   the cores for remote content are different cores, they can restart their
>   searcher at this point if any commit is needed. The custom
>   SearchComponent will wait for the commit and reload to be completed.
>   Then the search continues, using the other cores as "shards".
> - Auto-warming on this will ensure that the most recently requested
>   data is present.
>
> It will, of course, be very slow a good part of the time.
>
> Erick and others, I need to know whether this design has legs and what
> other alternatives I might consider.
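To make the id and cache steps of the design above concrete, here is a minimal sketch, not Solr code. It assumes a plain dict stands in for a "remote" core, and every name in it (`stable_id`, `cache_results`, `purge_stale`, the `source:` prefix, the choice of SHA-1, and the DOI-before-URL preference) is hypothetical; the post only says the id should be based on the URL, DOI, or a hash of the data.

```python
import hashlib
from datetime import datetime, timedelta, timezone


def stable_id(source, url=None, doi=None, payload=None):
    """Derive a repeatable id for a remote result.

    The DOI > URL > payload-hash order is an assumption made here for
    illustration; the design only lists the three options.
    """
    if doi:
        key = "doi:" + doi.strip().lower()
    elif url:
        key = "url:" + url.strip()
    elif payload is not None:
        key = "data:" + hashlib.sha1(payload).hexdigest()
    else:
        raise ValueError("need a doi, url, or payload to build an id")
    return source + ":" + hashlib.sha1(key.encode("utf-8")).hexdigest()


def cache_results(index, source, results, now=None):
    """Add results not yet in `index`; return True if a commit is needed."""
    now = now or datetime.now(timezone.utc)
    commit_needed = False
    for result in results:
        doc_id = stable_id(source, url=result.get("url"), doi=result.get("doi"))
        if doc_id not in index:
            index[doc_id] = {**result, "id": doc_id, "timestamp": now}
            commit_needed = True
    return commit_needed


def purge_stale(index, max_age, now=None):
    """Drop documents older than `max_age`; return how many were removed."""
    now = now or datetime.now(timezone.utc)
    stale = [k for k, doc in index.items() if now - doc["timestamp"] > max_age]
    for k in stale:
        del index[k]
    return len(stale)
```

In a real handler the dict lookups would become a lookup and an update against the core, and `purge_stale` would correspond to the periodic delete query; the point here is only that a repeated remote hit maps to the same id and so triggers no second commit.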
>
>
>
> On Sun, Aug 18, 2013 at 3:14 PM, Erick Erickson
> <erickerick...@gmail.com> wrote:
>
>> The lack of global TF/IDF has been answered in the past,
>> in the sharded case, by "usually you have similar enough
>> stats that it doesn't matter". This pre-supposes a fairly
>> evenly distributed set of documents.
>>
>> But if you're talking about federated search across different
>> types of documents, then what would you "rescore" with?
>> How would you even consider scoring docs that are somewhat/
>> totally different? Think magazine articles and meta-data associated
>> with pictures.
>>
>> What I've usually found is that one can use grouping to show
>> the top N of a variety of results. Or show tabs with different
>> types. Or have the app intelligently combine the different types
>> of documents in a way that "makes sense". But I don't know
>> how you'd just get "the right thing" to happen with some kind
>> of scoring magic.
>>
>> Best
>> Erick
>>
>>
>> On Fri, Aug 16, 2013 at 4:07 PM, Dan Davis <dansm...@gmail.com> wrote:
>>
>>> I've thought about it, and I have no time to really do a meta-search
>>> during evaluation. What I need to do is to create a single core that
>>> contains both of my data sets, and then describe the architecture that
>>> would be required to do blended results, with liberal estimates.
>>>
>>> From the perspective of evaluation, I need to understand whether any of
>>> the solutions to better ranking in the absence of global IDF have been
>>> explored. I suspect that one could retrieve a set of results much larger
>>> than N from a set of shards and re-score them in some way that doesn't
>>> require IDF, e.g. storing both result sets in the same priority queue
>>> and *re-scoring* before *re-ranking*.
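One possible reading of "re-scoring before re-ranking" in the Aug 16 message is sketched below. It swaps in per-source min-max normalization as the IDF-free re-scoring step, which is an assumption of this sketch and not something either poster proposes; `normalize` and `blend` are hypothetical names. Whether normalized scores from unlike sources are actually comparable is exactly the objection raised in the Aug 18 reply, so treat this as a way to experiment, not a recommendation.

```python
import heapq


def normalize(hits):
    """Min-max scale one source's (doc, score) pairs into [0, 1]."""
    if not hits:
        return []
    scores = [score for _, score in hits]
    lo, hi = min(scores), max(scores)
    span = hi - lo
    # If every hit from a source has the same score, give each one 1.0.
    return [(doc, (score - lo) / span if span else 1.0) for doc, score in hits]


def blend(result_sets, n):
    """Re-score each source's hits, then re-rank them all in one queue."""
    pooled = [hit for hits in result_sets.values() for hit in normalize(hits)]
    return heapq.nlargest(n, pooled, key=lambda hit: hit[1])
```

Here `result_sets` maps a source name to the oversized result list fetched from that shard, for example `blend({"articles": [...], "images": [...]}, 10)`.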
>>>
>>> The other way to do this would be to have a custom SearchHandler that
>>> works differently - it performs the query, retrieves all results deemed
>>> relevant by another engine, adds them to the Lucene index, and then
>>> performs the query again in the standard way. This would be quite slow,
>>> but perhaps useful as a way to evaluate my method.
>>>
>>> I still welcome any suggestions on how such a SearchHandler could be
>>> implemented.